pik-primap / primap2

The next generation of the PRIMAP climate policy analysis suite
https://primap2.readthedocs.io
Apache License 2.0

sparse data #133

Open mikapfl opened 1 year ago

mikapfl commented 1 year ago

We need more efficient handling of sparse data. Having to think about sparseness for every data set is worse than just having algorithms which handle sparse data transparently.

In xarray (https://docs.xarray.dev/en/stable/internals/duck-arrays-integration.html?highlight=sparse#integrating-with-duck-arrays and https://github.com/pydata/xarray/blob/main/xarray/tests/test_sparse.py) this functionality is still experimental, but it could be pretty drop-in. Maybe we should try making a primap2 array sparse and see if our tests still pass; if they do, that would be great!
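
For illustration, a minimal sketch of what the drop-in could look like, assuming the pydata/sparse package (the dimension names and toy data here are made up):

import numpy as np
import sparse
import xarray as xr

# A mostly-empty dense array: a good candidate for sparse storage.
dense = np.zeros((4, 3))
dense[0, 0] = 1.0

# Wrap a sparse.COO array directly in a DataArray; xarray treats it as a
# duck array and keeps it sparse through many operations.
da = xr.DataArray(sparse.COO.from_numpy(dense), dims=("area", "category"))

print(da.sum(dim="category").data)  # still a sparse.COO array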

JGuetschow commented 1 year ago

I set high priority for the test described above. If it's more work, the priority is medium, since I can currently usually avoid memory issues by reducing dimensionality early.

mikapfl commented 1 year ago

Because sparse data was not originally a design goal for primap2, I expect this to be quite some work. If the xarray drop-in I highlighted above works, we still take on the maintenance burden of using experimental functionality in production. If the drop-in doesn't work, we probably have to fix things in xarray itself.

I think it would be reasonable to test the experimental functionality a bit, get a feel for how experimental it truly is, and then decide what we will do.

mikapfl commented 1 year ago

Another strategy for solving this problem is a “divide and conquer” approach. We do this already when processing e.g. by country so that dimensions don't blow up. There is now an interesting project within the xarray ecosystem to provide tools to do exactly that: https://github.com/xarray-contrib/datatree . One of their examples is CMIP6, where each model has slightly different variables and different resolution etc. Using datatree, the data is grouped into one datatree consisting of multiple xarray datasets, where each dataset uses its own dimensions, but operations on the whole datatree are still possible.

Worth checking out whether this is a viable solution for us.
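
A minimal sketch of the idea, assuming the datatree package from that repo (the datasets and names here are invented):

import numpy as np
import xarray as xr
from datatree import DataTree

# Two datasets with different dimensions, as in the CMIP6 example.
ds_a = xr.Dataset(
    {"emissions": ("time", np.arange(3.0))},
    coords={"time": [2000, 2001, 2002]},
)
ds_b = xr.Dataset(
    {"emissions": (("time", "category"), np.ones((2, 4)))},
    coords={"time": [2000, 2001], "category": list("ABCD")},
)

# Group them into one tree; each node keeps its own dimensions, but
# operations map over the whole tree node by node.
dt = DataTree.from_dict({"source_a": ds_a, "source_b": ds_b})
print(dt.mean(dim="time"))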

JGuetschow commented 1 year ago

Definitely worth looking at. It would introduce a further distinction between types of metadata: nodes (e.g. source, scenario, and model), variables (entity), and dimensions (category, time, ...). That would make some functionality more complicated to implement, but keeping large datasets as one object without having to split them manually might make up for that.

mikapfl commented 1 year ago

At least for the "primap database" with all bells and whistles, with data from CRF, EDGAR, etc. all in the database, we need something which is higher-level than an xarray dataset anyway, I think. But that is not a current problem (while the sparseness problem is a current problem).

znicholls commented 7 months ago

Another option to consider: copy/vendor something out of scmdata.

These scmdata docs show how we deal with the sparsity issue, in a slightly opaque way (search for 'sparse' to find the most relevant section): https://scmdata.readthedocs.io/en/latest/notebooks/netcdf.html#splitting-your-data

A better example is probably the one below. The key output is that the array is 10 times smaller and has 11 times fewer NaNs when you make it sparse. I think we should consider how we can do the same trick with primap data to avoid the memory issues we've discussed. The scmdata implementation code is here: https://github.com/openscm/scmdata/blob/7813392bf974ce9454975423518c9249200cd1f9/src/scmdata/_xarray.py#L11C45-L11C51 (lots of steps, but not terribly written)

We'd have to test how it actually works, but it might 'just work', as in my experience xarray does some smart things with collapsing dimensions and multi-indexing.

Example code with scmdata
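
For a feel of the dimension-collapsing idea in plain xarray (a rough sketch of my own, not the scmdata implementation): stack the sparse dimensions into a single MultiIndex dimension and drop the all-NaN combinations.

import numpy as np
import xarray as xr

# scenario x model x time, with only two scenario/model combinations filled.
data = np.full((3, 4, 5), np.nan)
data[0, 1, :] = 1.0
data[2, 3, :] = 2.0
da = xr.DataArray(
    data,
    dims=("scenario", "model", "time"),
    coords={"scenario": ["s1", "s2", "s3"], "model": list("abcd")},
)

# Stack the sparse dimensions into one MultiIndex dimension and keep only
# the combinations which actually hold data.
filled = da.stack(combo=("scenario", "model")).dropna("combo", how="all")
print(dict(filled.sizes))  # {'time': 5, 'combo': 2} instead of 12 combos
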
mikapfl commented 7 months ago

@znicholls Nice, thanks

Just to make sure I understand the strategy, I'll try to summarize it, and you can tell me if I misunderstood:

  1. We identify the sparse dimensions. Generally, a sparse dimension is one where projecting it out significantly decreases the relative sparsity of the array.
  2. Then, we identify all combinations of the sparse dimensions which are actually filled with data.
  3. We assign a primary key to these data points; the only important thing is that the values are unique, otherwise we don't care about the contents of the primary key. Small integers are probably a good choice for readability, but UUIDs would also be fine technically.
  4. We now use this primary key as the dimension in the xarray.DataArray, and demote the other dimensions it replaces to mere coordinates which use the new primary key as their dimension (see the sketch below).

Right?
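
For concreteness, a minimal sketch of steps 2 to 4 as I understand them (my own illustration, not the scmdata implementation; the coordinate names are made up to match the example output below):

import numpy as np
import pandas as pd
import xarray as xr

# Step 2: the combinations of the sparse dimensions actually filled with data.
filled = pd.DataFrame(
    {"scenario": ["ssp119", "ssp119", "ssp585"], "model": ["a", "b", "a"]}
)

# Step 3: assign a unique primary key; small integers for readability.
filled["paraset_id"] = np.arange(len(filled))

# Step 4: use the primary key as the dimension, and demote scenario and
# model to mere coordinates along it.
ds = xr.Dataset(
    {"value": ("paraset_id", np.random.rand(len(filled)))},
    coords={
        "paraset_id": ("paraset_id", filled["paraset_id"]),
        "scenario": ("paraset_id", filled["scenario"]),
        "model": ("paraset_id", filled["model"]),
    },
)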

That's quite nifty. If you use set_xindex() to add an index on the primary key, you can even select efficiently and easily on the sparse coordinates:

In [16]: non_sparse_xr = non_sparse_xr.set_xindex("paraset_id")

In [17]: non_sparse_xr.loc[{"paraset_id": 8}]
Out[17]: 
<xarray.Dataset>
Dimensions:                         (time: 100, run_id: 9, scenario: 3)
Coordinates:
  * time                            (time) object 2000-01-01 00:00:00 ... 209...
  * run_id                          (run_id) int64 24 25 26 54 55 56 84 85 86
  * scenario                        (scenario) object 'ssp119' 'ssp370' 'ssp585'
  * paraset_id                      (run_id) int64 8 8 8 8 8 8 8 8 8
Data variables:
    Atmospheric Concentrations|CO2  (scenario, run_id, time) float64 nan ... nan
    Radiative Forcing               (scenario, run_id, time) float64 nan ... ...
    Surface Temperature             (scenario, run_id, time) float64 0.0 ... nan
Attributes:
    scmdata_metadata_region:  World
    scmdata_metadata_model:   example

This could honestly be a low-effort route that we can drop in with only 1 or 2 days of work and see how it works out in daily usage.

mikapfl commented 7 months ago

@JGuetschow do you have a nice sparse dataset I can play with? I can of course easily build an artificial sparse dataset, but if you happen to have real, sparse data lying around, I could use that and be closer to real usage than something artificially generated.
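
In case it helps in the meantime, this is roughly how I'd generate an artificial one (a throwaway sketch; dimension names and sizes are made up):

import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# country x category x time, with only ~10% of the country/category
# combinations actually filled.
data = np.full((50, 100, 30), np.nan)
mask = rng.random((50, 100)) < 0.1
data[mask, :] = rng.random((mask.sum(), 30))

ds = xr.Dataset({"CO2": (("area", "category", "time"), data)})
print(float(ds["CO2"].isnull().mean()))  # ~0.9, i.e. quite sparse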

JGuetschow commented 7 months ago

Any of the output/BUR_NIR_combined_all_XXXX datasets in the UNFCCC_data repo (it's private, but you should have access).

znicholls commented 7 months ago

> Right?

You got it (and whether that primary key is an existing dimension or something we add automatically doesn't really matter, but from memory it was a bit of mucking around to automate).

> If you use set_xindex() to add an index on the primary key, you can even select efficiently and easily on the sparse coordinates

That's cool. Maybe the xarray developers have thought about this problem too and have an even better solution: I don't think that API existed when we first wrote our implementation, and it seems quite specific to this kind of handling...

> This could honestly be a low-effort route that we can drop in with only 1 or 2 days of work and see how it works out in daily usage

Nice

mikapfl commented 5 months ago

So I played around a little with this while waiting for Excel things in ndcs land.

It works in principle, but there are definitely some things to work out still: