mikapfl opened 1 year ago
I set high priority for the test described above. If it turns out to be more work, the priority is medium, since currently I can usually avoid memory issues by reducing dimensionality early.
Because sparse data was not originally a design goal for primap2, I expect this to be quite some work. If the xarray drop-in I highlighted above works, we still take on the maintenance burden of using experimental functionality in production. If the drop-in doesn't work, we probably have to fix things in xarray itself.
I think it would be reasonable to test the experimental functionality a bit, get a feel for how experimental it truly is, and then decide what we will do.
Another strategy for solving this problem is a "divide and conquer" approach. We do this already when processing, e.g., by country, so that dimensions don't blow up. There is now an interesting project within the xarray ecosystem which provides tools to do exactly that: https://github.com/xarray-contrib/datatree . One of their examples is CMIP6, where each model has slightly different variables, different resolution, etc. Using datatree, the data is grouped into one datatree consisting of multiple xarray datasets; each dataset uses its own dimensions, but operations on the whole datatree are still possible.
Worth checking out whether this is a viable solution for us.
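To make this concrete, here is a minimal sketch of what this could look like, assuming the standalone datatree package; the datasets, node names, and dimensions are made up for illustration:

```python
import numpy as np
import xarray as xr
from datatree import DataTree  # xarray-contrib/datatree

# Two hypothetical datasets with different dimensions, as in the CMIP6
# example: they could not live in a single xarray Dataset without the
# dimensions blowing up.
ds_a = xr.Dataset(
    {"emissions": (("time", "area"), np.random.rand(10, 3))},
    coords={"time": np.arange(2000, 2010), "area": ["DEU", "FRA", "POL"]},
)
ds_b = xr.Dataset(
    {"emissions": (("time", "category"), np.random.rand(5, 2))},
    coords={"time": np.arange(2000, 2005), "category": ["1.A", "1.B"]},
)

# Group both into one tree; each node keeps its own dimensions.
dt = DataTree.from_dict({"source_a": ds_a, "source_b": ds_b})

# Operations are mapped over all nodes of the tree.
annual_mean = dt.mean(dim="time")
```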
Definitely worth looking at. It would introduce a further distinction between types of metadata (nodes (e.g. source, scenario, and model), variables (entity), dimensions (category, time, ...)) making some functionality more complicated to implement, but keeping large datasets as one object without having to manually split datasets might make up for that.
At least for the "primap database" with all bells and whistles, with data from CRF, EDGAR, etc. all in the database, we need something which is higher-level than an xarray dataset anyway, I think. But that is not a current problem (while the sparseness problem is a current problem).
Another option to consider: copy/vendor something out of scmdata.
These scmdata docs show how we deal with the sparsity issue, in a slightly opaque way (search for 'sparse' to find the most relevant section): https://scmdata.readthedocs.io/en/latest/notebooks/netcdf.html#splitting-your-data
A better example is probably the one below. The key output is that the array is 10 times smaller and has 11 times fewer NaNs when you make it sparse than when it's non-sparse. I think we should consider how we can do the same trick with primap data to avoid the memory issues we've discussed. The scmdata implementation code is here (lots of steps, but not terribly written): https://github.com/openscm/scmdata/blob/7813392bf974ce9454975423518c9249200cd1f9/src/scmdata/_xarray.py#L11C45-L11C51
We'd have to test how it actually works, but it might 'just work' as xarray does some smart things with collapsing dimensions and multi-indexing in my experience.
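For illustration, a minimal sketch (with made-up data) of the kind of collapsing xarray can do with `stack()` and a MultiIndex; this is just the underlying idea, not scmdata's actual implementation:

```python
import numpy as np
import xarray as xr

# Hypothetical sparse dataset: only 2 of 12 (scenario, model)
# combinations hold data, the rest is all-NaN.
data = np.full((3, 4, 10), np.nan)
data[0, 1] = 1.0
data[2, 3] = 2.0
ds = xr.Dataset(
    {"emissions": (("scenario", "model", "time"), data)},
    coords={
        "scenario": ["ssp119", "ssp370", "ssp585"],
        "model": ["a", "b", "c", "d"],
        "time": np.arange(2000, 2010),
    },
)

# Collapse the sparse dimensions into one multi-indexed dimension and
# drop the all-NaN combinations: 12 (scenario, model) slices shrink to 2.
stacked = ds.stack(key=("scenario", "model"))
dense = stacked.dropna("key", how="all")

# The original shape can be recovered, reintroducing the NaNs.
roundtrip = dense.unstack("key")
```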
@znicholls Nice, thanks
just to understand the strategy, I'll try to summarize it and you can tell me if I misunderstood: instead of keeping all sparse metadata columns as separate xarray dimensions, you collapse them into a single dimension identified by a primary key, and keep the original metadata as additional coordinates along that dimension. Right?
That's quite nifty. If you use `set_xindex()` to add an index on the primary key, you can even select efficiently and easily on the sparse coordinates:
```python
In [16]: non_sparse_xr = non_sparse_xr.set_xindex("paraset_id")

In [17]: non_sparse_xr.loc[{"paraset_id": 8}]
Out[17]:
<xarray.Dataset>
Dimensions:                         (time: 100, run_id: 9, scenario: 3)
Coordinates:
  * time                            (time) object 2000-01-01 00:00:00 ... 209...
  * run_id                          (run_id) int64 24 25 26 54 55 56 84 85 86
  * scenario                        (scenario) object 'ssp119' 'ssp370' 'ssp585'
  * paraset_id                      (run_id) int64 8 8 8 8 8 8 8 8 8
Data variables:
    Atmospheric Concentrations|CO2  (scenario, run_id, time) float64 nan ... nan
    Radiative Forcing               (scenario, run_id, time) float64 nan ... ...
    Surface Temperature             (scenario, run_id, time) float64 0.0 ... nan
Attributes:
    scmdata_metadata_region:  World
    scmdata_metadata_model:   example
```
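For reference, a sketch of how such a dataset could be set up; the construction and the paraset_id pattern are guessed to mirror the printed output, only `set_xindex` and the `.loc` selection are taken from above:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical setup: 90 runs in groups of 3, with the paraset_id
# pattern repeating every 30 runs.
ds = xr.Dataset(
    {"Surface Temperature": (("scenario", "run_id", "time"), np.zeros((3, 90, 100)))},
    coords={
        "scenario": ["ssp119", "ssp370", "ssp585"],
        "run_id": np.arange(90),
        "time": pd.date_range("2000-01-01", periods=100, freq="YS"),
    },
)

# Attach paraset_id as a non-dimension coordinate along run_id ...
ds = ds.assign_coords(paraset_id=("run_id", (np.arange(90) // 3) % 10))
# ... and build an index on it, so that .loc / .sel can use it directly.
ds = ds.set_xindex("paraset_id")

# Selects runs 24 25 26 54 55 56 84 85 86, as in the output above.
subset = ds.loc[{"paraset_id": 8}]
```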
This could honestly be a low-effort route that we can drop in with only 1 or 2 days of work and see how it works out in daily usage.
@JGuetschow do you have a nice sparse dataset I can play with? I can of course easily build an artificial sparse dataset, but if you happen to just have something around which is real data and sparse, I could use that and be closer to real usage than with something artificially generated.
Any of the `output/BUR_NIR_combined_all_XXXX` datasets in the UNFCCC_data repo (it's private, but you should have access).
> Right?

You got it (and whether that primary key is another dimension or something we add automatically doesn't really matter, but from memory it was a bit of mucking around to automate).
> If you use `set_xindex()` to add an index on the primary key, you can even select efficiently and easily on the sparse coordinates

That's cool. Maybe xarray has thought about this problem too and has an even better solution; I don't think that API existed when we first wrote our implementation, and it seems quite specific to this kind of handling...
> This could honestly be a low-effort route that we can drop in with only 1 or 2 days of work and see how it works out in daily usage
Nice
So I played around a little with this while waiting for Excel things in ndcs land.
It works in principle, but there are definitely some things to work out still:

* Setting data via [`da.pr.set`](https://primap2.readthedocs.io/en/stable/usage.html#Setting-data) and selecting via `da.pr.loc` do not work yet, because they operate on dimensions, not coordinates. Probably rather easy to fix.
* We need more efficient handling of sparse data. Having to think about sparseness for every dataset is worse than having algorithms which handle sparse data transparently.
* In xarray, this functionality is still experimental (https://docs.xarray.dev/en/stable/internals/duck-arrays-integration.html?highlight=sparse#integrating-with-duck-arrays and https://github.com/pydata/xarray/blob/main/xarray/tests/test_sparse.py), but it could be pretty drop-in. Maybe we should see if we can make a primap array sparse and check whether our tests still pass; if they do, that would be great! A sketch of the idea follows below.
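A minimal sketch (with made-up data) of wrapping a DataArray's contents in a sparse.COO duck array, roughly along the lines of xarray's test_sparse.py; the array contents and sparsity level are invented:

```python
import numpy as np
import sparse
import xarray as xr

# Hypothetical dense array where ~99% of the entries are NaN.
dense = xr.DataArray(
    np.where(np.random.rand(100, 100) > 0.99, 1.0, np.nan),
    dims=("category", "area"),
)

# Wrap the data in a sparse duck array; with NaN as the fill value,
# only the ~1% of real values are stored.
sp = dense.copy(data=sparse.COO.from_numpy(dense.values, fill_value=np.nan))

print(type(sp.data))   # sparse COO array instead of numpy
print(sp.data.nbytes)  # much smaller than the 80 kB dense array

# Densify again when a numpy-backed array is needed.
back = sp.copy(data=sp.data.todense())
```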