Closed remicousin closed 5 months ago
If/when you're ready to make a shared zarrification library, you can put it in the top level of this repo alongside pingrid and controls.py, and symlink it into the subdirectories that use it. E.g. enacts/pingrid is a symlink to ../pingrid.
By the way, let's try to make sure all new zarrification scripts preserve or add appropriate CF standard_name metadata from now on.
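For illustration, the symlink layout described above could be created as in the following minimal sketch. The library name zarrlib and the helper link_shared_library are hypothetical, and the demo runs in a throwaway temporary directory rather than the real repo. In practice a plain `ln -s` from the shell does the same thing.

```python
import os
import tempfile


def link_shared_library(repo_root, subdir, library):
    """Create <subdir>/<library> as a relative symlink to ../<library>.

    Hypothetical helper: mirrors how enacts/pingrid points at ../pingrid.
    """
    link_path = os.path.join(repo_root, subdir, library)
    # A relative target keeps the link valid wherever the repo is cloned.
    os.symlink(os.path.join("..", library), link_path, target_is_directory=True)
    return link_path


# Demo in a temporary directory standing in for the repo root.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "zarrlib"))  # hypothetical shared library
    os.makedirs(os.path.join(root, "enacts"))   # a subdirectory that uses it
    link = link_shared_library(root, "enacts", "zarrlib")
    print(os.readlink(link))
```

Because the target is relative, the link is resolved from the subdirectory that contains it, so the same committed symlink works in every checkout.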
I will make it a common library outside of this PR, just for the sake of time and to have a zarr version asap for Sheen to work on. But I will try to do it sooner rather than later so that we don't linger with two versions meant to be the same.
Can you give an example of the CF conventions preservation/additions? Or how/what to check exactly? Or a workflow to run that check?
OK for extracting a library in another PR.
There might be multiple CF conventions to preserve, but the one I have in mind at the moment is a single attribute called standard_name. The full list of possible values is in the CF Standard Name Table. Both data variables and coordinate variables should have it in most cases. E.g.
https://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.EMC/.GEFSv12_CPC/.hindcast/weekly/.pr/ standard_name: precipitation_flux
https://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubC/.EMC/.GEFSv12/.hindcast/.weekly/.pr/.X/.standard_name/#expert standard_name: longitude
Set them in xarray like this:
ds['pr'].attrs['standard_name'] = 'precipitation_flux'
ds['pr']['X'].attrs['standard_name'] = 'longitude'
The original nc files have standard_names and, from a quick check, they are carried over for all types of variables to the xarray dataset that is then fed to to_zarr. They are correct for the coordinates (time, lon, lat). I haven't checked whether they are correct for the actual data variables.
I am not sure how to automate this whole process, and it might not be possible, but maybe a first step I can take now is to stop the zarrification process if at least one of the variables involved doesn't have a standard_name? Then for now, the data curator will have to adapt their code to add it, and maybe later we can figure out some more involved automation?
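A minimal sketch of that first step, not the code in this PR: the function names missing_standard_names and require_standard_names are hypothetical. It only assumes that the dataset exposes a variables mapping (data and coordinate variables alike) whose values carry an attrs dict, as an xarray Dataset does.

```python
def missing_standard_names(ds):
    """Return the sorted names of variables with no non-empty standard_name.

    Assumes ds.variables maps names to objects with an attrs dict, as
    xarray.Dataset.variables does (coordinates included).
    """
    return sorted(
        str(name)
        for name, var in ds.variables.items()
        if not var.attrs.get("standard_name")
    )


def require_standard_names(ds):
    """Raise ValueError if any variable in ds lacks a standard_name."""
    missing = missing_standard_names(ds)
    if missing:
        raise ValueError(
            "Refusing to zarrify: no standard_name on " + ", ".join(missing)
        )
```

Calling require_standard_names(ds) just before ds.to_zarr(...) would stop the run and name the offending variables. Note that it only checks that the attribute is present, not that its value is a valid CF standard name.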
That sounds fine.
This is what I plan to use to zarrify ISIMIP. It's a generalization of enactstozarr. The main difference is that for ENACTS, data files and data have the same time step, resulting in no time dimension in the nc files, whereas ISIMIP data is daily and split into 10-year files.
It seems to have worked for one variable, for one model, for one scenario. It remains to read that zarr back to see if it looks OK, and to add documentation for the new optional arguments that facilitate the generalization.
Generally speaking, I expect pepsico to use a lot of what is in enacts, but I can't make it part of enacts. So for now, I am copying and pasting things until we move the common parts one level up.
I think review can start, in spite of what I mentioned remains to be checked/done (Xandre and Jeff mostly as FYI).