iridl / python-maprooms

Dash maprooms and tools

Remic/pepsico #425

Closed remicousin closed 5 months ago

remicousin commented 5 months ago

This is what I plan to use to zarrify ISIMIP. It's a generalization of enactstozarr. The main difference is that for ENACTS, data files and data have the same time step, so the nc files have no time dimension, whereas ISIMIP data is daily and split into files of 10 years each.

It seems to have worked for one variable, for one model, for one scenario. What remains is to read that zarr back to see if it looks OK, and to add documentation for the new optional arguments that enable the generalization.

Generally speaking, I expect pepsico to use a lot of what is in enacts, but I can't make it part of enacts. So for now I am copying and pasting things until we move the common parts one level up.

Review can start, I think, despite what I mentioned remains to be checked/done (Xandre and Jeff mostly as FYI).

aaron-kaplan commented 5 months ago

If/when you're ready to make a shared zarrification library, you can put it in the top level of this repo alongside pingrid and controls.py, and symlink it into the subdirectories that use it. E.g. enacts/pingrid is a symlink to ../pingrid.

By the way, let's try to make sure all new zarrification scripts preserve or add appropriate CF standard_name metadata from now on.

remicousin commented 5 months ago

Will make it a common library outside of this PR, just for the sake of time and to have a zarr version asap for Sheen to work on. But I will try to do it sooner rather than later so that we don't linger with two versions meant to be the same.

Can you give an example of the CF conventions preservation/additions? Or how/what to check exactly? Or a workflow to run that check?

aaron-kaplan commented 5 months ago

OK for extracting a library in another PR.

There might be multiple CF conventions to preserve, but the one I have in mind at the moment is a single attribute called standard_name. The full list of possible values is here. Both data variables and coordinate variables should have it in most cases. E.g.

https://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.EMC/.GEFSv12_CPC/.hindcast/weekly/.pr/ standard_name: precipitation_flux

https://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubC/.EMC/.GEFSv12/.hindcast/.weekly/.pr/.X/.standard_name/#expert standard_name: longitude

Set them in xarray like this:

ds['pr'].attrs['standard_name'] = 'precipitation_flux'
ds['pr']['X'].attrs['standard_name'] = 'longitude'

remicousin commented 5 months ago

The original nc files have standard_names and, from a quick check, they are carried over for all types of variables to the xarray dataset that is then fed to to_zarr. They are correct for the coordinates (time, lon, lat). I haven't checked whether they are correct for the actual variables.

I am not sure how to automate this whole process, and it might not be possible, but maybe a first step I can take now is to stop the zarrification process if at least one of the variables involved doesn't have a standard_name? Then for now the data curator will have to adapt their code to add it, and maybe later we can figure out some more involved automation?
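
A minimal sketch of that first step, not the code in this PR. The function names missing_standard_names and require_standard_names are hypothetical. It assumes ds behaves like an xarray Dataset, i.e. ds.variables maps every data and coordinate variable name to an object with an attrs dict. It only checks that a standard_name is present, not that the value is a valid CF name.

```python
def missing_standard_names(ds):
    """Return the names of variables in `ds` with no non-empty standard_name.

    Assumes `ds` is Dataset-like: `ds.variables` is a mapping from name to
    an object with an `attrs` dict, covering data and coordinate variables.
    """
    return sorted(
        str(name)
        for name, var in ds.variables.items()
        if not var.attrs.get("standard_name")
    )


def require_standard_names(ds):
    """Raise ValueError if any variable lacks a standard_name, else return `ds`."""
    missing = missing_standard_names(ds)
    if missing:
        raise ValueError(
            "refusing to zarrify: no standard_name on " + ", ".join(missing)
        )
    return ds
```

Calling require_standard_names(ds) just before ds.to_zarr(...) would stop the run early and tell the data curator which variables need the attribute.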

aaron-kaplan commented 5 months ago

That sounds fine.