openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting data
https://nowcasting-dataset.readthedocs.io/en/stable/
MIT License

Experiment with better compression for on-disk batches #500

Open JackKelly opened 2 years ago

JackKelly commented 2 years ago

Detailed Description

For example, pbzip2 reduces our NWP batches to 20% of their original size. Hopefully we can achieve similar reductions using "proper" NetCDF compression algorithms.

Smaller batches should be faster to load and easier to upload to public cloud / Lancium / etc.

Related issues

Also, if we do find better compression, then we should probably use it for our intermediate zarrs, too.

Not urgent.

peterdudfield commented 2 years ago

I remember that using something like 'gzip' made the files smaller, but they then took longer to load. See http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html. I'm not sure on the right balance here.

Also, from some general searching (not sure how useful this is): https://www.unidata.ucar.edu/blogs/developer/entry/netcdf_compression

JackKelly commented 2 years ago

> I'm not sure on the right balance here

yeah, I think the only way to tell is to do a bunch of experiments
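
A rough sketch of the kind of experiment I mean, using only the Python standard library. It compares general-purpose byte-stream codecs, not NetCDF's built-in filters, so treat it as a template for the measurement rather than a result. The payload is synthetic (an assumption, not a real batch), and `bz2` stands in for pbzip2, which is a parallel implementation of the same bzip2 algorithm.

```python
import array
import bz2
import lzma
import random
import time
import zlib

# Synthetic payload: 100,000 float32 values drawn from a small set of levels.
# Real batches will have different statistics, so the numbers will differ.
random.seed(0)
values = array.array("f", (float(random.randint(0, 49)) for _ in range(100_000)))
payload = values.tobytes()

CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(payload, codecs=CODECS):
    """Return the compressed-size ratio and decompression time for each codec."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        blob = compress(payload)
        start = time.perf_counter()
        restored = decompress(blob)
        elapsed = time.perf_counter() - start
        if restored != payload:
            raise ValueError(f"{name} did not round-trip the payload")
        results[name] = {
            "ratio": len(blob) / len(payload),
            "decompress_seconds": elapsed,
        }
    return results


results = benchmark(payload)
for name, stats in results.items():
    print(
        f"{name}: {stats['ratio']:.1%} of original size, "
        f"{stats['decompress_seconds'] * 1e3:.2f} ms to decompress"
    )
```

Swapping the synthetic payload for the bytes of a real batch file, and repeating each timing several times, would give numbers we can actually compare against load time.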

JackKelly commented 2 years ago

tbh I wouldn't worry about better on-disk compression for v16. The compression we have now is fine, IMHO.