leap-stc / climsim_feedstock

Apache License 2.0

Switch to virtual zarr generation (at least for low-res data). #9

Open jbusecke opened 1 month ago

jbusecke commented 1 month ago

I just confirmed that the newly uploaded expanded ClimSim version is compatible with creating virtual zarr references. I think it is absolutely worth tooling pgf-stages to virtualize these files.


minimal example

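import fsspec
import xarray as xr
from huggingface_hub import HfFileSystem

fs = HfFileSystem()

# One sample file from the original dataset and one from the expanded dataset
old_files = fs.glob("datasets/LEAP/ClimSim_low-res/train/0001-02/*0001-02-01-00000.nc")

new_files = fs.glob(
    "datasets/LEAP/ClimSim_low-res-expanded/train/0001-02/*.0001-02-01-02400.nc"
)

# Try to open each file from a file-like object and load it into memory
for files in [old_files, new_files]:
    for path in files:
        print(path)
        with fsspec.open("hf://" + path, mode="rb") as f:
            try:
                ds = xr.open_dataset(f).load()
                # print() instead of display(), which only exists in IPython/Jupyter
                print(ds)
            except Exception as err:
                # Report why the file could not be opened instead of hiding the error
                print(f"FAILED: {err!r}")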
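
As a possible next step, here is a rough sketch of generating virtual references for one of the new files. It assumes kerchunk as the reference generator and that the expanded files are NetCDF4/HDF5; neither is confirmed here, and the feedstock may end up using a different tool. The helper names `hf_url` and `virtualize` are made up for illustration.

```python
def hf_url(path: str) -> str:
    """Prefix a path returned by HfFileSystem.glob with the hf:// protocol."""
    return path if path.startswith("hf://") else "hf://" + path


def virtualize(path: str) -> dict:
    """Build kerchunk references for a single file.

    Assumption: the file is NetCDF4/HDF5 and kerchunk is installed.
    """
    # Imported lazily so hf_url() stays usable without the optional dependencies.
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    url = hf_url(path)
    with fsspec.open(url, mode="rb") as f:
        # translate() returns a dict of references pointing at byte ranges
        # inside the original file; no array data is copied.
        return SingleHdf5ToZarr(f, url).translate()
```

With `new_files` from the minimal example, `refs = virtualize(new_files[0])` would return the reference dict for the first expanded file, which could then be written out as JSON.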

zyhu-hu commented 1 month ago

Yes, the expanded training data contains all the features of the original ClimSim_low-res, plus additional input features that could be useful. My only concern is that the new expanded data is 3x the size of the original: each individual .nc file is now 6 MB, up from 2 MB.

jbusecke commented 1 month ago

Well, as long as Hugging Face will host them, maybe that is something to consider later?

SammyAgrawal commented 2 days ago

Just to confirm, the old files are incompatible but the new ones are, correct?