leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

Can we configure the 'engine' argument in the displayed xarray/python repr? #150

Closed norlandrhagen closed 1 month ago

norlandrhagen commented 2 months ago

I've been thinking about whether we will add any VirtualiZarr reference datasets to the catalog. It would be nice if we could override the default `engine="zarr"` with a value from either the meta.yaml or the catalog.yaml.


ds = xr.open_dataset(<reference_file_url>, engine="kerchunk", chunks={})

jbusecke commented 1 month ago

I added the high-priority label, since I think this will be quite important for supporting upcoming virtualized datasets like https://github.com/leap-stc/data-management/issues/118 and others.

from either the meta.yaml or the catalog.yaml.

Seems to me this is entirely a catalog 'feature', so I would vote for catalog.yaml

andersy005 commented 1 month ago

ds = xr.open_dataset(<reference_file_url>, engine="kerchunk", chunks={})

@norlandrhagen, i'm curious... is engine='kerchunk' all you need to load a reference file as an xarray dataset? i'm trying to figure out what other changes i need to make in https://github.com/carbonplan/html-reprs/blob/094a7992cba029ce284f031623872e624bfafc48/src/app.py#L96

norlandrhagen commented 1 month ago

Yup! I think if Kerchunk is installed in the env, it should work.

Maybe we can default to engine='zarr' and then, if engine_type exists as a field in catalog.yaml, we use that value instead?

Or we could backfill all the catalog.yaml files.

Thanks @andersy005!
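
For illustration, the fallback described above could look roughly like the sketch below. It is only a sketch under assumptions: the `engine_type` and `url` keys and the `build_open_snippet` helper are hypothetical names, not the actual catalog.yaml schema or anything that exists in html-reprs.

```python
from typing import Any, Mapping

# Engine shown when a catalog entry does not specify one.
DEFAULT_ENGINE = "zarr"


def build_open_snippet(entry: Mapping[str, Any]) -> str:
    """Return the xarray snippet to display for one catalog entry.

    `entry` stands in for a parsed catalog.yaml store record. The `url`
    and `engine_type` keys are assumptions made for this example.
    """
    url = entry["url"]
    # Fall back to the default when the field is missing or empty.
    engine = entry.get("engine_type") or DEFAULT_ENGINE
    return f'ds = xr.open_dataset("{url}", engine="{engine}", chunks={{}})'


if __name__ == "__main__":
    # Entry without the optional field: the default engine is used.
    print(build_open_snippet({"url": "gs://bucket/store.zarr"}))
    # Entry for a virtual reference dataset: the engine is overridden.
    print(build_open_snippet({"url": "refs.parquet", "engine_type": "kerchunk"}))
```

With this approach existing catalog.yaml files would not need to be backfilled, since entries without the field keep the current behavior.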

andersy005 commented 1 month ago

perfect! can you point me to existing stores i can use for testing purposes?

norlandrhagen commented 1 month ago

Totally!


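import xarray as xr

# Kerchunk reference store (parquet) hosted on OSN
store = 'https://rice1.osn.mghpcc.org/carbonplan/virtual_datasets/gridmet/gridmet_1979_2020.parquet'

# Requires kerchunk to be installed so xarray can find the "kerchunk" engine
ds = xr.open_dataset(store, engine="kerchunk", chunks={})
ds
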
norlandrhagen commented 1 month ago

I can also put a reference on the leap osn bucket or the LEAP google storage if that helps.

andersy005 commented 1 month ago

thank you, @norlandrhagen! i was able to use this for testing purposes

jbusecke commented 1 month ago

Thanks folks. Great to see all of these improvements moving quickly!