nextGEMS / catalog

Intake catalog for nextgems
5 stars 7 forks source link

Setting chunks to `null` is confusing when calling `to_dask` #98

Open Peter9192 opened 2 weeks ago

Peter9192 commented 2 weeks ago

Hello,

I was confused that the following snippet returned a "normal" xarray object instead of one backed by dask arrays:

cat = intake.open_catalog("https://data.nextgems-h2020.eu/catalog.yaml")
cat.ICON['ngc3028'](zoom=5, time='P1D').to_dask()
# returns a normal xarray dataset, without use of dask

# workaround: explicitly pass in chunks on init
cat.ICON['ngc3028'](zoom=5, time='P1D', chunks='auto').to_dask()
# returns a dask-backed xarray dataset

I think the expectation of a method called to_dask() is that it returns a dask-backed xarray object.

In xarray.open_zarr, the default for chunks is set to auto, and passing None explicitly prohibits the use of dask. So this specific behaviour is due to explictly setting chunks: null in the catalogs, e.g. here:

https://github.com/nextGEMS/catalog/blob/8c3f5f513fac93d8fcb3b1646a5b6cfc93824b14/ICON/main.yaml#L188

I think it would make sense to change these to auto, or {} as that seems to better respect the original chunking of the zarr store. Or, if possible, perhaps omit it altogether so as to follow the default behaviour of xarray's zarr backend?

But perhaps there is some specific reasons that I'm not aware of?

d70-t commented 2 weeks ago

Yes, it's confusing. The method name shouldn't be called to_dask(), but rather to_dataset() instead. This however won't change anymore, as intake now has a new major version (which we probably don't want to use, but which still means that the old version won't get that many new features anymore).

We tried to counteract the confusion by documenting it in the introduction of the dataset on easygems. There are good reasons for both variants (null and auto), but I think there's no clear agreement on which is better (it depends on the use case).

So there's a choice between not using dask and using dask, but when using dask, it's usually better to do a concious decision on how to chunk (there may be a concious decision to use auto, but that will come with a performance penalty users should be aware of). Out of this reasoning, one additional hope of using null in the definition of the catalog has been that users will notice that this doesn't use dask and will ask and learn about why :-)

d70-t commented 2 weeks ago

Note that using chunks="auto" is the default behaviour of open_zarr, while chunks=None is the default behaviour of open_dataset. open_zarr is kind of a legacy function (it has been implemented in the early days of zarr and is now used in some places, so people rely on it). The recommended way to open zarr is using open_dataset.

Peter9192 commented 2 weeks ago

Okay, that makes sense. Thanks!