Open · Peter9192 opened this issue 2 weeks ago
Hello,

I was confused that the following snippet returned a "normal" `xarray` object instead of one backed by dask arrays:
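(A minimal sketch of the call in question; the catalog URL and dataset name are illustrative, not necessarily the exact ones from the original snippet:)

```python
import intake

# Open the nextGEMS intake catalog; URL and entry name are placeholders.
cat = intake.open_catalog("https://data.nextgems-h2020.eu/catalog.yaml")
ds = cat.ICON["ngc3028"].to_dask()

# Despite the method name, the variables of `ds` are plain (non-dask)
# arrays here, because the catalog entry sets `chunks: null`.
print(ds)
```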
I think the expectation of a method called `to_dask()` is that it returns a dask-backed xarray object.

In `xarray.open_zarr`, the default for `chunks` is set to `auto`, and passing `None` explicitly prohibits the use of dask. So this specific behaviour is due to explicitly setting `chunks: null` in the catalogs, e.g. here:

https://github.com/nextGEMS/catalog/blob/8c3f5f513fac93d8fcb3b1646a5b6cfc93824b14/ICON/main.yaml#L188

I think it would make sense to change these to `auto`, or to `{}`, as that seems to better respect the original chunking of the zarr store. Or, if possible, perhaps omit the setting altogether so as to follow the default behaviour of xarray's zarr backend? But perhaps there are specific reasons for this that I'm not aware of?
Yes, it's confusing. The method shouldn't be called `to_dask()`, but rather `to_dataset()`. This, however, won't change anymore, as intake now has a new major version (which we probably don't want to use, but which also means that the old version won't receive many new features anymore). We tried to counteract the confusion by documenting it in the introduction of the dataset on easygems.
There are good reasons for both variants (`null` and `auto`), but I think there's no clear agreement on which is better (it depends on the use case).
`chunks: null` will result in using `LazilyIndexedArray`, a very lightweight (in terms of additional metadata) lazy array where only indexing, not computation, is carried out lazily. This is well suited for subselecting and inspecting data, and will be faster and use less memory in those cases. It also has some advantages when fetching multiple chunks asynchronously. It will, however, fail quickly if you use it for cases in which lazy computation is required in addition to lazy indexing.
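As a rough sketch of what that means in practice (store path and variable name are placeholders):

```python
import xarray as xr

# chunks=None (i.e. `chunks: null` in the catalog): no dask involved,
# variables are lazily *indexed*, not lazily *computed*.
ds = xr.open_dataset("some/store.zarr", engine="zarr", chunks=None)

subset = ds["tas"].isel(time=slice(0, 10))  # cheap: indexing stays lazy
values = subset.values                      # data is only read from storage here

# There is no deferred computation, though: this loads the variable
# into memory and computes eagerly.
anomaly = ds["tas"] - ds["tas"].mean("time")
```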
`chunks: auto` will lead to xarray using dask, with one dask chunk per storage chunk. Using dask has the advantage that computations, too, are carried out lazily, but it comes at a large additional cost for the task graph that dask has to create, so for simple exploration of the datasets this can be overly expensive. Furthermore, we usually want storage chunks on the order of 1 MB, but dask prefers in-memory chunks on the order of 100 MB. This leads to the problem of deciding how to rechunk data on load (e.g. stack chunks together in time and/or in space?). The best option depends on the analysis use case, and we can't and shouldn't decide that at the catalog level (an exception might be if we provided separate catalogs for different use cases). So when using dask, it is actually better (performance-wise) to specify a good chunking explicitly as a user (like `chunks={'time': 24*7}` for weekly chunks of an hourly dataset). Note that you really want to specify those chunks upon opening the dataset, not afterwards using `.chunk(...)`, in order to avoid creating an overly large task graph.

So there's a choice between not using dask and using dask, but when using dask, it's usually better to make a conscious decision on how to chunk (there may be a conscious decision to use `auto`, but that comes with a performance penalty users should be aware of). Out of this reasoning, one additional hope of using `null` in the definition of the catalog has been that users will notice that this doesn't use dask and will ask and learn about why :-)
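A sketch of the dask side, with the same placeholder names:

```python
import xarray as xr

# chunks="auto": dask-backed, chunking derived from the storage chunks;
# convenient, but can mean many tiny chunks and a large task graph.
ds_auto = xr.open_dataset("some/store.zarr", engine="zarr", chunks="auto")

# Usually better: choose the chunking at open time to match the analysis,
# e.g. weekly chunks of an hourly dataset, instead of calling .chunk() later.
ds = xr.open_dataset("some/store.zarr", engine="zarr", chunks={"time": 24 * 7})

weekly = ds["tas"].resample(time="1W").mean()  # builds a lazy dask graph
result = weekly.compute()                      # the actual work happens here
```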
Note that `chunks="auto"` is the default behaviour of `open_zarr`, while `chunks=None` is the default behaviour of `open_dataset`. `open_zarr` is kind of a legacy function (it was implemented in the early days of zarr and is now used in many places, so people rely on it); the recommended way to open zarr data is `open_dataset`.
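In terms of calls (paths again placeholders), the two entry points compare like this:

```python
import xarray as xr

# Legacy entry point: dask-backed by default (chunks="auto").
ds_legacy = xr.open_zarr("some/store.zarr")

# Recommended entry point: no dask by default (chunks=None) ...
ds_plain = xr.open_dataset("some/store.zarr", engine="zarr")

# ... with dask opted into explicitly when wanted.
ds_dask = xr.open_dataset("some/store.zarr", engine="zarr", chunks="auto")
```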
Okay, that makes sense. Thanks!