Open · Peter9192 opened this issue 2 weeks ago
Hello,

I was confused that the following snippet returned a "normal" `xarray` object instead of one backed by dask arrays:
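(A minimal sketch of the call in question; the catalog URL and dataset name are illustrative, not necessarily the exact ones from the original snippet:)

```python
import intake

# Open the nextGEMS intake catalog; URL and entry name are placeholders.
cat = intake.open_catalog("https://data.nextgems-h2020.eu/catalog.yaml")
ds = cat.ICON["ngc3028"].to_dask()

# Despite the method name, the variables of `ds` are plain (non-dask)
# arrays here, because the catalog entry sets `chunks: null`.
print(ds)
```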
I think the expectation of a method called `to_dask()` is that it returns a dask-backed xarray object.

In `xarray.open_zarr`, the default for `chunks` is set to `auto`, and passing `None` explicitly prohibits the use of dask. So this specific behaviour is due to explicitly setting `chunks: null` in the catalogs, e.g. here:

https://github.com/nextGEMS/catalog/blob/8c3f5f513fac93d8fcb3b1646a5b6cfc93824b14/ICON/main.yaml#L188

I think it would make sense to change these to `auto`, or to `{}`, as that seems to better respect the original chunking of the zarr store. Or, if possible, perhaps omit the setting altogether so as to follow the default behaviour of xarray's zarr backend? But perhaps there are specific reasons for this that I'm not aware of?
Yes, it's confusing. The method shouldn't be called `to_dask()`, but rather `to_dataset()`. This, however, won't change anymore, as intake now has a new major version (which we probably don't want to use, but which also means that the old version won't receive many new features anymore). We tried to counteract the confusion by documenting it in the introduction of the dataset on easygems.
There are good reasons for both variants (`null` and `auto`), but I think there's no clear agreement on which is better (it depends on the use case).
`chunks: null` will result in using `LazilyIndexedArray`, a very lightweight (in terms of additional metadata) lazy array where only indexing, not computation, is carried out lazily. This is well suited for subselecting and inspecting data, and will be faster and use less memory in those cases. It also has some advantages when fetching multiple chunks asynchronously. It will, however, fail quickly if you use it for cases in which lazy computation is required in addition to lazy indexing.
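As a rough sketch of what that means in practice (store path and variable name are placeholders):

```python
import xarray as xr

# chunks=None (i.e. `chunks: null` in the catalog): no dask involved,
# variables are lazily *indexed*, not lazily *computed*.
ds = xr.open_dataset("some/store.zarr", engine="zarr", chunks=None)

subset = ds["tas"].isel(time=slice(0, 10))  # cheap: indexing stays lazy
values = subset.values                      # data is only read from storage here

# There is no deferred computation, though: this loads the variable
# into memory and computes eagerly.
anomaly = ds["tas"] - ds["tas"].mean("time")
```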
`chunks: auto` will lead to xarray using dask, with one dask chunk per storage chunk. Using dask has the advantage that computations, too, are carried out lazily, but it comes at a large additional cost for the task graph that dask has to create, so for simple exploration of the datasets this can be overly expensive. Furthermore, we usually want storage chunks on the order of 1 MB, but dask prefers in-memory chunks on the order of 100 MB. This leads to the problem of deciding how to rechunk data on load (e.g. stack chunks together in time and/or in space?). The best option depends on the analysis use case, and we can't and shouldn't decide that at the catalog level (an exception might be if we provided separate catalogs for different use cases). So when using dask, it is actually better (performance-wise) to specify a good chunking explicitly as a user (like `chunks={'time': 24*7}` for weekly chunks of an hourly dataset). Note that you really want to specify those chunks upon opening the dataset, not afterwards using `.chunk(...)`, in order to avoid creating an overly large task graph.

So there's a choice between not using dask and using dask, but when using dask, it's usually better to make a conscious decision on how to chunk (there may be a conscious decision to use `auto`, but that comes with a performance penalty users should be aware of). Out of this reasoning, one additional hope of using `null` in the definition of the catalog has been that users will notice that this doesn't use dask and will ask and learn about why :-)
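A sketch of the dask side, with the same placeholder names:

```python
import xarray as xr

# chunks="auto": dask-backed, chunking derived from the storage chunks;
# convenient, but can mean many tiny chunks and a large task graph.
ds_auto = xr.open_dataset("some/store.zarr", engine="zarr", chunks="auto")

# Usually better: choose the chunking at open time to match the analysis,
# e.g. weekly chunks of an hourly dataset, instead of calling .chunk() later.
ds = xr.open_dataset("some/store.zarr", engine="zarr", chunks={"time": 24 * 7})

weekly = ds["tas"].resample(time="1W").mean()  # builds a lazy dask graph
result = weekly.compute()                      # the actual work happens here
```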
Note that `chunks="auto"` is the default behaviour of `open_zarr`, while `chunks=None` is the default behaviour of `open_dataset`. `open_zarr` is kind of a legacy function (it was implemented in the early days of zarr and is now used in many places, so people rely on it); the recommended way to open zarr data is `open_dataset`.
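In terms of calls (paths again placeholders), the two entry points compare like this:

```python
import xarray as xr

# Legacy entry point: dask-backed by default (chunks="auto").
ds_legacy = xr.open_zarr("some/store.zarr")

# Recommended entry point: no dask by default (chunks=None) ...
ds_plain = xr.open_dataset("some/store.zarr", engine="zarr")

# ... with dask opted into explicitly when wanted.
ds_dask = xr.open_dataset("some/store.zarr", engine="zarr", chunks="auto")
```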
Okay, that makes sense. Thanks!