intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

Send correct blocks. #53

Open danielballan opened 5 years ago

danielballan commented 5 years ago

This fix affects access via the server.

The client side constructs an xarray.Dataset backed by dask arrays with some chunking. When it loads data, it requests partitions specified by a variable name and a block "part", as in ('x', 0, 0, 1).

If, on the server side, the DataSourceMixin subclass is holding a plain numpy array, not a dask array, then it ignores the "part" and always sends the whole array for the requested variable.
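As a rough illustration of what honoring the "part" could look like for a plain numpy array, here is a minimal sketch. It is not the code in this PR: the helper names `block_slices` and `get_block` are hypothetical, and it assumes the chunks are available in dask's normalized form (one tuple of block lengths per dimension).

```python
import numpy as np


def block_slices(chunks, block_index):
    """Translate a block index into slices, given dask-style chunks.

    `chunks` is assumed to be a tuple of tuples of block lengths,
    one inner tuple per dimension (hypothetical input for this sketch).
    """
    slices = []
    for dim_chunks, i in zip(chunks, block_index):
        start = sum(dim_chunks[:i])  # offset of block i along this axis
        slices.append(slice(start, start + dim_chunks[i]))
    return tuple(slices)


def get_block(arr, chunks, part):
    """Return only the requested block of a plain numpy array."""
    # `part` looks like ('x', 0, 0, 1): variable name, then block index.
    _name, *block_index = part
    return arr[block_slices(chunks, block_index)]


arr = np.arange(24).reshape(2, 3, 4)
chunks = ((1, 1), (3,), (2, 2))
block = get_block(arr, chunks, ("x", 0, 0, 1))
```

With this, the server would send a `(1, 3, 2)` block for `('x', 0, 0, 1)` rather than the full `(2, 3, 4)` array.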

On the client side, this manifests as a mismatch between the dask array's shape (the shape of the data it expects) and the shape of the numpy array that it receives, leading to errors like

ValueError: replacement data must match the Variable's shape

> /sdcc/u/dallan/venv/test-databroker/lib64/python3.6/site-packages/xarray/core/variable.py(301)data()
    299         if data.shape != self.shape:
    300             raise ValueError(
--> 301                 "replacement data must match the Variable's shape")
    302         self._data = data
    303 

ipdb>  data.shape
(164, 1, 4000, 3840)
ipdb>  self.shape
(41, 1, 1000, 960)

where the data that arrives is larger than the data expected.
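A quick check on the two shapes from the debugger session supports this reading. The sketch below only does arithmetic on those shapes. That the received array is the whole variable, and that the chunks are uniform, is an assumption, but the exact integer ratios are consistent with it.

```python
received = (164, 1, 4000, 3840)  # data.shape from the ipdb session
expected = (41, 1, 1000, 960)    # self.shape from the ipdb session

# How many expected-size blocks fit along each axis of what arrived.
ratios = tuple(r // e for r, e in zip(received, expected))
remainders = tuple(r % e for r, e in zip(received, expected))
```

Here `ratios` is `(4, 1, 4, 4)` and every remainder is zero, so the received array is exactly 4 x 1 x 4 x 4 blocks of the expected size.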

I expect it's worth refining this to make it more efficient before merging, and it needs a test. This is just a request for comments and suggestions.

martindurant commented 5 years ago

I haven't had a chance to investigate the failure.

danielballan commented 5 years ago

The subclasses override _get_schema from the base class DataSourceMixin without calling super(), so self._chunks is never defined. It looks like there is a fair amount of copy-paste between the base class and its subclasses, so the easiest fix might be to remove the duplicated code and call super() instead. I can't get to this today, but I can revisit it later this week.
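To make the failure mode concrete, here is a minimal sketch with stand-in classes. `BaseSource`, `BrokenSource`, and `FixedSource` are hypothetical and not the actual intake-xarray code. The only thing taken from the discussion is the pattern: a base `_get_schema` that sets `self._chunks`, and a subclass override that either skips or calls `super()`.

```python
class BaseSource:
    """Stand-in for the base class; sets self._chunks in _get_schema."""

    def _get_schema(self):
        # Hypothetical chunk layout, just so there is something to record.
        self._chunks = {"x": ((2, 2),)}
        return {"chunks": self._chunks}


class BrokenSource(BaseSource):
    """Overrides _get_schema without super(), so _chunks is never set."""

    def _get_schema(self):
        return {"extra": True}


class FixedSource(BaseSource):
    """Extends the base schema instead of replacing it."""

    def _get_schema(self):
        schema = super()._get_schema()  # runs the base logic, sets _chunks
        schema["extra"] = True
        return schema


broken = BrokenSource()
broken._get_schema()
fixed = FixedSource()
fixed_schema = fixed._get_schema()
```

After these calls, `fixed` has a `_chunks` attribute and `broken` does not, which is the situation described above.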