pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Should performance be equivalent when opening with chunks or re-chunking a dataset? #3486


rafa-guedes commented 5 years ago

I was wondering whether chunking behaviour should be expected to be equivalent in two different use cases:

(1) when opening a dataset using the chunks option; (2) when re-chunking an existing dataset using the Dataset.chunk method.

I'm interested in slicing performance across different dimensions. In my case the performance is quite different between the two; please see the examples below.

Open the dataset with a single chunk along the station dimension (fast when slicing one time step)

In [1]: import xarray as xr

In [2]: dset = xr.open_dataset( 
    ...: "/source/wavespectra/tests/sample_files/spec20170101T00_spec.nc", 
    ...: chunks={"station": None}
    ...: )

In [3]: dset
Out[3]: 
<xarray.Dataset>
Dimensions:       (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
  * time          (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station       (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency     (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction     (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude     (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    latitude      (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    efth          (time, station, frequency, direction) float32 dask.array<chunksize=(249, 14048, 25, 24), meta=np.ndarray>

In [4]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 171 ms, sys: 49.2 ms, total: 220 ms
Wall time: 219 ms

Open the dataset with many size-1 chunks along the station dimension (fast when slicing one station, slow when slicing one time step)

In [5]: dset = xr.open_dataset( 
    ...: "/source/wavespectra/tests/sample_files/spec20170101T00_spec.nc", 
    ...: chunks={"station": 1}
    ...: )

In [6]: dset
Out[6]: 
<xarray.Dataset>
Dimensions:       (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
  * time          (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station       (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency     (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction     (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude     (time, station) float32 dask.array<chunksize=(249, 1), meta=np.ndarray>
    latitude      (time, station) float32 dask.array<chunksize=(249, 1), meta=np.ndarray>
    efth          (time, station, frequency, direction) float32 dask.array<chunksize=(249, 1, 25, 24), meta=np.ndarray>

In [7]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 13.1 s, sys: 1.94 s, total: 15 s
Wall time: 11.1 s

Try re-chunking station into a single chunk (still slow when slicing one time step)

In [8]: dset = dset.chunk({"station": None})

In [9]: dset
Out[9]:
<xarray.Dataset>
Dimensions:       (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
  * time          (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station       (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency     (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction     (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude     (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    latitude      (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    efth          (time, station, frequency, direction) float32 dask.array<chunksize=(249, 14048, 25, 24), meta=np.ndarray>

In [10]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 9.06 s, sys: 1.13 s, total: 10.2 s
Wall time: 7.7 s
mullenkamp commented 3 years ago

This seems to be an ongoing problem (see also: Unexpected behaviour when chunking with multiple netcdf files in xarray/dask, and Performance of chunking in xarray / dask when opening and re-chunking a dataset) that has not been resolved, nor has feedback been provided.

I've been running into this problem while trying to handle netCDF files that are larger than my RAM. From my testing, chunks must be passed to open_mfdataset to be of any use; calling the .chunk() method on the dataset after opening seems to do nothing in this use case.
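
A minimal sketch of the two patterns I mean (the file pattern and chunk sizes here are placeholders, not from the original report):

import xarray as xr

# Pattern 1: choose the final chunking at open time, so each read task
# loads a full target chunk from disk.
dset = xr.open_mfdataset("data_*.nc", chunks={"station": None})

# Pattern 2: open first, then re-chunk. The initial per-file chunks are
# still created, and the re-chunk is layered on top of them.
dset = xr.open_mfdataset("data_*.nc").chunk({"station": None})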

dcherian commented 3 years ago

What happens is that dask first constructs chunks of the size specified in open_mfdataset, and then reshapes those into the new chunk sizes specified in the .chunk() call. The original read tasks stay in the graph; the re-chunk only adds more tasks on top.
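
One way to see this is to compare task-graph sizes (a sketch using the file from the example above; exact task counts will vary):

import xarray as xr

path = "spec20170101T00_spec.nc"  # file from the example above

# Chunked once at open time: roughly one read task per chunk.
ds1 = xr.open_dataset(path, chunks={"station": None})
print(len(ds1.latitude.data.__dask_graph__()))

# Opened with size-1 chunks, then re-chunked: the ~14k per-station read
# tasks remain in the graph, with merge tasks layered on top.
ds2 = xr.open_dataset(path, chunks={"station": 1}).chunk({"station": None})
print(len(ds2.latitude.data.__dask_graph__()))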

A similar behaviour is present for repeated chunk calls like .chunk().chunk(); these do not get optimized into a single chunk call yet, as the sketch below illustrates.
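
The same effect can be reproduced with plain dask arrays (a sketch; the shapes mirror the example above):

import dask.array as da

# Two successive rechunk calls remain separate graph layers; dask does
# not currently fuse them into one equivalent rechunk.
x = da.ones((249, 14048), chunks=(249, 1))
y = x.rechunk((249, 100)).rechunk((249, 14048))
print(list(y.__dask_graph__().layers))  # one set of layers per step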

So yes, you should pass appropriate chunk sizes to open_mfdataset.