pydata / xarray

Slow performance of isel #2227

JohnMrziglod commented 6 years ago

Hi,

I am seeing very slow performance from Dataset.isel and DataArray.isel compared with the equivalent native numpy indexing. Do you know where this comes from?

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": ("time", np.arange(55_000_000))
    }, coords={
        "time": np.arange(55_000_000)
    }
)
time_filter = ds.time > 50_000

Select some values with DataArray.isel:

%timeit ds.a.isel(time=time_filter)

2.22 s ± 375 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Use the native numpy approach:

%timeit ds.a.values[time_filter]

163 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

xarray: 0.10.4
pandas: 0.23.0
numpy: 1.14.2
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.5.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.5
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.8.1
setuptools: 39.1.0
pip: 9.0.3
conda: None
pytest: 3.5.1
IPython: 6.4.0
sphinx: 1.7.4
rabernat commented 6 years ago

I don't have experience using isel with boolean indexing (although the docs on positional indexing claim it is supported). My guess is that the time is being spent aligning the indexer with the array, which is unnecessary here since you know they are already aligned. This is probably not the most efficient pattern for xarray.

Here's how I would recommend writing the query using label-based selection:

%timeit ds.a.sel(time=slice(50_001, None))
117 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
max-sixty commented 6 years ago

@rabernat that's a good solution when the selection is a slice.

When would it ever need to align a bool array? If you try to pass an array of unequal length, it doesn't work anyway:

In [12]: ds.a.isel(time=time_filter[:-1])

IndexError: Boolean array size 54999999 is used to index array with shape (55000000,).
JohnMrziglod commented 6 years ago

I am sorry @rabernat and @maxim-lian, the variable name time and the simple greater-than filter in my example are misleading. In general, this issue is about using a boolean mask via isel being very slow. In my own code I cannot use your workaround, since my boolean mask is more complex.
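
For instance, a hypothetical mask like this (not my actual code) selects non-contiguous elements, so no single slice can express it:

time_filter = (ds.time % 3 == 0) | (ds.a > 54_000_000)
ds.a.isel(time=time_filter)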

rabernat commented 6 years ago

Another data point for the matrix of possibilities: it takes about half the time if you pass time_filter.values (a numpy array) rather than the time_filter DataArray:

%timeit ds.a.isel(time=time_filter.values)
1.3 s ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
shoyer commented 6 years ago

My measurements:

>>> %timeit ds.a.isel(time=time_filter)
1 loop, best of 3: 906 ms per loop
>>> %timeit ds.a.isel(time=time_filter.values)
1 loop, best of 3: 447 ms per loop
>>> %timeit ds.a.values[time_filter]
10 loops, best of 3: 169 ms per loop

Given the size of this gap, I suspect this could be improved with some investigation and profiling, but there is certainly an upper limit on the possible performance gain.

One simple example: indexing the dataset needs to index both 'a' and 'time', so it's going to be at least twice as slow as indexing 'a' alone. By that measure, the second indexing expression, ds.a.isel(time=time_filter.values), is only 447/(169*2) = 1.32 times slower than the best-case scenario.
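
As a rough sanity check (a sketch, not from the original thread), that numpy-level lower bound of indexing both arrays can be timed directly; it should land near twice the single-array time above:

%timeit (ds.a.values[time_filter.values], ds.time.values[time_filter.values])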

WeatherGod commented 6 years ago

I am looking into a similar performance issue with isel, but in my case the problem seems to be that it creates arrays much bigger than needed. For my multidimensional case (time/x/y/window), what should only take a few hundred MB spikes up to tens of GB of RAM. I don't know if this might be a related source of performance issues.

max-sixty commented 6 years ago

@WeatherGod do you have a reproducible example? I'm happy to have a look

WeatherGod commented 6 years ago

Huh, strange... I just tried a simplified version of what I was doing (particularly, no dask arrays), and everything worked fine. I'll have to investigate further.

WeatherGod commented 6 years ago

Just for posterity, though, here is my simplified (working!) example:

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.randn(10, 3000, 7000),
                  dims=('time', 'latitude', 'longitude'))
window = da.rolling(time=2).construct('win')
indexes = window.argmax(dim='win')
result = window.isel(win=indexes)
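
As a quick sanity check on the example above (a sketch; assumes no all-NaN windows): selecting each window's argmax along win should reproduce the plain window maximum.

expected = window.max(dim='win')
assert result.equals(expected)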
WeatherGod commented 6 years ago

Yeah, it looks like if da is backed by a dask array, and you do a .isel(win=indexes.compute()) (because otherwise isel seems to barf on dask indexers), then the memory usage shoots through the roof. Note that in my case the dask chunks are (1, 3000, 7000). If I do a window.load() prior to window.isel(), the memory usage is perfectly reasonable.
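
For reference, a minimal sketch of the dask-backed variant being described, reusing the example above (illustrative, not a confirmed reproducer):

da = da.chunk({'time': 1})
window = da.rolling(time=2).construct('win')
indexes = window.argmax(dim='win')
result = window.isel(win=indexes.compute())  # indexer computed eagerly; memory reportedly blows up here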

shoyer commented 6 years ago

@WeatherGod does adding something like da = da.chunk({'time': 1}) reproduce this with your example?

WeatherGod commented 6 years ago

No, it does not make a difference. The example above peaks at around 5GB of memory (a bit much, but manageable). And it peaks similarly if we chunk it like you suggested.

jhamman commented 6 years ago

@WeatherGod - are you reading data from netCDF files by chance?

If so, can you share the compression/chunk layout for those (ncdump -h -s file.nc)?

WeatherGod commented 6 years ago

It would be ten files opened via xr.open_mfdataset() and concatenated across a time dimension, each one looking like:

netcdf convect_gust_20180301_0000 {
dimensions:
    latitude = 3502 ;
    longitude = 7002 ;
variables:
    double latitude(latitude) ;
        latitude:_FillValue = NaN ;
        latitude:_Storage = "contiguous" ;
        latitude:_Endianness = "little" ;
    double longitude(longitude) ;
        longitude:_FillValue = NaN ;
        longitude:_Storage = "contiguous" ;
        longitude:_Endianness = "little" ;
    float gust(latitude, longitude) ;
        gust:_FillValue = NaNf ;
        gust:units = "m/s" ;
        gust:description = "gust winds" ;
        gust:_Storage = "chunked" ;
        gust:_ChunkSizes = 701, 1401 ;
        gust:_DeflateLevel = 8 ;
        gust:_Shuffle = "true" ;
        gust:_Endianness = "little" ;

// global attributes:
        :start_date = "03/01/2018 00:00" ;
        :end_date = "03/01/2018 01:00" ;
        :interval = "half-open" ;
        :init_date = "02/28/2018 22:00" ;
        :history = "Created 2018-09-12 15:53:44.468144" ;
        :description = "Convective Downscaling, format V2.0" ;
        :_NCProperties = "version=1|netcdflibversion=4.6.1|hdf5libversion=1.10.1" ;
        :_SuperblockVersion = 0 ;
        :_IsNetcdf4 = 1 ;
        :_Format = "netCDF-4" ;
max-sixty commented 5 years ago

In an effort to reduce the issue backlog, I'll close this, but please reopen if you disagree.

dcherian commented 5 years ago

On master I'm seeing

%timeit ds.a.isel(time=time_filter)
3.65 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ds.a.isel(time=time_filter.values)
2.99 s ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ds.a.values[time_filter]
227 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Can someone else reproduce?

shoyer commented 5 years ago

Yes, I'm seeing similar numbers: indexing into a DataArray is about 10x slower. This seems to have gotten slower over time. It would be good to track this down and add a benchmark!
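
For example, a benchmark along these lines could live in asv_bench/benchmarks/ (a sketch, not from the thread; the class and method names are made up):

import numpy as np
import xarray as xr

class BooleanIndexing:
    def setup(self):
        n = 10_000_000
        self.ds = xr.Dataset(
            {"a": ("time", np.arange(n))}, coords={"time": np.arange(n)}
        )
        self.time_filter = self.ds.time > 50_000

    def time_isel_bool_dataarray(self):
        # boolean mask as a DataArray (the slow path reported here)
        self.ds.a.isel(time=self.time_filter)

    def time_isel_bool_numpy(self):
        # boolean mask as a plain numpy array
        self.ds.a.isel(time=self.time_filter.values)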

shoyer commented 5 years ago

https://github.com/pydata/xarray/pull/3319 gives us about a 2x performance boost. It could likely be much faster, but at least this fixes the regression.

crusaderky commented 5 years ago

Before #3319:

%timeit ds.a.values[time_filter]
158 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit ds.a.isel(time=time_filter.values)
2.57 s ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ds.a.isel(time=time_filter)
3.12 s ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After #3319:

%timeit ds.a.values[time_filter]
158 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit ds.a.isel(time=time_filter.values)
665 ms ± 6.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ds.a.isel(time=time_filter)
1.15 s ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Good job!

crusaderky commented 5 years ago

Can we short-circuit the special case where the index of the array used for slicing is the same object as the index being sliced, so no alignment is needed?

>>> time_filter.time._variable is ds.time._variable
True
>>> %timeit xr.align(time_filter, ds.a)
477 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The time spent on that align call could be zero!
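
Something like this hypothetical helper (a sketch, not actual xarray internals) is what I have in mind:

def align_for_indexing(indexer, obj, dim):
    # Skip the align call entirely when indexer and obj share the
    # identical coordinate variable object along dim.
    if indexer[dim].variable is obj[dim].variable:
        return indexer, obj
    return xr.align(indexer, obj)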

dcherian commented 5 years ago

I think align already tries to optimize that case, so maybe there's something to fix there too?

shoyer commented 5 years ago

Yes, align checks index.equals(other) first, which has a shortcut for the same object.

The real mystery here is why time_filter.indexes['time'] and ds.indexes['time'] are not the same object. I guess this is likely due to lazy initialization of indexes, and should be fixed eventually by the explicit indexes refactor.
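
Concretely: even though the coordinate variable is shared (the identity check above returns True), the lazily-built pandas indexes come out as distinct objects:

>>> time_filter.indexes['time'] is ds.indexes['time']
False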

Hoeze commented 4 years ago

Hi, I'd like to understand exactly how isel works in conjunction with dask arrays. It seems that #3481 propagates the isel operation onto each dask chunk for lazy evaluation. Is this correct?

dcherian commented 4 years ago

I don't know much about indexing, but that PR propagates a "new" indexes property as part of #1603 (work towards enabling more flexible indexing); it doesn't change anything about indexing itself. I think the dask docs may be more relevant to what you are asking: https://docs.dask.org/en/latest/array-slicing.html

dschwoerer commented 1 year ago

I just changed

theisel = ds[k].isel(**slc, missing_dims="ignore")

to:

slcp = [slc[d] if d in slc else slice(None) for d in ds[k].dims]
theisel = ds[k].values[tuple(slcp)]

And that changed the runtime of my code from unknown (still running after 3 hours) to around 10 seconds.

ds[k] is a 3-dimensional array; the slc[d] are 7-d numpy arrays of integers.

shoyer commented 1 year ago

@dschwoerer are you sure that you are actually calculating the same thing in both cases? What exactly do the values of slc[d] look like? I would test things on smaller inputs to verify. My guess is that you are inadvertently calculating something different; recall that Xarray's broadcasting rules differ slightly from NumPy's.

dschwoerer commented 1 year ago

I see, they are not the same: the slow one is still a dask array, the other one is not:

     Sn       (r, theta, phi, sampling) float64 dask.array<chunksize=(14, 52, 2, 10), meta=np.ndarray>,
     Sn       (r, theta, phi, sampling) float64 nan nan nan nan ... nan nan nan

Otherwise they are the same, so this might be dask related ...

dcherian commented 1 year ago

A reproducible example would help, but indexing with dask arrays is a bit limited.

With https://github.com/pydata/xarray/pull/5873 it's possible it will raise an error and ask you to compute the indexer. Also see https://github.com/dask/dask/issues/4156

EDIT: your slowdown is probably because it's computing Sn multiple times. You could speed it up by calling compute once and passing a numpy array to isel.
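
Something along these lines (a sketch reusing the names from the earlier comment):

# Evaluate the dask-backed variable once, then index the in-memory result.
theisel = ds[k].compute().isel(**slc, missing_dims="ignore")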