Supporting out-of-core computation/indexing for very large indexes

benbovy commented 8 years ago

(Follow-up of discussion here https://github.com/pydata/xarray/pull/1024#issuecomment-258524115).

xarray + dask.array successfully enable out-of-core computation for very large variables that doesn't fit in memory. One current limitation is that the indexes of a Dataset or DataArray, which rely on pandas.Index, are still fully loaded into memory (it will be soon loaded eagerly after #1024). In many cases this is not a problem, as the sizes of 1-dimensional indexes are usually much smaller than the sizes of n-dimensional variables or coordinates.

However, this may be problematic in some specific cases where we have to deal with very large indexes. As an example, big unstructured meshes often have coordinates (x, y, z) arranged as 1-d arrays of length that equals the number of nodes, which can be very large!! (See, e.g., ugrid conventions).

It would be very nice if xarray could also help for these use cases. Therefore I'm wondering if (and how) out-of-core support can be extended to indexes and indexing.

I've briefly looked at the documentation on dask.dataframe, and a first naive approach I have in mind would be to allow partitioning an index into multiple, contiguous indexes. For label-based indexing, we might for example map indexing.convert_label_indexer to each partition and combine the returned indexers.

My knowledge of dask is very limited, though. So I've no doubt that this suggestion is very simplistic and not very efficient, or that there are better approaches. I'm also certainly missing other issues not directly related to indexing.

Any thoughts?

cc @shoyer @mrocklin

shoyer commented 8 years ago

For unstructured meshes of points, pandas.MultiIndex is not the right abstraction.

Suppose you have a (very long) list of sorted points (x, y, z) in a multi-index. You can efficiently query within fixed bounds along x by doing binary search. But for queries in y and z, you cannot do any better than looking through the entire list. Moreover, pandas.MultiIndex factorizes each level into unique values, which is a complete waste on an unstructured grid where few coordinate overlap.

For unstructured meshes, you need something like a KDTree (see discussion in https://github.com/pydata/xarray/issues/475), with ideally with nearby points in space stored in contiguous array chunks.

I would start with trying to get an in-memory KDTree working, and then switch to something out of core only when/if necessary. For example, SciPy's cKDTree can load 1e7 points in 3-dimensions in only a few seconds:

x = np.random.rand(int(1e7), 3)
%time tree = scipy.spatial.cKDTree(x, leafsize=100)
# CPU times: user 2.58 s, sys: 0 ns, total: 2.58 s
# Wall time: 2.55 s

The might be good enough.

benbovy commented 8 years ago

Yes I understand that using a pandas.MultiIndex for such case is not efficient at all for both indexing and regarding memory usage and index building.

My example was actually not complete, since I also have categorical indexes such as a few regions defined in space (with complex geometries) and node types (e.g., boundary, active, inactive). Sorry not to have mentioned that.

a KDTree is indeed good for indexing on space coordinates. Looking at the API you suggest in #475, my (2-d) mesh might look like this:

>>> ds
<xray.Dataset>
Dimensions:         (node: 10000000)
Coordinates:
  * node
  - region          (node) int32 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...
  - node_type       (node) object 'b' 'b' 'b' 'b' 'a' 'a' 'a' 'a' 'a' 'a' 'a' ...
  * spatial_index   (node) KDTree
  - x               (node) float64 49.754 56.823 65.765 93.058 96.691 105.832 ...
  - y               (node) float64 37.582 45.769 58.672 77.029 82.983 99.672 ...
Data variables:
    topo_elevation  (node) float64 57.352 48.710 47.080 49.072 33.184 54.833 ...

Anyway, maybe I've opened this issue a bit too early since my data still fits into memory, though it is likely that I'll have to deal with meshes of 1e8 to 1e9 nodes in a near future.

Side note: I don't know why I get much worse performance on my machine when building the KDTree? (Intel(R) Xeon(R) CPU x4 5160 @ 3.00GHz, 16 Gb RAM, scipy 0.18.1, numpy 1.11.2)

In [3]: x = np.random.rand(int(1e7), 3)

In [4]: %time tree = scipy.spatial.cKDTree(x, leafsize=100)
CPU times: user 38 s, sys: 64 ms, total: 38.1 s
Wall time: 38.1 s

rabernat commented 8 years ago

My cKDTree time was:

19.2 s on a 32-core Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz, 512 GB RAM.
23 s on my Macbook (1.7 GHz Intel Core i7)

stale[bot] commented 5 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically

TomAugspurger commented 3 years ago

Should this and https://github.com/pydata/xarray/issues/1650 be consolidated into a single issue? I think that they're duplicates of eachother.

pydata / xarray

Supporting out-of-core computation/indexing for very large indexes #1094