helmholtz-analytics / heat

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
https://heat.readthedocs.io/
MIT License

Support distribution of `xarray` #1031

Closed: ClaudiaComito closed this issue 3 months ago

ClaudiaComito commented 1 year ago

See https://docs.xarray.dev/en/stable/

If I understand correctly, an xarray object is made up of the actual data array (an np.ndarray) and 1-D coordinate arrays (stored as a dictionary?) that map data dimensions and indices to meaningful physical quantities.

For example, if the xarray object is a time series of temperatures indexed by date, users will be able to perform:

mean_temp = xarray['2010_01_01', '2010_12_31'].mean()
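
In actual xarray syntax this would be label-based selection along the date dimension followed by a reduction; a minimal illustrative sketch with made-up data:

    import numpy as np
    import pandas as pd
    import xarray as xr

    # hypothetical daily temperature series for 2010
    temps = xr.DataArray(
        np.random.rand(365),
        dims="date",
        coords={"date": pd.date_range("2010-01-01", periods=365, freq="D")},
    )
    # label-based selection over a date range, then a reduction
    mean_temp = temps.sel(date=slice("2010-01-01", "2010-12-31")).mean()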

Feature functionality

Enable distribution of xarray objects, allow named dimensions, and keep track of the coordinate arrays, one of which will be distributed.

Example:

ht_xarray = ht.array(xarray, split="date")
ht_mean_temp = ht_xarray['2010_01_01', '2010_12_31'].mean()  # memory-distributed operation

Check out PyTorch's named tensors functionality (a short sketch follows).
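
Named tensors are a prototype feature in PyTorch; a minimal sketch of the API:

    import torch

    # dimensions carry labels instead of bare positions
    x = torch.randn(4, 3, names=("batch", "channel"))
    print(x.names)           # ('batch', 'channel')
    print(x.sum("channel"))  # reduce over a dimension by name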

Additional context: Initiating collaboration with N. Koldunov (@koldunovn) at the Alfred Wegener Institute (Helmholtz centre for polar and marine research). Also interesting for @kleinert-f and @ben-bou. Tagging @bhagemeier for help with implementation.

koldunovn commented 1 year ago

Coordinates are not necessarily 1-D arrays. For curvilinear grids covering the Earth's surface, for example, one has to describe the positions of grid points (e.g. their centres) as 2-D arrays (one for lon and one for lat), so the coordinate arrays will be 2-D.
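
For illustration, a minimal (made-up) curvilinear-grid example in xarray, with lon and lat as 2-D coordinate arrays over the logical (y, x) dimensions:

    import numpy as np
    import xarray as xr

    # 3x4 grid of cell centres; the physical position of each cell
    # needs both a longitude and a latitude value
    lon = np.linspace(-10.0, 10.0, 12).reshape(3, 4)
    lat = np.linspace(40.0, 50.0, 12).reshape(3, 4)
    temp = xr.DataArray(
        np.random.rand(3, 4),
        dims=("y", "x"),
        coords={"lon": (("y", "x"), lon), "lat": (("y", "x"), lat)},
    )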

ClaudiaComito commented 1 year ago

Got it, thanks @koldunovn, edited original post.

ClaudiaComito commented 1 year ago

Implementation discussed during the devs meeting on Sept 28. Preliminary scheme:

@Markus-Goetz, it would be good if you chimed in.

ClaudiaComito commented 1 year ago

Tagging @Mystic-Slice, as they have shown interest in working on this :wave:

ClaudiaComito commented 1 year ago

Interesting discussion over at array-api about single-node parallelism; xarray is also mentioned there in a distributed-execution context.

@TomNicholas you might be interested in this effort as well.

Mystic-Slice commented 1 year ago

Hi everyone! I have been thinking about this project for some time now, and here are my initial thoughts.

DXarray:

Xarray uses np.ndarray for its data storage. Since numpy does not support computation on GPUs, xarray objects cannot be directly chunked and stored within the new DXarray object.

There are two solutions:

  1. Using CuPy arrays within xarray (see cupy-xarray).
    • As far as I know, this should be very simple and straightforward to use.
    • This sounds like the best way to get most, if not all, of xarray's functionality without having to re-implement it.
    • My only problem with CuPy is that it is strictly GPU-based. It will limit flexibility unless we switch between numpy and cupy throughout the code. (But maybe cupy-xarray handles that internally and we don't have to worry about it?)
  2. Implement our own version of xarray using torch.Tensors.

    • Except for a few functionalities (like array broadcasting, grouping and data alignment), everything else should be easy enough to implement by translating the labels to the corresponding indices and redirecting to the existing DNDarray methods (see the sketch after this list).
    • Example:
      
      >>> import numpy as np
      >>> import pandas as pd
      >>> import xarray as xr
      >>> arr = xr.DataArray(
          data=np.arange(48).reshape(4, 2, 6),
          dims=("u", "v", "time"),
          coords={
              "u": [-3.2, 2.1, 5.3, 6.5],
              "v": [-1, 2.6],
              "time": pd.date_range("2009-01-05", periods=6, freq="M"),
          },
      )

    Single select

    arr.sel(u=5.3, time="2009-04-30") # array([27, 33])

    Translated to

    arr[2, :, 3] # array([27, 33])

    Multi select

    arr.sel(u=[-3.2, 6.5], time=slice("2009-02-28", "2009-03-31")) # array([[[ 1, 2], [ 7, 8]], [[37, 38], [43, 44]]])

    Translated to

    arr[[0, 3], :, slice(1, 2 + 1)] # array([[[ 1, 2], [ 7, 8]], [[37, 38], [43, 44]]])

    
    - But this means we would have to re-implement, ourselves, any new feature or functionality that Xarray releases in the future.
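
A minimal sketch of that label-to-index translation idea (the sel helper below is hypothetical, not part of Heat or xarray; it assumes in-memory numpy data and exact label matches):

    import numpy as np

    def sel(data, dims, coords, **indexers):
        # translate per-dimension labels into positional indices,
        # then fall back on ordinary positional indexing
        index = [slice(None)] * data.ndim
        for name, label in indexers.items():
            axis = dims.index(name)
            labels = list(coords[name])
            if isinstance(label, slice):
                # label-based slices are inclusive of the stop label
                index[axis] = slice(labels.index(label.start),
                                    labels.index(label.stop) + 1)
            elif isinstance(label, (list, tuple)):
                index[axis] = [labels.index(lbl) for lbl in label]
            else:
                index[axis] = labels.index(label)
        return data[tuple(index)]

    data = np.arange(48).reshape(4, 2, 6)
    dims = ("u", "v", "time")
    coords = {"u": [-3.2, 2.1, 5.3, 6.5], "v": [-1, 2.6], "time": list(range(6))}
    print(sel(data, dims, coords, u=5.3, time=3))                    # [27 33]
    print(sel(data, dims, coords, u=[-3.2, 6.5], time=slice(1, 2)))  # shape (2, 2, 2)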

I would love to know what you all think. Any suggestion is highly appreciated!

TomNicholas commented 1 year ago

Hi everyone, thanks for tagging me here!

I'm a bit unclear what you would like to achieve here - are you talking about:

(a) making DNDArray wrap xarray objects?
(b) wrapping DNDArray inside xarray objects? (related question: does DNDArray obey the Python array API standard?)
(c) wrapping DNDArray inside xarray objects, but also with an understanding of chunks? (similar to dask; this is what the issue you linked to is discussing, @ClaudiaComito)
(d) just creating DNDArray objects from xarray objects?

Some miscellaneous comments:

It will limit flexibility unless we switch between numpy and cupy throughout the code.

This is the point of the get_array_module pattern: you then don't need separate code paths for working with GPU vs CPU data.
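
For reference, a minimal sketch of that pattern (cupy.get_array_module returns numpy for CPU arrays and cupy for GPU arrays; the normalize helper is just an illustration):

    import numpy as np
    try:
        import cupy as cp
    except ImportError:  # CPU-only environments
        cp = None

    def normalize(x):
        # pick numpy or cupy based on where the data actually lives;
        # the same code then runs unchanged on CPU and GPU arrays
        xp = cp.get_array_module(x) if cp is not None else np
        return (x - xp.mean(x)) / xp.std(x)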

Implement our own version of xarray using torch.Tensors

There have been some discussions of wrapping pytorch tensors in xarray. We also have an ongoing project to publicly expose xarray.Variable, a lower-level abstraction better suited as a building block for named-tensor efforts like torch.Tensor. In a better-coordinated world this would have already happened, and pytorch would now have an optional dependency on xarray. :man_shrugging:

Mystic-Slice commented 1 year ago

Hi @TomNicholas! We want to create a distributed Xarray: something that can be chunked and distributed across processes. Most users would also prefer GPU-based computation. So I guess we are trying to achieve option (a), or to create a new distributed data structure of our own that mimics Xarray (which sounds like a lot of work 😅).

This should allow users to manipulate huge amounts of data while working with the more intuitive Xarray API.

@ClaudiaComito should be able to shed more light on this. Thanks for taking the time to speak with us!

koldunovn commented 1 year ago

Would xarray's experience with Dask help with creating the data structure you want? There are also implementations with GPU support: https://xarray.dev/blog/xarray-kvikio

mrfh92 commented 1 year ago

Meanwhile, I have implemented a basic version of DXarray in the corresponding branch. In the near future I plan to continue with:

Things that might get a bit more complicated:

TomNicholas commented 1 year ago

@ClaudiaComito, in #1154 it seems you started wrapping heat objects inside xarray, which is awesome! I recently improved the documentation on wrapping numpy-like arrays with xarray objects (https://github.com/pydata/xarray/pull/7911 and https://github.com/pydata/xarray/pull/7951).

Those extra pages in the docs aren't released yet, but for now you can view them here (on wrapping numpy-like arrays) and here (on wrapping distributed numpy-like arrays).
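
For illustration, this is how the wrapping approach already works with dask.array, which plays the role a chunk-aware DNDarray could take over (an illustrative sketch; assumes dask is installed):

    import dask.array as da
    import xarray as xr

    # dask provides the chunked, lazily evaluated duck array; xarray adds
    # named dimensions and label-based selection on top of it
    data = da.random.random((4, 365), chunks=(1, 100))
    arr = xr.DataArray(data, dims=("member", "time"))
    print(arr.mean(dim="time").compute())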

mrfh92 commented 1 year ago

The branch 1031-support-distribution-of-xarray now provides:

I will stop here until we have discussed the "wrapping approach" proposed by @TomNicholas within the team, because that approach would be much easier to implement (if it is applicable to Heat).

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 60 days since being marked as stale.