pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Add ability to change underlying array type #4234

Open jacobtomlinson opened 4 years ago

jacobtomlinson commented 4 years ago

Is your feature request related to a problem? Please describe.

In order to use Xarray with alternative array types like cupy, the user needs to be able to specify the underlying array type without digging into internals.

Right now I'm doing something like this.

import xarray as xr
import cupy as cp

ds = xr.tutorial.load_dataset("air_temperature")
ds.air.data = cp.asarray(ds.air.data)

However, this will become burdensome when there are many data arrays, and it feels brittle and error-prone.

As I see it, a conversion could instead be done in a couple of places: on load, or as a utility method.

Currently Xarray supports NumPy and Dask arrays well. NumPy is the default, and the way you specify that a Dask array should be used is to pass the chunks kwarg to an open_ function or to call .chunk() on a Dataset or DataArray.

Side note: There are a few places where the Dask array API bleeds into Xarray for compatibility; the chunks kwarg/.chunk() method is one, and the .compute() method is another. I'm hesitant to do this for other array types, although surfacing the cupy.ndarray.get method could feel natural for cupy users. For now I think it would be best to treat Dask as a special case and try to be generic for everything else.
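For reference, that Dask pathway looks like this today (the file path and variable name are placeholders):

import xarray as xr

# Passing `chunks` to an open_ function yields Dask-backed variables;
# .compute() loads everything back into memory as NumPy.
ds = xr.open_dataset("/path/to/file.nc", chunks={"time": 100})
print(type(ds.air.data))  # dask.array.core.Array
ds = ds.compute()         # now numpy-backed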

Describe the solution you'd like

For other array types, I would like to propose adding an asarray kwarg to the open_ methods and an .asarray() method on Dataset and DataArray. This should take either the array type (cupy.ndarray) or the conversion function (cp.asarray), or preferably accept either one.

This would result in something like the following.

import xarray as xr
import cupy as cp

ds = xr.open_mfdataset("/path/to/files/*.nc", asarray=cp.ndarray)

# or

ds = xr.open_mfdataset("/path/to/files/*.nc")
gds = ds.asarray(cp.ndarray)

These operations would convert all data arrays to cupy arrays. In the case where ds is backed by Dask arrays, it would use map_blocks to cast each block to the appropriate array type, as in the sketch below.
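A minimal sketch of that Dask case (assuming ds.air is Dask-backed, as after the open_mfdataset call above):

import cupy as cp

# Cast each block to cupy lazily rather than converting the
# whole array eagerly and pulling it into memory.
ds.air.data = ds.air.data.map_blocks(cp.asarray)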

It is still unclear what to do about index variables, which are currently of type pandas.Index. For cupy it may be more appropriate to use a cudf.Index instead, to ensure both data and indexes are on the GPU. However, this would add a dependency on cudf and potentially increase complexity here.

Describe alternatives you've considered

Instead of an asarray kwarg/method, something like to_cupy/from_cupy could be done. However, I feel this makes less sense because the type of the xarray object is not changing, only that of the underlying data structure.

Another option would be to go more high-level with it. For example, a gpu kwarg and to_gpu/from_gpu methods could be added in the same way. This would abstract things even further and give users a choice about hardware rather than software. This would also be a fine solution, but I think it may special-case too much; a more generic solution would be better.

Additional context

Related to #4212.

I'm keen to start implementing this, but would like some discussion/feedback before I dive in.

jacobtomlinson commented 4 years ago

cc @quasiben

jthielen commented 4 years ago

@jacobtomlinson Have you considered implementing an accessor for this and any other CuPy-specific functionality? This is the approach we are taking with Pint integration (pint-xarray), and I wonder if it might also be a good approach for CuPy.

jacobtomlinson commented 4 years ago

@jthielen Something like this?

import xarray as xr
import cupy as cp

@xr.register_dataset_accessor("cupy")
class CupyAccessor:
    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    def to_cupy(self):
        """Convert all data variables to cupy."""
        for var in self._obj.data_vars:
            self._obj[var].data = cp.asarray(self._obj[var].data)
        return self._obj

Which would then be used like this.

import xarray as xr
import cupy as cp

ds = xr.open_mfdataset("/path/to/files/*.nc")
gds = ds.cupy.to_cupy()

jthielen commented 4 years ago

@jacobtomlinson Indeed! A similar to_cupy could also be made for existing DataArrays on a DataArray accessor, and any other CuPy-related attributes/methods could be attached to the same accessors as well.

Though, for those accessors to be registered just by importing xarray and cupy, they would of course need to live in either xarray or cupy. Not sure whether that or an xarray-adjacent package (cupy-xarray?) is better.

jacobtomlinson commented 4 years ago

This does sound like an option. However, there are many situations within xarray where we need explicit cupy logic; converting back to numpy before plotting is one example. I don't think that kind of logic can live in an accessor.

Unless you expect users to do something like this.

import xarray as xr
import cupy as cp

ds = xr.tutorial.load_dataset("air_temperature")
gds = ds.cupy.to_cupy()

# Do some manipulation on the GPU

# Grab a time slice
time_slice = gds.air.isel(time=0)

time_slice.cupy.to_numpy().plot()  # I would hope that time_slice.plot() would work

I would be tempted to say that cupy is more like dask, in that it is trying to implement the numpy array interface exactly but in a different paradigm (distributed, GPU, etc.). Of course there are limitations and divergences because of the different paradigm. But it's not the same as pint, which is trying to extend numpy and add more functionality.

So it makes sense to me that accessors for pint exist to add this extra functionality to xarray. But at least in theory cupy should be a drop-in replacement for numpy. So I don't expect a huge amount of logic will live in an accessor, compared to the amount of compatibility code that will need to go into xarray itself.

jthielen commented 4 years ago

@jacobtomlinson Makes complete sense! Just wanted to make sure the option was considered as a possibility.

jacobtomlinson commented 4 years ago

The only things I can think of that would make sense initially in an accessor are to_cupy/from_cupy to move back and forth from the GPU, plus maybe aliasing from_cupy as get to match the cupy API.
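For illustration, a minimal sketch of the DataArray side of that accessor (names taken from this discussion, not an agreed API):

import xarray as xr
import cupy as cp

@xr.register_dataarray_accessor("cupy")
class CupyDataArrayAccessor:
    def __init__(self, da):
        self._da = da

    def from_cupy(self):
        """Return a numpy-backed copy of the DataArray."""
        return self._da.copy(data=cp.asnumpy(self._da.data))

    # Alias to match cupy.ndarray.get
    get = from_cupy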

An accessor does seem like a reasonable place to put that logic, but it also seems like a tiny amount of code to make, ship and maintain a separate package for. Plus those methods will either need to be used or duplicated in the core codebase to support things like plotting.

jthielen commented 4 years ago

The accessors could also be where wrapped xarray methods go (so in your previous example, it could be time_slice.cupy.plot(), handling the NumPy conversion internally), but based on your explanations I'd agree that most of this makes more sense as compatibility code that goes into xarray itself.
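A sketch of that wrapped-plot idea, reusing the hypothetical accessor pattern from above:

import xarray as xr
import cupy as cp

@xr.register_dataarray_accessor("cupy")
class CupyDataArrayAccessor:
    def __init__(self, da):
        self._da = da

    def plot(self, **kwargs):
        # Move data off the GPU, then delegate to xarray's normal plotting.
        numpy_da = self._da.copy(data=cp.asnumpy(self._da.data))
        return numpy_da.plot(**kwargs)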

dcherian commented 4 years ago

See similar discussion for sparse here: https://github.com/pydata/xarray/issues/3245

asarray makes sense to me.

I think we are also open to special as_sparse, as_dense, and as_cupy methods that return xarray objects with converted arrays.

A to_numpy_data method (or as_numpy?) would always coerce to numpy appropriately.
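A rough sketch of such a coercion, duck-typing on the conversion methods each library exposes (the cupy get and sparse todense calls are real; the helper itself is an assumption):

import numpy as np
import xarray as xr

def as_numpy(da: xr.DataArray) -> xr.DataArray:
    """Return a copy of `da` backed by a plain numpy array."""
    data = da.data
    if hasattr(data, "get"):         # cupy.ndarray
        data = data.get()
    elif hasattr(data, "todense"):   # sparse.COO
        data = np.asarray(data.todense())
    else:
        data = np.asarray(data)
    return da.copy(data=data)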

IIRC there's some way to read from disk to GPU, isn't there? So it makes sense to expose that in our open_* functions.

Re: index variables. Can we avoid this for now? Or are there going to be performance issues? The general problem will be handled as part of the index refactor (we've deferred pint support for indexes for this reason).

jacobtomlinson commented 4 years ago

Those as_ methods sound good. Would as_dense be the same as as_numpy?

Yeah, it is possible to read directly to the GPU from storage with GDS (GPUDirect Storage). We've experimented a little with zarr; I expect that if something like zarr got GDS support and a zarr dataset was configured to use it, then xr.open_zarr('zarr/path') would already be backed by cupy, because the zarr internal array would be a cupy array.

Re: index variables. Can we avoid this for now?

I think that would be best.

However, I did run into issues when trying to run the "Compare weighted and unweighted mean temperature" example with cupy. In that example the weights data array is generated from the latitude index, so the resulting DataArray is backed by numpy. I would expect that if it were a cuDF Index it would end up as a cupy data array.

In my testing I just cast the weights data array to cupy and carried on. So perhaps with this change users will just need to sprinkle weights = weights.as_cupy() or weights = weights.asarray(cp.ndarray) type calls throughout their code where needed.
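Concretely, the stop-gap in that example was along these lines (weights construction as in the linked docs example):

import numpy as np
import cupy as cp
import xarray as xr

ds = xr.tutorial.load_dataset("air_temperature")
ds.air.data = cp.asarray(ds.air.data)

# The weights come out numpy-backed because they are built from the
# latitude index, so cast them to cupy by hand before the reduction.
weights = np.cos(np.deg2rad(ds.air.lat))
weights.data = cp.asarray(weights.data)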

quasiben commented 3 years ago

Wanted to give an update on some recent work. With NEP 35 (https://github.com/numpy/numpy/pull/16935, https://github.com/numpy/numpy/pull/17787) experimentally in NumPy, I've been exploring GPU use with xarray a bit more, starting with basic groupby workflows. There are places in the code where xarray calls pandas directly, for example when building Indexes:

https://github.com/pydata/xarray/blob/7152b41fa80a56db0ce88b241fbe4092473cfcf0/xarray/core/dataset.py#L150-L153

This is going to be challenging, primarily because xarray will need to determine which DataFrame library to use during Index creation (and possibly for other DataFrame objects). While there is a consortium developing potential solutions for libraries that want to leverage multiple dataframe libraries, I was going to keep hacking away and see what other issues may be lurking.

dcherian commented 3 years ago

The indexes story will change soon; we may even have our own index classes.

We should have pretty decent support for NEP-18 arrays in DataArray.data though, so IMO that's the best thing to try out and see where the issues remain.

NEP 35 is cool; it looks like we should use it in our *_like functions.
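For reference, NEP 35's like= kwarg dispatches array creation functions to the reference array's library, which is what the *_like helpers would want:

import numpy as np
import cupy as cp

gpu = cp.arange(5)
ones = np.ones(5, like=gpu)  # dispatched via __array_function__
print(type(ones))            # cupy.ndarray, not numpy.ndarray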