jacobtomlinson opened 4 years ago
cc @quasiben
@jacobtomlinson Have you considered implementing an accessor for this and any other CuPy-specific functionality? This is the approach we are taking with Pint integration (pint-xarray), and I wonder if it might also be a good approach for CuPy.
@jthielen Something like this?
```python
import xarray as xr
import cupy as cp

@xr.register_dataset_accessor("cupy")
class CupyAccessor:
    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    def to_cupy(self):
        """Convert the data of all data variables to cupy."""
        for var in self._obj.data_vars:
            self._obj[var].data = cp.asarray(self._obj[var].data)
        return self._obj
```
Which would then be used like this.
```python
import xarray as xr
import cupy as cp

ds = xr.open_mfdataset("/path/to/files/*.nc")
gds = ds.cupy.to_cupy()
```
@jacobtomlinson Indeed! A similar `to_cupy` could also be made for existing DataArrays on a DataArray accessor, and any other CuPy-related attributes/methods could be attached to the same accessors as well.
Though, to have those accessors registered just by importing `xarray` and `cupy`, they would of course need to live in either `xarray` or `cupy`. Not sure if that or an xarray-adjacent package (`cupy-xarray`?) is better.
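A minimal sketch of what such a DataArray accessor could look like (hypothetical code, not an existing package; `np.asarray` and a pass-through stand in for `cp.asarray` and `.get()` so the sketch runs without a GPU):

```python
import numpy as np
import xarray as xr

@xr.register_dataarray_accessor("cupy")
class CupyDataArrayAccessor:
    """Hypothetical DataArray-level accessor mirroring the Dataset one."""

    def __init__(self, da):
        self._da = da

    def to_cupy(self):
        # A real implementation would call cp.asarray; np.asarray
        # stands in here so the sketch runs on a CPU-only machine.
        self._da.data = np.asarray(self._da.data)
        return self._da

    def from_cupy(self):
        # A real implementation would call .get() on the cupy array
        # to copy it back to host memory.
        return self._da
```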
This does sound like an option. However there are many situations within xarray where we need explicit cupy logic. Converting back to numpy before plotting is one example. I don't think that kind of logic can live in an accessor.
Unless you expect users to do something like this.
```python
import xarray as xr
import cupy as cp

ds = xr.tutorial.load_dataset("air_temperature")
gds = ds.cupy.to_cupy()

# Do some manipulation on the GPU

# Grab a time slice
time_slice = gds.air.isel(time=0)

time_slice.cupy.to_numpy().plot()  # I would hope that time_slice.plot() would work
```
I would be tempted to say that cupy is more like dask, in that it tries to implement the numpy array interface exactly but in a different paradigm (distributed, GPU, etc.). And of course there are limitations and divergences because of the different paradigm. However it's not the same as pint, which is trying to extend numpy and add more functionality.
So it makes sense to me that accessors for pint exist to add this extra functionality to xarray. But at least in theory cupy should be a drop-in replacement for numpy, so I don't expect a huge amount of logic will live in an accessor, compared to the amount of compatibility code that will need to go into xarray itself.
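The "drop-in replacement" idea rests on NEP-18 (`__array_function__`): numpy functions dispatch to whatever array type they are handed. A toy duck array (my own illustration, not cupy code) shows the mechanism cupy relies on:

```python
import numpy as np

class LoudArray:
    """Toy NEP-18 duck array: wraps a numpy array and intercepts
    numpy API calls via __array_function__, as cupy does for real."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap any LoudArray arguments, run the plain numpy
        # function, and re-wrap the result in a LoudArray.
        unwrapped = [a.data if isinstance(a, LoudArray) else a for a in args]
        return LoudArray(func(*unwrapped, **kwargs))

result = np.sum(LoudArray([1, 2, 3]))
print(type(result).__name__)  # LoudArray, not ndarray
```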
@jacobtomlinson Makes complete sense! Just wanted to make sure the option was considered as a possibility.
The only things I can think of that would make sense initially in an accessor would be `to_cupy`/`from_cupy` to move back and forth from the GPU, plus maybe an alias `get` for `from_cupy` to match the cupy API.
An accessor does seem like a reasonable place to put that logic, but it also seems like a tiny amount of code to make, ship and maintain a separate package for. Plus those methods will either need to be used or duplicated in the core codebase to support things like plotting.
The accessors could also be where wrapped xarray methods could go (so in your previous example, it could be `time_slice.cupy.plot()` to handle the NumPy conversion internally), but based on your explanations I'd agree that most of this makes more sense as compatibility code that goes in xarray itself.
See similar discussion for sparse here: https://github.com/pydata/xarray/issues/3245
`asarray` makes sense to me.
I think we are also open to special `as_sparse`, `as_dense`, `as_cupy` methods that return xarray objects with converted arrays.
A `to_numpy_data` method (or `as_numpy`?) would always coerce to numpy appropriately.
IIRC there's some way to read from disk to GPU, isn't there? So it makes sense to expose that in our `open_*` functions.
Re: index variables. Can we avoid this for now? Or are there going to be performance issues? The general problem will be handled as part of the index refactor (we've deferred pint support for indexes for this reason).
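A duck-typed coercion helper along those lines might look like this (a sketch only; `to_numpy_data` is the name floated above, and the cupy branch is keyed off the `.get()` method cupy arrays expose, with a fake class standing in for cupy so the example runs without a GPU):

```python
import numpy as np

def to_numpy_data(data):
    """Coerce a duck array back to numpy.

    cupy arrays copy device memory to host via .get(); anything else
    falls through to np.asarray (a no-op for numpy arrays).
    """
    if hasattr(data, "get") and not isinstance(data, np.ndarray):
        return data.get()
    return np.asarray(data)

class FakeCupyArray:
    """Stand-in for cupy.ndarray so the sketch runs without a GPU."""
    def __init__(self, data):
        self._data = np.asarray(data)
    def get(self):
        return self._data

print(type(to_numpy_data(FakeCupyArray([1, 2]))))  # <class 'numpy.ndarray'>
print(type(to_numpy_data(np.arange(3))))           # <class 'numpy.ndarray'>
```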
Those `as_` methods sound good. Would `as_dense` be the same as `as_numpy`?
Yeah, it is possible to read direct to GPU from storage with GDS. We've experimented a little with zarr. I expect that if something like zarr got GDS support and a zarr dataset was configured to use it, then `xr.open_zarr('zarr/path')` would already be backed by cupy, because the internal zarr array would be cupy.
> Re: index variables. Can we avoid this for now?

I think that would be best.
However I did run into issues when trying to run the Compare weighted and unweighted mean temperature example with cupy. In that example the `weights` data array is generated from the latitude index, so the resulting `DataArray` is backed by numpy. I would expect that if it were a cuDF Index it would end up as a cupy data array.
In my testing I just cast the `weights` data array to cupy and carried on. So perhaps with this change users will just need to sprinkle some `weights = weights.as_cupy()` or `weights = weights.asarray(cp.ndarray)` type calls throughout their code when they need to.
Wanted to give an update on some recent work. With NEP-35 (https://github.com/numpy/numpy/pull/16935, https://github.com/numpy/numpy/pull/17787) experimentally in NumPy, I've been exploring using GPUs with xarray a bit more, starting off with basic groupby workflows. There are places in the code where xarray calls pandas directly, for example when building Indexes:
This is going to be challenging, primarily because xarray will need to determine which DataFrame library to use during Index creation (and possibly for other DataFrame objects). While there is a consortium developing potential solutions for writing code that works across multiple dataframe libraries, I am going to keep hacking away and see what other issues may be lurking.
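One naive way to make that determination (a hypothetical sketch, not how xarray actually decides: dispatch on the array's module to pick the matching DataFrame library):

```python
import numpy as np

def dataframe_module_for(array):
    """Pick the DataFrame library matching an array's type, e.g. so an
    index built from a cupy array would be a cudf.Index rather than a
    pandas.Index. Hypothetical illustration, not xarray code."""
    root = type(array).__module__.split(".")[0]
    if root == "cupy":
        import cudf  # GPU DataFrame library; imported only when needed
        return cudf
    import pandas
    return pandas

print(dataframe_module_for(np.arange(3)).__name__)  # pandas
```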
The indexes story will change soon; we may even have our own index classes.
We should have pretty decent support for NEP-18 arrays in `DataArray.data` though, so IMO that's the best thing to try out and see where the issues remain.
NEP-35 is cool; looks like we should use it in our `*_like` functions.
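For reference, NEP-35 adds a `like=` argument to array-creation functions, so a `*_like`-style helper can produce an array from the same library as a reference array without importing that library (requires NumPy >= 1.20; shown here with numpy itself as the reference, though a cupy reference would yield a cupy array):

```python
import numpy as np

def zeros_matching(reference, shape):
    """Create a zeros array from the same array library as `reference`
    via the NEP-35 `like=` protocol. Hypothetical helper name."""
    return np.zeros(shape, like=reference)

out = zeros_matching(np.empty(0), (2, 3))
print(type(out).__name__, out.shape)  # ndarray (2, 3)
```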
**Is your feature request related to a problem? Please describe.**
In order to use Xarray with alternative array types like `cupy`, the user needs to be able to specify the underlying array type without digging into internals. Right now I'm doing something like this.
However this will become burdensome when there are many data arrays, and it feels brittle and prone to errors.
As I see it, a conversion could instead be done in a couple of places: on load, or via a utility method.
Currently Xarray supports NumPy and Dask arrays well. NumPy is the default, and the way you specify that a Dask array should be used is to pass the `chunks` kwarg to an `open_` function or to call `.chunk()` on a `Dataset` or `DataArray`.

Side note: There are a few places where the Dask array API bleeds into Xarray in order to have compatibility; the `chunks` kwarg/`chunk` method is one, the `.compute()` method is another. I'm hesitant to do this for other array types, although surfacing the `cupy.ndarray.get` method could feel natural for `cupy` users. But for now I think it would be best to treat Dask as a special case and try to be generic for everything else.

**Describe the solution you'd like**
For other array types I would like to propose the addition of an `asarray` kwarg for the `open_` methods and an `.asarray()` method on `Dataset` and `DataArray`. This should take either the array type `cupy.ndarray`, the asarray method `cp.asarray`, or preferably either. This would result in something like the following.
These operations would convert all data arrays to `cupy` arrays. For the case that `ds` is backed by Dask arrays, it would use `map_blocks` to cast each block to the appropriate array type.

It is still unclear what to do about index variables, which are currently of type `pandas.Index`. For `cupy` it may be more appropriate to use a `cudf.Index` instead, to ensure both are on the GPU. However this would add a dependency on cudf and potentially increase complexity here.

**Describe alternatives you've considered**
Instead of an `asarray` kwarg/method, something like `to_cupy`/`from_cupy` could be done. However I feel this makes less sense, because the object type is not changing, only that of the underlying data structure.

Another option would be to go more high-level with it. For example, a `gpu` kwarg and `to_gpu`/`from_gpu` methods could be added in the same way. This would abstract things even further and give users a choice about hardware rather than software. This would also be a fine solution, but I think it may special-case too much, and a more generic solution would be better.

**Additional context**

Related to #4212.
I'm keen to start implementing this. But would like some discussion/feedback before I dive in here.