pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Support __matmul__ operator (@) #1053

Closed chris-b1 closed 5 years ago

chris-b1 commented 7 years ago

xref https://github.com/pandas-dev/pandas/issues/10259

Presumably deferring to the semantics of np.matmul - not sure if that API is stable yet?

shoyer commented 7 years ago

For xarray, probably the right choice is for @ to be an alias for .dot(): http://xarray.pydata.org/en/stable/generated/xarray.DataArray.dot.html

The broadcasting semantics of np.matmul don't quite make sense because it broadcasts based on axis position, not name.
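A rough NumPy-only illustration of that position dependence (the shapes and the x/y/z axis names in the comments are made up for the example):

```python
import numpy as np

left = np.ones((2, 3))    # think of the axes as (x, y)
right = np.ones((3, 4))   # think of the axes as (y, z)

# np.matmul contracts the last axis of `left` with the
# second-to-last axis of `right`, purely by position.
product = np.matmul(left, right)

# The same data with its axes swapped would still be "(z, y)" to a
# name-aware library, but np.matmul only sees shapes (2, 3) and (4, 3).
try:
    np.matmul(left, right.T)
    positional_ok = True
except ValueError:
    positional_ok = False
```

Here `product` has shape `(2, 4)`, while the transposed call fails because nothing tells NumPy which axes correspond.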

dhimmel commented 7 years ago

Would love support for PEP 465 @ notation.

Recently, @ came in handy when multiplying numpy.ndarray with scipy.sparse matrices. We're considering xarray for our project and compatibility with this unified operator would be a real plus!

dhimmel commented 7 years ago

More specifically, I'd like to be able to do matrix multiplication between numpy ndarrays / matrices, scipy sparse matrices, and xarray DataArrays. @ seems like the most natural operator to enable this cross-package compatibility.

shoyer commented 7 years ago

> More specifically, I'd like to be able to do matrix multiplication between numpy ndarrays / matrices, scipy sparse matrices, and xarray DataArrays.

I'm intrigued, but how would this work? data_array + numpy_array yields a result with well-defined labels as long as numpy_array broadcasts against data_array.data, but data_array @ numpy_array does not if numpy_array has 2 or more dimensions.

I guess we could prohibit @ with non-vector other arguments, but I am still concerned that the suggested meaning of @ per PEP 465 and numpy depends on the order of array dimensions. Basically, the last dimension of the left-hand-side argument should be matched against the second-to-last (or last, for 1D) dimension of the right-hand-side for the tensor contraction. In xarray terms, we could match the last dimension of the left-hand-side with any matching dimensions (by name) of the right-hand-side, but it's still messily inconsistent with other xarray operations, which are generally agnostic to dimension order.

It also gets messy on Dataset objects, because the order of dimensions now becomes a bit more ambiguous: there's the order of dimensions on the Dataset itself, and the order on each DataArray in the dataset.

For these reasons, I'm leaning towards thinking that @ should be defined differently for xarray, and work like tensordot over all matching dimensions.
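A minimal sketch of what "tensordot over all matching dimensions" looks like with the existing `DataArray.dot()` method (array contents and dimension names are illustrative only):

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("x", "y"))
b = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=("y", "z"))

# .dot() sums over every dimension name the operands share (here only "y").
ab = a.dot(b)

# Transposing b changes its axis order but not its dimension names,
# so the contraction and the result stay the same.
ab_from_transposed = a.dot(b.transpose("z", "y"))
```

Both results carry dimensions `("x", "z")`, which is the order-agnostic behavior argued for above.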

dhimmel commented 7 years ago

First let me say, I know python, but I don't know linear algebra (I rely on @kkloste for algebra). I'm also new to xarray and recently used it for the first time to represent a hetnet (network with multiple node and relationship types) as an xarray.Dataset where each DataArray is an adjacency matrix (0 or 1 for whether an edge exists) for a specific edge type. I was drawn to xarray because it allows us to:

  1. assign row/column labels (representing node identity) to 2D arrays (adjacency matrices in our case)
  2. reason across multiple adjacency matrices by assigning dimension identities (node types)

The operations that we're using for our project are dot-product multiplying 2D arrays by 2D arrays and 1D arrays by 2D arrays. Currently, our arrays are numpy.ndarrays, but we may switch some of our 2D arrays to scipy.sparse matrices.

> I'm intrigued, but how would this work? data_array + numpy_array yields a result with well-defined labels as long as numpy_array broadcasts against data_array.data, but data_array @ numpy_array does not if numpy_array has 2 or more dimensions.

My intuition was that we use @ on a DataArray in cases where DataArray.values @ numpy.ndarray or numpy.ndarray @ DataArray.values would work. In these situations, the user would be responsible for ensuring numpy.ndarray had the correct coordinates and dimensions. We're also interested in DataArray.values @ scipy.sparse.

However, it appears that xarray may do some inference based on aligning dimensions/coordinates... and that I need to understand this process a bit more. Sorry if this reply doesn't help you move forward with this issue. Hopefully I will be able to be more helpful as I become more familiar with xarray.

> It also gets messy on Dataset objects

For clarity, I wasn't thinking of using @ on Datasets.

shoyer commented 7 years ago

> My intuition was that we use @ on a DataArray in cases where DataArray.values @ numpy.ndarray or numpy.ndarray @ DataArray.values would work.

Suppose data_array is a DataArray with dimensions ['x', 'y'] and numpy_array is a numpy.ndarray with a compatible shape. What should data_array @ numpy_array look like? The first dimension should be labeled x, but the second dimension doesn't have a name, so we'd need to come up with one somehow (every dimension in a DataArray must have a name).
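One way to sidestep the naming problem, sketched under the assumption that the caller is willing to name the extra dimension themselves (the name "z" and the array contents are hypothetical):

```python
import numpy as np
import xarray as xr

data_array = xr.DataArray(np.ones((2, 3)), dims=("x", "y"))
numpy_array = np.arange(12.0).reshape(3, 4)

# Dropping to raw values works, but the result is a plain ndarray
# with no dimension names at all.
raw = data_array.values @ numpy_array

# Wrapping the ndarray first makes the caller choose the names explicitly:
# "y" is contracted by name, and "z" is a label picked for this example.
other = xr.DataArray(numpy_array, dims=("y", "z"))
labeled = data_array.dot(other)
```

The wrapped version keeps `x` and gains the caller-chosen `z`, so no name has to be guessed on the user's behalf.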

> However, it appears that xarray may do some inference based on aligning dimensions/coordinates... and that I need to understand this process a bit more.

Indeed, see http://xarray.pydata.org/en/stable/computation.html#broadcasting-by-dimension-name

Hoeze commented 6 years ago

How about just keeping the current behavior? Currently a @ b just returns a new numpy array if either a or b is not an xr.DataArray. This makes perfect sense to me.

If both arrays are xr.DataArrays, I get an error which was rather unexpected. Here, xarray could simply stick to xr.DataArray.dot().

shoyer commented 6 years ago

Yes, we could definitely make @ between two xarray objects equivalent to xarray.dot().
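A small sketch of that equivalence, assuming an xarray release that includes the change referenced as #2987 below (the gene/disease labels and values are invented for illustration):

```python
import numpy as np
import xarray as xr

adjacency = xr.DataArray(
    np.array([[0, 1, 1], [1, 0, 0]]),
    dims=("gene", "disease"),
    coords={"gene": ["g1", "g2"], "disease": ["d1", "d2", "d3"]},
)
weights = xr.DataArray(
    [1.0, 2.0, 3.0],
    dims=("disease",),
    coords={"disease": ["d1", "d2", "d3"]},
)

# Both forms contract over the shared "disease" dimension by name.
via_operator = adjacency @ weights
via_function = xr.dot(adjacency, weights)
```

Both results are one-dimensional over `gene` and keep the `gene` coordinate labels.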

max-sixty commented 5 years ago

Closed by #2987