pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev

Why does xr.apply_ufunc support numpy/dask.arrays? #8995

Open TomNicholas opened 6 months ago

TomNicholas commented 6 months ago

What is your issue?

@keewis pointed out that it's weird that xarray.apply_ufunc supports passing numpy/dask arrays directly, and I'm inclined to agree. I don't understand why we do, and I think we should consider removing that feature.

Two arguments in favour of removing it:

1) It exposes users to transposition errors

Consider this example:

In [1]: import xarray as xr

In [2]: import numpy as np

In [3]: arr = np.arange(12).reshape(3, 4)

In [4]: def mean(obj, dim):
   ...:     # note: apply always moves core dimensions to the end
   ...:     return xr.apply_ufunc(
   ...:         np.mean, obj, input_core_dims=[[dim]], kwargs={"axis": -1}
   ...:     )
   ...: 

In [5]: mean(arr, dim='time')
Out[5]: array([1.5, 5.5, 9.5])

In [6]: mean(arr.T, dim='time')
Out[6]: array([4., 5., 6., 7.])

Transposing the input leads to a different result, with the value of the dim kwarg effectively ignored. This is exactly the kind of error that xarray is designed to prevent.

2) There is an alternative input pattern that doesn't require accepting bare arrays

Instead, any numpy/dask array can just be wrapped up into an xarray Variable/NamedArray before passing it to apply_ufunc.

In [7]: from xarray.core.variable import Variable

In [8]: var = Variable(data=arr, dims=['time', 'space'])

In [9]: mean(var, dim='time')
Out[9]: 
<xarray.Variable (space: 4)> Size: 32B
array([4., 5., 6., 7.])

In [10]: mean(var.T, dim='time')
Out[10]: 
<xarray.Variable (space: 4)> Size: 32B
array([4., 5., 6., 7.])

This now guards against the transposition error, and puts the onus on the user to be clear about which axes of their array correspond to which dimension.

With Variable/NamedArray as public API, this latter pattern can handle every case that passing bare arrays in could.
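The same pattern also covers dask-backed data. Here is a minimal sketch (not from the session above, and assuming dask is installed) that redefines the mean helper with dask="allowed" added, so that dask-backed inputs are handed straight through to np.mean:

import dask.array as da
import numpy as np
import xarray as xr

def mean(obj, dim):
    # same helper as above, but with dask="allowed" so that dask-backed
    # inputs are passed directly to np.mean (which evaluates lazily)
    return xr.apply_ufunc(
        np.mean, obj, input_core_dims=[[dim]], kwargs={"axis": -1},
        dask="allowed",
    )

lazy = da.arange(12).reshape(3, 4)
var = xr.Variable(dims=["time", "space"], data=lazy)

result = mean(var, dim="time")     # a lazy Variable with dims ("space",)
print(result.compute().values)     # [4. 5. 6. 7.], matching the eager example above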

I suggest we deprecate accepting bare arrays in favour of having users wrap them in Variable/NamedArray/DataArray objects instead.

(Note 1: We also accept raw scalars, but this doesn't expose anyone to transposition errors.)
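A quick illustration of that point, reusing var and the imports from the session above (a sketch, not taken from the original issue): a scalar has no axes, so there is nothing to transpose.

xr.apply_ufunc(np.add, var, 2)  # the scalar 2 is broadcast against every element of var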

(Note 2: In a quick scan of the apply_ufunc docstring, the docs on it in computation.rst, and the extensive guide that @dcherian wrote in the xarray tutorial repository, I can't see any examples that actually pass bare arrays to apply_ufunc.)

gmoutso commented 5 months ago

I often find myself defining two versions of the same function, because I need the numpy version for use with scipy.optimize and the xarray version as a convenience, since named dims make it cleaner. So this topic interests me. Wrapping a numpy array in a Variable and using the xarray function is slower than apply_ufunc on the bare array, and both are slower than using the numpy function directly.

import numpy as np
import xarray as xr

rnd = np.random.default_rng()
x1 = rnd.normal(size=(10, 100))
x2 = xr.Variable(dims=["a", "b"], data=x1)

# plain numpy reduction
f1 = lambda x: x.mean(axis=-1)

# the same reduction via apply_ufunc
f2 = lambda x: xr.apply_ufunc(np.mean, x,
                              input_core_dims=[["b"]],
                              output_core_dims=[[]],
                              output_dtypes=[float],
                              kwargs={"axis": -1})

%timeit f1(x1)  # numpy function on a bare array
%timeit f2(x2)  # apply_ufunc on a Variable
%timeit f2(x1)  # apply_ufunc on a bare array

TomNicholas commented 5 months ago

Hi @gmoutso - I'm not clear from your comment whether you are for or against this change!

Wrapping a numpy array in a Variable and using the xarray function is slower than apply_ufunc on the bare array, and both are slower than using the numpy function directly.

Well yes, there will be some overhead from wrapping with a higher-level API. But that overhead shouldn't scale with the size of the array. Here are the results from running your benchmark just now for arrays of different sizes.

import time

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

rnd = np.random.default_rng()

# same two functions as above, renamed for clarity
np_fn = lambda x: x.mean(axis=-1)
xr_fn = lambda x: xr.apply_ufunc(np.mean, x,
                                 input_core_dims=[["b"]],
                                 output_core_dims=[[]],
                                 output_dtypes=[float],
                                 kwargs={"axis": -1})

sizes = [10, 100, 1000, 10000]

times_np_fn_np_arr = []
times_xr_fn_np_arr = []
times_xr_fn_xr_var = []
for size in sizes:
    np_arr = rnd.normal(size=(size, 10 * size))
    xr_var = xr.Variable(dims=["a", "b"], data=np_arr)

    # numpy function on a bare numpy array
    start_time = time.time()
    np_fn(np_arr)
    times_np_fn_np_arr.append(time.time() - start_time)

    # apply_ufunc on a bare numpy array
    start_time = time.time()
    xr_fn(np_arr)
    times_xr_fn_np_arr.append(time.time() - start_time)

    # apply_ufunc on an xarray Variable
    start_time = time.time()
    xr_fn(xr_var)
    times_xr_fn_xr_var.append(time.time() - start_time)

# Plot the results
plt.figure(figsize=(7, 5))
plt.plot(sizes, times_np_fn_np_arr, label='np_fn_np_arr', marker='o')
plt.plot(sizes, times_xr_fn_np_arr, label='xr_fn_np_arr', marker='o')
plt.plot(sizes, times_xr_fn_xr_var, label='xr_fn_xr_var', marker='o')
plt.xlabel('Size')
plt.ylabel('Execution Time (seconds)')
plt.title('Execution Time vs. Size for Different Functions')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.grid(True)
plt.show()
[Screenshot: log-log plot of execution time vs. array size for np_fn_np_arr, xr_fn_np_arr, and xr_fn_xr_var]

As expected, they converge to the same execution time as the arrays get larger, so I don't think this overhead is a problem.

gmoutso commented 4 months ago

Hi @TomNicholas. Thank you for your reply and the benchmarks.

I'm not clear from your comment whether you are for or against this change!

Actually I am neither for nor against. I just wanted to explain why this topic interests me, even if my own position is not clear to me either!

Currently I define two versions of each function for the same calculation. One is pure numpy, for the low-level calculations that get wrapped in a scipy.optimize call. The other is xarray-based, because it is easier to write, clearer to use, and much easier to use with xarray objects. Although the time difference is just wrapping overhead (as your plots suggest), that is still too costly for me, because scipy.optimize will call the function many, many times within a single optimization, e.g. when minimising an output that is a plain float (all dimensions are reduced away, so there are no output core dims), which serves as the objective of minimize.
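A rough sketch of that workflow (the function and variable names here are illustrative only): scipy.optimize.minimize evaluates the objective many times per call, so an apply_ufunc-based objective pays the wrapping overhead on every single evaluation, while the plain-numpy objective does not.

import numpy as np
import xarray as xr
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = xr.Variable(dims=["b"], data=rng.normal(size=100))

def objective_np(params):
    # pure-numpy objective: cheap enough to be called thousands of times
    return float(((data.values - params[0]) ** 2).mean())

def objective_xr(params):
    # xarray objective: same value, but pays apply_ufunc overhead on every call
    result = xr.apply_ufunc(
        lambda x: ((x - params[0]) ** 2).mean(axis=-1),
        data,
        input_core_dims=[["b"]],
    )
    return result.values.item()

res = minimize(objective_np, x0=[0.0])  # the fast numpy version is what gets used here
print(res.x)  # converges to the sample mean of `data`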

Using xarray functions with numpy arrays (as is the topic here) is potentially useful, but in my use case it is too slow compared to pure numpy. I am not suggesting a solution, only registering my interest in case it is useful to the discussion. Thank you!