TomNicholas opened this issue 6 months ago
I often find myself defining two versions of the same function, because I need the numpy version to use with `scipy.optimize` and the xarray version as a convenience, since it is cleaner with named dims. So this topic interests me. Dressing a numpy array in a `Variable` and passing it through `apply_ufunc` is slower than passing the bare array to `apply_ufunc`, and both are slower than calling the numpy function directly:
```python
import numpy as np
import xarray as xr

rnd = np.random.default_rng()
x1 = rnd.normal(size=(10, 100))
f1 = lambda x: x.mean(axis=-1)  # plain numpy version

x2 = xr.Variable(dims=["a", "b"], data=x1)
f2 = lambda x: xr.apply_ufunc(  # apply_ufunc version, reducing over "b"
    np.mean,
    x,
    input_core_dims=[["b"]],
    output_core_dims=[[]],
    output_dtypes=[float],
    kwargs={"axis": -1},
)

%timeit f1(x1)  # numpy function, numpy array
%timeit f2(x2)  # apply_ufunc, Variable
%timeit f2(x1)  # apply_ufunc, bare numpy array
```
Hi @gmoutso - I'm not clear from your comment whether you are for or against this change!
> Dressing a numpy array in a `Variable` and passing it through `apply_ufunc` is slower than passing the bare array to `apply_ufunc`, and both are slower than calling the numpy function directly.
Well yes, there will be some overhead from wrapping with a higher-level API. But that overhead shouldn't scale with the size of the array. Here are the results I got just now when running your example for arrays of different sizes:
```python
import time
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

rnd = np.random.default_rng()
np_fn, xr_fn = f1, f2  # the numpy and apply_ufunc versions defined above

sizes = [10, 100, 1000, 10000]
times_np_fn_np_arr = []
times_xr_fn_np_arr = []
times_xr_fn_xr_var = []

for size in sizes:
    np_arr = rnd.normal(size=(size, 10 * size))
    xr_var = xr.Variable(dims=["a", "b"], data=np_arr)

    start_time = time.time()
    np_fn(np_arr)
    times_np_fn_np_arr.append(time.time() - start_time)

    start_time = time.time()
    xr_fn(np_arr)
    times_xr_fn_np_arr.append(time.time() - start_time)

    start_time = time.time()
    xr_fn(xr_var)
    times_xr_fn_xr_var.append(time.time() - start_time)

# Plot the results
plt.figure(figsize=(7, 5))
plt.plot(sizes, times_np_fn_np_arr, label='np_fn_np_arr', marker='o')
plt.plot(sizes, times_xr_fn_np_arr, label='xr_fn_np_arr', marker='o')
plt.plot(sizes, times_xr_fn_xr_var, label='xr_fn_xr_var', marker='o')
plt.xlabel('Size')
plt.ylabel('Execution Time (seconds)')
plt.title('Execution Time vs. Size for Different Functions')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.grid(True)
plt.show()
```
As expected, the three curves converge to the same execution time as the arrays grow: the wrapping overhead is fixed per call, so it becomes negligible relative to the actual computation. I don't think this overhead is a problem.
Hi @TomNicholas. Thank you for your reply and benchmarks.
> I'm not clear from your comment whether you are for or against this change!
Actually I am neither for nor against; I wanted to register why this topic interests me, even if the answer is not clear to me either!
Currently I define two versions of each function for the same calculation. One is pure numpy, for use in atomic calculations that will be wrapped in a `scipy.optimize` call. The other is xarray, because it is easier to write, clearer to use, and much easier to apply to xarray arrays. Although the time difference is just overhead (as your plots suggest), that is not good enough for me, because `scipy.optimize` will call the function many, many times for a single optimization run: e.g. minimizing an output that is a single float (all output dims are empty `output_core_dims`, so the result is the scalar objective that `minimize` needs). A rough sketch of this pattern is below.
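For concreteness, a minimal, hypothetical sketch of the "two versions" pattern (the `sse_np`/`sse_xr` names and the toy least-squares objective are illustrative assumptions, not from the discussion):

```python
import numpy as np
import xarray as xr
from scipy.optimize import minimize

rng = np.random.default_rng()
np_data = rng.normal(loc=3.0, size=(10, 100))
xr_data = xr.Variable(dims=["a", "b"], data=np_data)

def sse_np(params, data):
    # pure-numpy objective: cheap enough to call thousands of times
    return float(((data - params[0]) ** 2).mean(axis=-1).sum())

def sse_xr(params, data):
    # xarray objective: clearer with named dims, but the apply_ufunc
    # overhead is paid on every evaluation minimize makes
    mean_sq = xr.apply_ufunc(
        np.mean,
        (data - params[0]) ** 2,
        input_core_dims=[["b"]],
        output_core_dims=[[]],
        kwargs={"axis": -1},
    )
    return float(mean_sq.sum().values)

# minimize evaluates the objective many times for a single call,
# so the per-call overhead of the xarray version adds up:
result = minimize(sse_np, x0=np.array([0.0]), args=(np_data,))
print(result.x)  # approximately [3.]
# result_xr = minimize(sse_xr, x0=np.array([0.0]), args=(xr_data,))  # same answer, slower
```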
Using xarray functions with numpy arrays (the topic here) is potentially useful, but in my use case it is too slow compared to pure numpy. I am not suggesting a solution, only registering my interest, if that is useful to the discussion. Thank you!
What is your issue?
@keewis pointed out that it's weird that `xarray.apply_ufunc` supports passing numpy/dask arrays directly, and I'm inclined to agree. I don't understand why we do, and think we should consider removing that feature.

Two arguments in favour of removing it:

**1) It exposes users to transposition errors**

Consider this example:
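(A minimal reconstruction of the kind of example meant here, using a hypothetical `mean` helper that forwards a `dim` kwarg to `apply_ufunc`:)

```python
import numpy as np
import xarray as xr

def mean(obj, dim):
    # hypothetical helper: reduce over the named dimension
    return xr.apply_ufunc(
        np.mean,
        obj,
        input_core_dims=[[dim]],
        kwargs={"axis": -1},
    )

arr = np.arange(6).reshape(2, 3)  # conceptually dims ("x", "y")

# With a bare array, apply_ufunc can only assume the core dim is the
# *last* axis, so the name passed as `dim` carries no information:
print(mean(arr, dim="y"))    # [1. 4.]        -- reduces the length-3 axis
print(mean(arr.T, dim="y"))  # [1.5 2.5 3.5]  -- silently reduces the other axis
```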
Transposing the input leads to a different result, with the value of the `dim` kwarg effectively ignored. This kind of error is what xarray code is supposed to prevent by design.

**2) There is an alternative input pattern that doesn't require accepting bare arrays**
Instead, any numpy/dask array can just be wrapped up into an xarray `Variable`/`NamedArray` before passing it to `apply_ufunc`. This now guards against the transposition error, and puts the onus on the user to be clear about which axes of their array correspond to which dimension.
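Continuing the hypothetical sketch above, wrapping makes the dimension names travel with the data:

```python
var = xr.Variable(dims=["x", "y"], data=arr)
var_t = xr.Variable(dims=["y", "x"], data=arr.T)  # same data, axes swapped

# Both calls now reduce over "y" regardless of the in-memory axis order,
# because apply_ufunc matches core dims by name for labeled inputs:
print(mean(var, dim="y"))    # values [1. 4.]
print(mean(var_t, dim="y"))  # values [1. 4.]  -- transposition is harmless
```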
With `Variable`/`NamedArray` as public API, this latter pattern can handle every case that passing bare arrays in could. I suggest we deprecate accepting bare arrays in favour of having users wrap them in `Variable`/`NamedArray`/`DataArray` objects instead.

(Note 1: We also accept raw scalars, but this doesn't expose anyone to transposition errors.)

(Note 2: In a quick scan of the `apply_ufunc` docstring, the docs on it in `computation.rst`, and the extensive guide that @dcherian wrote in the xarray tutorial repository, I can't see any examples that actually pass bare arrays to `apply_ufunc`.)