pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

apply_ufunc should preemptively broadcast #3032

Open OriolAbril opened 5 years ago

OriolAbril commented 5 years ago

Code Sample

I am having some trouble understanding apply_ufunc's broadcasting rules. Since I also had trouble understanding the docs, I am not 100% sure this is a bug, but I am fairly confident it is. I will try to explain why with the following really simple example.

import xarray as xr
import numpy as np

a = xr.DataArray(data=np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
c = xr.DataArray(data=np.random.normal(size=(5, 6)), dims=["dim3", "dim4"])

def func(x, y):
    # Print the shapes apply_ufunc passes in; returning None is intentional.
    print(x.shape)
    print(y.shape)
    return

The function defined above always makes apply_ufunc raise an error (because it returns None), but this is intentional: the shapes have already been printed by then, and it keeps the example as simple as possible.

Problem description

xr.apply_ufunc(func, a, c)
# Out
# (7, 3, 1, 1)
# (5, 6)

Here, a has been partially broadcast, but I would expect the shapes of a and c to be the same as when calling xr.broadcast, since there are no input core dims and therefore all dimensions should be broadcast. However:

print([ary.shape for ary in xr.broadcast(a,c)])
# [(7, 3, 5, 6), (7, 3, 5, 6)]

Using different input core dims does not get rid of the problem; instead, I believe it exposes further issues:

xr.apply_ufunc(func, a, c, input_core_dims=[["dim1"],[]])
# (3, 1, 1, 7), expected (3, 5, 6, 7)
# (5, 6), expected (3, 5, 6)

xr.apply_ufunc(func, a, c, input_core_dims=[[],["dim3"]])
# (7, 3, 1), expected (7, 3, 6)
# (6, 5), expected (7, 3, 6, 5)

xr.apply_ufunc(func, a, c, input_core_dims=[["dim1"],["dim3"]])
# (3, 1, 7), expected (3, 6, 7)
# (6, 5), expected (3, 6, 5)

Is this current behaviour expected?

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8 (default, Jan 14 2019, 11:02:34) [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
python-bits: 64
OS: Linux
OS-release: 4.15.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.2
libnetcdf: 4.6.3

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.4
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.1.0
cartopy: None
seaborn: None
setuptools: 41.0.0
pip: 19.1.1
conda: None
pytest: 4.5.0
IPython: 7.5.0
sphinx: 2.0.1
max-sixty commented 5 years ago

Thanks for the issue & code sample @OriolAbril

IIUC, func needs to do this broadcasting itself; from the apply_ufunc docstring:

func : callable
    Function to call like ``func(*args, **kwargs)`` on unlabeled arrays
    (``.data``) that returns an array or tuple of arrays. If multiple
    arguments with non-matching dimensions are supplied, this function is
    expected to vectorize (broadcast) over axes of positional arguments in
    the style of NumPy universal functions [1]_ (if this is not the case,
    set ``vectorize=True``). If this function returns multiple outputs, you
    must set ``output_core_dims`` as well.
OriolAbril commented 5 years ago

Then shouldn't a in the first example keep its original shape?

max-sixty commented 5 years ago

Because func receives unlabelled arrays, apply_ufunc aligns the axes order. So if they share a dimension:

In [8]: import xarray as xr
   ...: import numpy as np
   ...:
   ...: a = xr.DataArray(data=np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
   ...: c = xr.DataArray(data=np.random.normal(size=(7, 6)), dims=["dim1", "dim4"])  # <- change here
   ...:
   ...: def func(x,y):
   ...:     print(x.shape)
   ...:     print(y.shape)
   ...:     return x
   ...:

In [9]: xr.apply_ufunc(func, a, c)
(7, 3, 1)
(7, 1, 6)

...otherwise func wouldn't know how to align.

Another option, for your original example, would be to insert length-1 axes in all missing positions rather than only 'forward filling', e.g.

xr.apply_ufunc(func, a, c)
# Out
# (7, 3, 1, 1)
# (1, 1, 5, 6) # <- change here

I think it operates without that step because functions 'in the wild' will generally handle that themselves, but that's a guess and needs someone who knows this better to weigh in.
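
For what it's worth, with the shapes from the original example a NumPy-aware func can already finish the broadcast itself, because apply_ufunc has put the axes in a consistent order. A minimal sketch (using empty arrays just to show the shapes):

import numpy as np

# Shapes exactly as func receives them in the original example:
# x is `a` with trailing length-1 axes appended, y is `c` unchanged.
x = np.empty((7, 3, 1, 1))
y = np.empty((5, 6))

# NumPy's broadcasting rules left-pad y to (1, 1, 5, 6), so both arrays
# broadcast against each other to the full (7, 3, 5, 6) shape.
xb, yb = np.broadcast_arrays(x, y)
print(xb.shape, yb.shape)  # (7, 3, 5, 6) (7, 3, 5, 6)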

shoyer commented 5 years ago

For what it's worth, I agree that this behavior is a little surprising. We should probably make apply_ufunc() explicitly broadcast arrays first.

max-sixty commented 5 years ago

We should probably make apply_ufunc() explicitly broadcast arrays first.

To confirm, so that we have something like this?

xr.apply_ufunc(func, a, c)
# Out
# (7, 3, 5, 6)
# (7, 3, 5, 6) 
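
For reference, this is roughly what you can already get today by broadcasting explicitly before the call. A minimal sketch, reusing the setup from the original example but with func returning x so that apply_ufunc does not error:

import numpy as np
import xarray as xr

a = xr.DataArray(np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
c = xr.DataArray(np.random.normal(size=(5, 6)), dims=["dim3", "dim4"])

def func(x, y):
    print(x.shape)
    print(y.shape)
    return x

# Broadcasting up front hands func the fully broadcast arrays:
xr.apply_ufunc(func, *xr.broadcast(a, c))
# (7, 3, 5, 6)
# (7, 3, 5, 6)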
shoyer commented 5 years ago

Yes, exactly.

max-sixty commented 5 years ago

I'm trying to think whether there would be any performance cost there - i.e. are there any arrays where preemptive broadcasting would be both expensive and unnecessary?

OriolAbril commented 5 years ago

I'm trying to think whether there would be any performance cost there - i.e. are there any arrays where preemptive broadcasting would be both expensive and unnecessary?

Even if there were a performance cost (compared to the current behaviour), it could easily be avoided by passing all dims as input_core_dims, couldn't it? IIUC, all dims should be broadcast unless they are in input_core_dims, so broadcasting could still be avoided without problems.
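
A minimal sketch of that opt-out, reusing the arrays from the original example: every dim is listed as a core dim, so there is nothing left to broadcast and func sees the original shapes (output_core_dims is only there so apply_ufunc accepts the 2-D return value):

import numpy as np
import xarray as xr

a = xr.DataArray(np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
c = xr.DataArray(np.random.normal(size=(5, 6)), dims=["dim3", "dim4"])

def func(x, y):
    print(x.shape)
    print(y.shape)
    return x

# With every dimension declared as a core dim there are no broadcast dims,
# so each array is passed through with its original shape:
xr.apply_ufunc(
    func, a, c,
    input_core_dims=[["dim1", "dim2"], ["dim3", "dim4"]],
    output_core_dims=[["dim1", "dim2"]],
)
# (7, 3)
# (5, 6)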

shoyer commented 5 years ago

With NumPy arrays at least, there is no cost for broadcasting, because it can always be done with views. But even for other array types, inserting size 1 dimensions in the correct location should be basically free, and would be more helpful than what we currently do.
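
A quick illustration of the zero-cost point with plain NumPy (a sketch, not xarray code): broadcasting only creates a view with stride 0 along the new axes, so no data is copied:

import numpy as np

x = np.random.normal(size=(7, 3))

# Insert the two missing axes and broadcast to the full shape; this returns
# a read-only view whose broadcast axes have stride 0, i.e. no copy is made.
y = np.broadcast_to(x[:, :, None, None], (7, 3, 5, 6))

print(y.strides)               # (24, 8, 0, 0) for float64: zero strides on the new axes
print(np.shares_memory(x, y))  # True, still a view of the original buffer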

max-sixty commented 5 years ago

@shoyer thanks for the clarity

@OriolAbril would you mind if we changed this issue's title to "apply_ufunc should preemptively broadcast"?

OriolAbril commented 5 years ago

@max-sixty Not at all, whatever is best. I actually opened the issue without being 100% sure it was one.