pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

merge and align DataArrays/Datasets on different domains #742

Closed jcmgray closed 7 years ago

jcmgray commented 8 years ago

Firstly, I think xarray is great, and for the type of physics simulations I run, n-dimensional labelled arrays are exactly what I need. But, and I may be missing something, is there a way to merge (or concatenate/update) DataArrays with different domains on the same coordinates?

For example consider this setup:

import xarray as xr

x1 = [100]
y1 = [1, 2, 3, 4, 5]
dat1 = [[101, 102, 103, 104, 105]]

x2 = [200]
y2 = [3, 4, 5, 6]  # different size and domain
dat2 = [[203, 204, 205, 206]]

da1 = xr.DataArray(dat1, dims=['x', 'y'], coords={'x': x1, 'y': y1})
da2 = xr.DataArray(dat2, dims=['x', 'y'], coords={'x': x2, 'y': y2})

I would like to aggregate such DataArrays into a new, single DataArray with nan padding such that:

>>> merge(da1, da2, align=True)  # made up syntax
<xarray.DataArray (x: 2, y: 6)>
array([[ 101.,  102.,  103.,  104.,  105.,   nan],
       [  nan,   nan,  203.,  204.,  205.,  206.]])
Coordinates:
  * x        (x) int64 100 200
  * y        (y) int64 1 2 3 4 5 6

Here is a quick function I wrote to do this, but I would be worried about the performance of 'expanding' the new data to the old data's size on every iteration (i.e. supposing that the first argument is a large DataArray that you are adding to, which doesn't necessarily already contain the new dimensions).

def xrmerge(*das, accept_new=True):
    da = das[0]
    for new_da in das[1:]:
        # Expand both to have same dimensions, padding with NaN
        da, new_da = xr.align(da, new_da, join='outer')
        # Fill NaNs one way or the other re. accept_new
        da = new_da.fillna(da) if accept_new else da.fillna(new_da)
    return da
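
Applied to the da1/da2 defined above, this helper should give the NaN-padded result requested; here is a self-contained check (the inputs are re-defined so the snippet runs on its own):

```python
import numpy as np
import xarray as xr

def xrmerge(*das, accept_new=True):
    da = das[0]
    for new_da in das[1:]:
        # Expand both to have same dimensions, padding with NaN
        da, new_da = xr.align(da, new_da, join='outer')
        # Fill NaNs one way or the other re. accept_new
        da = new_da.fillna(da) if accept_new else da.fillna(new_da)
    return da

da1 = xr.DataArray([[101, 102, 103, 104, 105]], dims=['x', 'y'],
                   coords={'x': [100], 'y': [1, 2, 3, 4, 5]})
da2 = xr.DataArray([[203, 204, 205, 206]], dims=['x', 'y'],
                   coords={'x': [200], 'y': [3, 4, 5, 6]})

merged = xrmerge(da1, da2)
# merged has shape (2, 6): y becomes the union [1..6], with NaN
# wherever neither input had data
```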

Might this be (or is this already!) possible in simpler form in xarray? I know Datasets have merge and update methods but I couldn't make them work as above. I also notice there are possible plans ( #417 ) to introduce a merge function for DataArrays.

shoyer commented 8 years ago

This is actually closer to the functionality of concat than merge. Hypothetically, something like the following would do what you want:

# note: this is *not* valid syntax currently! the dims argument
# does not yet exist.
# this would hypothetically only align along the 'y' dimension, not 'x'
aligned = xr.align(*das, join='outer', dims='y')
combined = xr.concat(aligned, dim='x')

In cases where each array does not already have the dimension you want to concat along, this already works fine, because you can simply omit dims in align.
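
Until a dims argument exists on align, one way to emulate the partial alignment with current APIs is to reindex each array onto the union of the 'y' coordinate by hand, leaving 'x' untouched, and then concat (a sketch using the da1/da2 from the original example):

```python
import numpy as np
import xarray as xr

da1 = xr.DataArray([[101, 102, 103, 104, 105]], dims=['x', 'y'],
                   coords={'x': [100], 'y': [1, 2, 3, 4, 5]})
da2 = xr.DataArray([[203, 204, 205, 206]], dims=['x', 'y'],
                   coords={'x': [200], 'y': [3, 4, 5, 6]})

# union of the 'y' labels, then reindex only along 'y' (NaN-padding
# the missing entries) before concatenating along 'x'
all_y = np.union1d(da1.coords['y'].values, da2.coords['y'].values)
combined = xr.concat([da.reindex(y=all_y) for da in (da1, da2)], dim='x')
```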

JamesPHoughton commented 8 years ago

I'm having a similar issue, with the added complexity that I want to concatenate across multiple dimensions. I'm not sure if that's a cogent way to explain it, but here's an example. I have:

m = xr.DataArray(data=[[[1.1, 1.2, 1.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['D'], 'Dim3': ['F']})
n = xr.DataArray(data=[[[2.1, 2.2, 2.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['E'], 'Dim3': ['F']})
o = xr.DataArray(data=[[[3.1, 3.2, 3.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['D'], 'Dim3': ['G']})
p = xr.DataArray(data=[[[4.1, 4.2, 4.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['E'], 'Dim3': ['G']})

Which I want to merge into a single, fully populated array similar to what I'd get if I did:

data =[[[ 1.1,  1.2,  1.3],
        [ 3.1,  3.2,  3.3]],

       [[ 2.1,  2.2,  2.3],
        [ 4.1,  4.2,  4.3]]]

xr.DataArray(data=data, 
             coords={'Dim1': ['A', 'B', 'C'], 'Dim2':['D', 'E'], 'Dim3':['F', 'G']})

i.e.

<xarray.DataArray (Dim2: 2, Dim3: 2, Dim1: 3)>
array([[[ 1.1,  1.2,  1.3],
        [ 3.1,  3.2,  3.3]],

       [[ 2.1,  2.2,  2.3],
        [ 4.1,  4.2,  4.3]]])
Coordinates:
  * Dim2     (Dim2) |S1 'D' 'E'
  * Dim3     (Dim3) |S1 'F' 'G'
  * Dim1     (Dim1) |S1 'A' 'B' 'C'

@jcmgray's function is pretty close, although the array indices are described slightly differently (I'm not sure if this is a big deal or not...). Note the 'object' type for Dim2 and Dim3:

<xarray.DataArray (Dim2: 2, Dim3: 2, Dim1: 3)>
array([[[ 1.1,  1.2,  1.3],
        [ 3.1,  3.2,  3.3]],

       [[ 2.1,  2.2,  2.3],
        [ 4.1,  4.2,  4.3]]])
Coordinates:
  * Dim2     (Dim2) object 'D' 'E'
  * Dim3     (Dim3) object 'F' 'G'
  * Dim1     (Dim1) |S1 'A' 'B' 'C'

It would be great to have a canonical way to do this. What should I try?
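
For what it's worth, one approach that works with the existing concat machinery is to combine along one dimension at a time: first along 'Dim3', then along 'Dim2'. A self-contained sketch (the dims argument is supplied explicitly so the arrays build unambiguously):

```python
import xarray as xr

dims = ['Dim2', 'Dim3', 'Dim1']
dim1 = {'Dim1': ['A', 'B', 'C']}

m = xr.DataArray([[[1.1, 1.2, 1.3]]], dims=dims,
                 coords={**dim1, 'Dim2': ['D'], 'Dim3': ['F']})
n = xr.DataArray([[[2.1, 2.2, 2.3]]], dims=dims,
                 coords={**dim1, 'Dim2': ['E'], 'Dim3': ['F']})
o = xr.DataArray([[[3.1, 3.2, 3.3]]], dims=dims,
                 coords={**dim1, 'Dim2': ['D'], 'Dim3': ['G']})
p = xr.DataArray([[[4.1, 4.2, 4.3]]], dims=dims,
                 coords={**dim1, 'Dim2': ['E'], 'Dim3': ['G']})

# concat m/o and n/p along 'Dim3' (each pair shares its Dim2 label),
# then the two results along 'Dim2' -> fully populated (2, 2, 3) array
combined = xr.concat([xr.concat([m, o], dim='Dim3'),
                      xr.concat([n, p], dim='Dim3')], dim='Dim2')
```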

jcmgray commented 8 years ago

Just a comment that the appearance of object dtypes is likely due to the fact that numpy's NaN is inherently a float, so this will be an issue for any method with an intermediate 'missing data' stage whenever non-floats are being used.

I still use the align-and-fillna method since I mostly deal with floats/complex numbers, although @shoyer's suggestion of a partial align and then concat could definitely be cleaner when the added coordinates are all 'new'.

shoyer commented 8 years ago

I think this could make it into merge, which I am in the process of refactoring in https://github.com/pydata/xarray/pull/857.

The key difference from @jcmgray's implementation that I would want is a check to make sure that the data is all on different domains when using fillna. merge should not run the risk of removing non-NaN data.

@JamesPHoughton I agree with @jcmgray that the dtype=object is what you should expect here. It's hard to create fixed length strings in xarray/pandas because that precludes the possibility of missing values, so we tend to convert strings to object dtype when merged/concatenated.

JamesPHoughton commented 8 years ago

Something akin to the pandas DataFrame update method would have value - then you could create an empty array structure and populate it as necessary:

import pandas as pd
df = pd.DataFrame(index=range(5), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(index=range(3), columns=['a'], data=range(3))
df.update(df2)
print(df)
     a    b    c    d
0    0  NaN  NaN  NaN
1    1  NaN  NaN  NaN
2    2  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

But I'm not sure if empty array construction is supported?

jcmgray commented 8 years ago

Yes, following a similar line of thought, I recently wrote an 'all missing' dataset constructor (rather than 'empty', which I think of as meaning no variables):

import numpy as np
import xarray as xr

def all_missing_ds(coords, var_names, var_dims, var_types):
    """
    Make a dataset whose data is all missing.
    """
    # Empty dataset with appropriate coordinates
    ds = xr.Dataset(coords=coords)
    for v_name, v_dims, v_type in zip(var_names, var_dims, var_types):
        shape = tuple(ds[d].size for d in v_dims)
        if v_type == int or v_type == float:
            # Warn about up-casting int to float?
            nodata = np.tile(np.nan, shape)
        elif v_type == complex:
            # astype(complex) produces (nan + 0.0j)
            nodata = np.tile(np.nan + np.nan*1.0j, shape)
        else:
            nodata = np.tile(np.nan, shape).astype(object)
        ds[v_name] = (v_dims, nodata)
    return ds
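
For illustration, pre-allocating a dataset with this constructor might look like the following (variable and coordinate names here are made up; the helper is repeated so the snippet runs standalone):

```python
import numpy as np
import xarray as xr

def all_missing_ds(coords, var_names, var_dims, var_types):
    # same helper as above, repeated so this snippet is self-contained
    ds = xr.Dataset(coords=coords)
    for v_name, v_dims, v_type in zip(var_names, var_dims, var_types):
        shape = tuple(ds[d].size for d in v_dims)
        if v_type in (int, float):
            nodata = np.tile(np.nan, shape)
        elif v_type == complex:
            nodata = np.tile(np.nan + np.nan * 1.0j, shape)
        else:
            nodata = np.tile(np.nan, shape).astype(object)
        ds[v_name] = (v_dims, nodata)
    return ds

# hypothetical usage: a float variable 'temp' over a 2x3 grid,
# every value starting out as missing
ds = all_missing_ds(coords={'x': [100, 200], 'y': [1, 2, 3]},
                    var_names=['temp'],
                    var_dims=[('x', 'y')],
                    var_types=[float])
```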

To go with this (and this might be a separate issue), a set_value method would be helpful, just so that one does not have to remember which particular combination of

ds.sel(...).var = new_values
ds.sel(...)['var'] = new_values
ds.var.sel(...) = new_values
ds['var'].sel(...) = new_values

guarantees assigning a new value (currently only the last syntax, I believe).

shoyer commented 8 years ago

@JamesPHoughton @jcmgray For empty array creation, take a look at https://github.com/pydata/xarray/issues/277 and https://github.com/pydata/xarray/issues/878 -- this functionality would certainly be welcome.

To go with this (and this might be separate issue), a set_value method would be helpful --- just so that one does not have to remember which particular combination of...

@jcmgray Beware -- none of these are actually supported! See the big warning here in the docs. If you think a set_value method would be a better reminder than such warnings in the docs I would be totally open to it. But let's open another issue to discuss it.

jcmgray commented 8 years ago

Woops - I actually meant to put

ds['var'].loc[{...}] = new_values

in there as the one that works ... my understanding is that this is supported as long as the specified coordinates are 'nice' (according to pandas) slices/scalars.
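
Concretely, that assignment pattern looks like this (a sketch with a hypothetical 'var' variable; label-based assignment through .loc on the DataArray mutates the underlying data in place):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'var': (('x', 'y'), np.zeros((2, 3)))},
                coords={'x': [100, 200], 'y': [1, 2, 3]})

# label-based assignment via .loc; the DataArray returned by
# ds['var'] shares its data with the Dataset, so the change sticks
ds['var'].loc[{'x': 100, 'y': 2}] = 99.0
```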

And yes, default values for DataArray/Dataset would definitely fill the "create_all_missing" need.

jcmgray commented 8 years ago

@shoyer My 2 cents for how this might work after 0.8+ (auto-alignment during concat and merge, plus auto_combine, goes a long way toward solving this already) is that the compat option of merge etc. could have a 4th option 'nonnull_equals' (or better named...), with compatibility tested by e.g.

import xarray.ufuncs as xrufuncs

def nonnull_compatible(first, second):
    """ Check whether two (aligned) datasets have any conflicting non-null values. """

    # mask for where both objects are not null
    both_not_null = xrufuncs.logical_not(first.isnull() | second.isnull())

    # check remaining values are equal
    return first.where(both_not_null).equals(second.where(both_not_null))

And then fillna to combine variables. Looking now I think this is very similar to what you are suggesting in #835.
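
As a quick sanity check of the idea (written here with plain operators rather than xarray.ufuncs, which works because isnull returns boolean arrays):

```python
import numpy as np
import xarray as xr

def nonnull_compatible(first, second):
    """Check whether two (aligned) objects have any conflicting non-null values."""
    # mask for where both objects are not null
    both_not_null = ~(first.isnull() | second.isnull())
    # check the remaining (mutually non-null) values are equal
    return first.where(both_not_null).equals(second.where(both_not_null))

a = xr.DataArray([1.0, 2.0, np.nan], dims='x', coords={'x': [0, 1, 2]})
b = xr.DataArray([1.0, np.nan, 3.0], dims='x', coords={'x': [0, 1, 2]})
c = xr.DataArray([9.0, np.nan, 3.0], dims='x', coords={'x': [0, 1, 2]})

# a and b only overlap at x=0, where the values agree -> compatible
# a and c conflict at x=0 (1.0 vs 9.0) -> not compatible
```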

shoyer commented 8 years ago

@jcmgray Yes, that looks about right to me. The place to add this in would be the unique_variable function: https://github.com/pydata/xarray/blob/master/xarray/core/merge.py#L39

I would use 'notnull_equals' rather than 'nonnull_equals' just because that's the pandas term.

shoyer commented 7 years ago

Fixed by #996
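
That fix added combine_first, which covers the original request directly. For example, with the da1/da2 from the top of the thread (assuming an xarray version that includes #996):

```python
import numpy as np
import xarray as xr

da1 = xr.DataArray([[101, 102, 103, 104, 105]], dims=['x', 'y'],
                   coords={'x': [100], 'y': [1, 2, 3, 4, 5]})
da2 = xr.DataArray([[203, 204, 205, 206]], dims=['x', 'y'],
                   coords={'x': [200], 'y': [3, 4, 5, 6]})

# combine_first outer-aligns both arrays and fills missing values
# from the other, giving the NaN-padded (2, 6) result requested above
combined = da1.combine_first(da2)
```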