pydata / xarray

Keep attrs & Add a 'keep_coords' argument to Dataset.apply #688

Closed by max-sixty 12 months ago

max-sixty commented 8 years ago

Generally, coords aren't a problem with Dataset.apply, since they are carried over by the resulting DataArrays:

In [11]:

ds = xray.Dataset({
        'a':pd.DataFrame(pd.np.random.rand(10,3)),
        'b':pd.Series(pd.np.random.rand(10))
    })
ds.coords['c'] = pd.Series(pd.np.random.rand(10))
ds
Out[11]:
<xray.Dataset>
Dimensions:  (dim_0: 10, dim_1: 3)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4 5 6 7 8 9
  * dim_1    (dim_1) int64 0 1 2
    c        (dim_0) float64 0.9318 0.2899 0.3853 0.6235 0.9436 0.7928 ...
Data variables:
    a        (dim_0, dim_1) float64 0.5707 0.9485 0.3541 0.5987 0.406 0.7992 ...
    b        (dim_0) float64 0.4106 0.2316 0.5804 0.6393 0.5715 0.6463 ...
In [12]:

ds.apply(lambda x: x*2)
Out[12]:
<xray.Dataset>
Dimensions:  (dim_0: 10, dim_1: 3)
Coordinates:
    c        (dim_0) float64 0.9318 0.2899 0.3853 0.6235 0.9436 0.7928 ...
  * dim_0    (dim_0) int64 0 1 2 3 4 5 6 7 8 9
  * dim_1    (dim_1) int64 0 1 2
Data variables:
    a        (dim_0, dim_1) float64 1.141 1.897 0.7081 1.197 0.812 1.598 ...
    b        (dim_0) float64 0.8212 0.4631 1.161 1.279 1.143 1.293 0.3507 ...

But if an operation removes the coords from the DataArrays, the coords are missing from the result (notice c below). Should the Dataset retain them, either always or via a keep_coords argument similar to keep_attrs?

In [13]:

ds = xray.Dataset({
        'a':pd.DataFrame(pd.np.random.rand(10,3)),
        'b':pd.Series(pd.np.random.rand(10))
    })
ds.coords['c'] = pd.Series(pd.np.random.rand(10))
ds
Out[13]:
<xray.Dataset>
Dimensions:  (dim_0: 10, dim_1: 3)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4 5 6 7 8 9
  * dim_1    (dim_1) int64 0 1 2
    c        (dim_0) float64 0.4121 0.2507 0.6326 0.4031 0.6169 0.441 0.1146 ...
Data variables:
    a        (dim_0, dim_1) float64 0.4813 0.2479 0.5158 0.2787 0.06672 ...
    b        (dim_0) float64 0.2638 0.5788 0.6591 0.7174 0.3645 0.5655 ...
In [14]:

ds.apply(lambda x: x.to_pandas()*2)
Out[14]:
<xray.Dataset>
Dimensions:  (dim_0: 10, dim_1: 3)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4 5 6 7 8 9
  * dim_1    (dim_1) int64 0 1 2
Data variables:
    a        (dim_0, dim_1) float64 0.9627 0.4957 1.032 0.5574 0.1334 0.8289 ...
    b        (dim_0) float64 0.5275 1.158 1.318 1.435 0.7291 1.131 0.1903 ...

shoyer commented 8 years ago

I would be fine with a keep_coords argument.

I'm wary of always keeping coordinates, because some applied operations could make existing coordinates no longer valid. For example, suppose you want to use pandas's faster time-resampling, e.g., ds.apply(lambda x: x.to_pandas().resample('24H')). Any coordinates along the time dimension would no longer be valid. We could automatically align the coordinates, but that starts to get increasingly magical...
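
To illustrate the concern, here is a minimal sketch using the current xarray API (`resample(time=...)` rather than the `to_pandas()` route mentioned above). The dataset and the coordinate name `c` are made up for the example. After daily resampling the `time` dimension shrinks, so an hourly coordinate can no longer be attached as-is:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: 48 hourly samples with an extra coordinate `c` along time.
time = pd.date_range("2000-01-01", periods=48, freq="60min")
ds = xr.Dataset(
    {"a": ("time", np.arange(48.0))},
    coords={"time": time, "c": ("time", np.arange(48.0))},
)

# Resample to daily means; the time axis now has one entry per day.
daily = ds.resample(time="1D").mean()

# The original `c` has one value per hour, while the resampled time axis
# has one entry per day, so `c` cannot simply be copied onto the result.
old_len = ds["c"].sizes["time"]
new_len = daily.sizes["time"]
print(old_len, new_len)
```

Any automatic "keep" behaviour would have to decide what to do with such a coordinate (drop it, aggregate it, or realign it), which is the "magical" part.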

max-sixty commented 8 years ago

Great @shoyer, agreed

max-sixty commented 8 years ago

Also, attrs get cleared; I think they should be retained by default?

snowman2 commented 7 years ago

Are there plans for a 'keep_coords' argument for Dataset.resample as well?

shoyer commented 7 years ago

@snowman2 Possibly yes, though we would want to think through the use-cases for this first. Arguably, you should explicitly preserve coordinates in your custom callable instead.
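
A minimal sketch of preserving coordinates explicitly, assuming a recent xarray where `Dataset.apply` is spelled `Dataset.map`. The helper name `map_keep_coords` is hypothetical and not part of xarray; it re-attaches only those coordinates whose dimensions still exist with the same size, which sidesteps the invalid-coordinate case discussed earlier:

```python
import numpy as np
import xarray as xr


def map_keep_coords(ds, func):
    """Apply func to each data variable, then re-attach dropped coords.

    Hypothetical helper, not an xarray API.
    """
    result = ds.map(func)
    for name, coord in ds.coords.items():
        if name in result.coords:
            continue
        # Restore a coordinate only if all of its dimensions survived unchanged.
        if all(result.sizes.get(d) == ds.sizes[d] for d in coord.dims):
            result = result.assign_coords({name: (coord.dims, coord.values)})
    return result


ds = xr.Dataset(
    {
        "a": (("dim_0", "dim_1"), np.random.rand(10, 3)),
        "b": ("dim_0", np.random.rand(10)),
    },
    coords={"c": ("dim_0", np.random.rand(10))},
)


def func(x):
    # Stand-in for a callable that loses non-index coordinates.
    return x.reset_coords(drop=True) * 2


plain = ds.map(func)              # `c` is gone
kept = map_keep_coords(ds, func)  # `c` is restored
```

The size check is a deliberately conservative choice: a coordinate along a dimension that changed length is left out rather than guessed at.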

snowman2 commented 7 years ago

You could do it in the custom callable, but it requires less expertise and fewer lines of code to add that as an option. The use case I have is land surface model output with x,y coordinates that I would like to preserve.

shoyer commented 7 years ago

@snowman2 Can you give a concrete example of the sort of function you would want to apply?

snowman2 commented 7 years ago

I need input data for a hydrology model at an hourly timestep. So, I use the Dataset.resample method on data from land surface models to achieve that. Then, I use a custom linear interpolation to fill in the NaNs. I then write out the data to a file. It is easier to write the resampled dataset to the file with the necessary information if the x, y coordinates are not removed in the Dataset.resample method.
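
A rough sketch of that workflow with the current xarray API, assuming made-up 3-hourly data on a tiny grid. The variable name `Tair` and the coordinate values are illustrative only, and the built-in `interpolate_na` stands in for the custom interpolation:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 3-hourly field on a 2x2 grid with x/y coordinates.
time = pd.date_range("2000-01-01", periods=4, freq="180min")
values = np.array([0.0, 3.0, 6.0, 9.0])
data = np.broadcast_to(values[:, None, None], (4, 2, 2)).copy()
ds = xr.Dataset(
    {"Tair": (("time", "y", "x"), data)},
    coords={"time": time, "x": [0.0, 1.0], "y": [10.0, 20.0]},
)

# Upsample to hourly; the new in-between time steps are filled with NaN.
hourly = ds.resample(time="60min").asfreq()

# Fill the NaNs by linear interpolation along time.
filled = hourly.interpolate_na(dim="time", method="linear")
```

In this sketch the `x` and `y` coordinates are still present on `filled`, so the result can be written out without copying them back by hand.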

shoyer commented 7 years ago

@snowman2 I tried to reproduce your issue, but I couldn't make resample drop coordinates:

In [21]: ds = xarray.tutorial.load_dataset('rasm')

In [22]: ds.resample('AS', 'time', how=np.sum)
Out[22]:
<xarray.Dataset>
Dimensions:  (time: 4, x: 275, y: 205)
Coordinates:
    yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
    xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
  * time     (time) datetime64[ns] 1980-01-01 1981-01-01 1982-01-01 1983-01-01
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...

snowman2 commented 7 years ago

@shoyer, thanks for looking into it. I am resampling from 3hr data to 1hr data.

resampled_ds = ds.resample('1H', dim='time', keep_attrs=True)

I am using it here: https://github.com/CI-WATER/gsshapy/blob/f4e5cb13c1d528021e1953859b712553a4162311/gsshapy/grid/grid_to_gssha.py#L789-L844

I ran into the issue there and had to add code to make sure the coordinates were copied.

Thanks!

shoyer commented 7 years ago

@snowman2 can you print an example of what self.data looks like? And desired vs. actual output if you remove those lines to add in the coordinates manually?

snowman2 commented 7 years ago

Strange, but I can't seem to reproduce the issue. Maybe it was on a Windows machine, or maybe it is fixed now.

stale[bot] commented 5 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

max-sixty commented 12 months ago

Closing as stale