pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Dataset drops non-dimension coordinate information for variables #8787

Closed Sevans711 closed 6 months ago

Sevans711 commented 6 months ago

What is your issue?

I am using xarray to look at 3d data. For my first pass through the data, I want to consider only a few 2d slices through the data (e.g., slices at x=0, y=0, z=0, and x=7). However, the sliced coordinates get dropped when they are combined into a Dataset. I expected those coordinates to not be dropped. See example below:

First, here is an example 3d array:

import numpy as np
import xarray as xr
nx, ny, nz = (32, 32, 32)
dx, dy, dz = (0.1, 0.1, 0.1)
coords=dict(x=np.arange(nx)*dx, y=np.arange(ny)*dy, z=np.arange(nz)*dz)
array = xr.DataArray(np.random.random((nx,ny,nz)), coords=coords)
array
<xarray.DataArray (x: 32, y: 32, z: 32)>
0.1485 0.2564 0.6394 0.2137 0.2473 0.3328 ... 0.5234 0.3744 0.487 0.6784 0.9909
Coordinates:
  * x        (x) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
  * y        (y) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
  * z        (z) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1

We can consider some slices. Each of these slices contains the coordinate along the sliced dimension (even though it is now a non-dimension coordinate). For example:

arr_x0 = array.isel(x=0)
arr_y0 = array.isel(y=0)
arr_z0 = array.isel(z=0)
arr_x7 = array.isel(x=7)

arr_x7  # note that the x coordinate is still in this array.
<xarray.DataArray (y: 32, z: 32)>
0.1985 0.9906 0.5257 0.899 0.1885 ... 0.6922 0.6347 0.04382 0.7082 0.04385
Coordinates:
    x        float64 0.7
  * y        (y) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
  * z        (z) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1

We can also combine these slices into a Dataset:

ds = xr.Dataset(dict(var_x0=arr_x0, var_y0=arr_y0, var_z0=arr_z0, var_x7=arr_x7))
ds
<xarray.Dataset>
Dimensions:  (x: 32, y: 32, z: 32)
Coordinates:
  * x        (x) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
  * y        (y) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
  * z        (z) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
Data variables:
    var_x0   (y, z) float64 0.1485 0.2564 0.6394 0.2137 ... 0.46 0.4438 0.6033
    var_y0   (x, z) float64 0.1485 0.2564 0.6394 0.2137 ... 0.8245 0.1887 0.9125
    var_z0   (x, y) float64 0.1485 0.7618 0.4953 0.1048 ... 0.2423 0.922 0.8528
    var_x7   (y, z) float64 0.1985 0.9906 0.5257 ... 0.04382 0.7082 0.04385

However, when getting the data vars from the dataset, we can see that the non-dimension coordinate information was discarded!

ds['var_x7']  # where did the x coordinate information go???
<xarray.DataArray 'var_x7' (y: 32, z: 32)>
0.1985 0.9906 0.5257 0.899 0.1885 ... 0.6922 0.6347 0.04382 0.7082 0.04385
Coordinates:
  * y        (y) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
  * z        (z) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1

Is there any way to keep the non-dimension coordinate information? This will be important when I am trying to distinguish between different slices, especially different slices along the same dimension. In the example above, I would love to be able to tell that ds['var_x7'] is at x=0.7, and ds['var_x0'] is at x=0.0, without needing to track that information separately via clever choices for the names of the data vars.

Let me know if any of this was unclear or if I can provide more information!

max-sixty commented 6 months ago

Does arr_z0 = array.isel(z=[0]), with the list selection, help?

Sevans711 commented 6 months ago

> Does arr_z0 = array.isel(z=[0]), with the list selection, help?

Sadly, not in the way I would want. Every variable in a Dataset that contains a given dimension must have the same length along it, so all of the data vars become fully 3D and are padded with NaNs. That isn't a huge deal for my 32x32x32 example, but it will matter a lot in practice on a larger grid (e.g. 1024x1024x1024).

Example:

import numpy as np
import xarray as xr
nx, ny, nz = (32, 32, 32)
dx, dy, dz = (0.1, 0.1, 0.1)
coords=dict(x=np.arange(nx)*dx, y=np.arange(ny)*dy, z=np.arange(nz)*dz)
array = xr.DataArray(np.random.random((nx,ny,nz)), coords=coords)
array

arr_x0 = array.isel(x=[0])
arr_y0 = array.isel(y=[0])
arr_z0 = array.isel(z=[0])
arr_x7 = array.isel(x=[7])

ds = xr.Dataset(dict(var_x0=arr_x0, var_y0=arr_y0, var_z0=arr_z0, var_x7=arr_x7))

The individual arrays are fine; each has length 1 along its sliced dimension (e.g. arr_x7 is 1x32x32):

arr_x7

<xarray.DataArray (x: 1, y: 32, z: 32)>
0.4877 0.07715 0.1387 0.415 0.3065 ... 0.06881 0.2673 0.39 0.008193 0.6966
Coordinates:
  * x        (x) float64 0.7
  * y        (y) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
  * z        (z) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1

However, the data vars from the dataset are not fine; they are fully 3D and filled with NaNs everywhere except at the selected index along the sliced dimension:

ds['var_x7']  # NaNs everywhere except at x=0.7

<xarray.DataArray 'var_x7' (x: 32, y: 32, z: 32)>
nan nan nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan
Coordinates:
  * x        (x) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
  * y        (y) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1
  * z        (z) float64 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 2.6 2.7 2.8 2.9 3.0 3.1

max-sixty commented 6 months ago

Thanks, yes.

What's the result we would like here, taking into account that the x dimension can't be different for different variables in a dataset?

Sevans711 commented 6 months ago

The result I would like is for Dataset to maintain the non-dimension coordinate information from individual data vars. In the example above, I would like to have ds['var_x7'] equivalent to arr_x7, including the x=0.7 coordinate information.

I understand the need to enforce that each dimension can't be different for different variables in a dataset. This is also clearly indicated in the documentation, so it is not surprising to me.

However, I don't really understand the need to enforce that non-dimension coordinates can't be different for different variables in a dataset. It's also not clearly described in the documentation. One option is to clarify this in the documentation. E.g. something like "The dimensions and coordinates are associated directly with the Dataset. Each variable might have only some of the dimensions in the Dataset, but it cannot have a different length along any dimension it contains. It also cannot store its own coordinates; its coordinates will be those associated with the Dataset."

Another option is to change the implementation to allow each variable to have its own coordinates. Would this be something the xarray team is willing to consider (or, is there some fundamental reason why we can't use different non-dimension coordinates for each variable in a Dataset)?

Also, I'm not attached to requiring that I use a Dataset for my purposes here. The main reason I wanted to use it is so that I can re-use all my code which does (physics-motivated) arithmetic on DataArray objects. Most of this code works exactly as expected when I use Dataset objects instead; the only issue I'm having is the coordinate labels on the Dataset. If anyone has a different data structure for manipulating multiple DataArrays, I'm happy to consider that instead!
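
For illustration, one very simple alternative container is a plain dict of DataArrays. The sketch below is only an assumption about what might fit this use case, not an xarray feature; `apply_to_all` is a hypothetical helper name. Because each slice stays an independent DataArray, it keeps its own scalar coordinate:

```python
import numpy as np
import xarray as xr

nx, ny, nz = (32, 32, 32)
dx, dy, dz = (0.1, 0.1, 0.1)
coords = dict(x=np.arange(nx) * dx, y=np.arange(ny) * dy, z=np.arange(nz) * dz)
array = xr.DataArray(np.random.random((nx, ny, nz)), coords=coords)

# A plain dict keeps each slice as an independent DataArray, so each one
# retains its own scalar (non-dimension) coordinate.
slices = {
    "var_x0": array.isel(x=0),
    "var_y0": array.isel(y=0),
    "var_z0": array.isel(z=0),
    "var_x7": array.isel(x=7),
}


def apply_to_all(func, arrays):
    """Hypothetical helper: apply a DataArray -> DataArray function to every slice."""
    return {name: func(arr) for name, arr in arrays.items()}


# Stand-in for the physics-motivated arithmetic mentioned above.
doubled = apply_to_all(lambda arr: 2 * arr, slices)

# The scalar x coordinate is still attached after the arithmetic.
x_position = float(doubled["var_x7"].coords["x"])
```

The trade-off is that a dict has none of the Dataset conveniences (no shared repr, no single-call arithmetic), so every operation has to be mapped over the entries explicitly.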

max-sixty commented 6 months ago

> In the example above, I would like to have ds['var_x7'] equivalent to arr_x7, including the x=0.7 coordinate information.

~But that's on the x dimension, no?~ Edit: it's not on the x dimension, it's a non-dimension scalar coordinate. Instead, the constraint is that coords are stored on the dataset, not on individual data vars. So it's not possible to combine multiple DataArrays with different coord values; what would the result look like?

Possibly changing this into something on the attrs of the variable would work?
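
A minimal sketch of that attrs idea, reusing the example array from above. The helper `slice_with_attr` and the attribute names `slice_dim` and `slice_position` are made up for illustration:

```python
import numpy as np
import xarray as xr

nx, ny, nz = (32, 32, 32)
dx, dy, dz = (0.1, 0.1, 0.1)
coords = dict(x=np.arange(nx) * dx, y=np.arange(ny) * dy, z=np.arange(nz) * dz)
array = xr.DataArray(np.random.random((nx, ny, nz)), coords=coords)


def slice_with_attr(arr, dim, index):
    """Hypothetical helper: take one slice and record its position in attrs."""
    sliced = arr.isel({dim: index})
    position = float(sliced.coords[dim])
    # Drop the scalar coordinate and store the same information as
    # per-variable attributes instead.
    return sliced.drop_vars(dim).assign_attrs(slice_dim=dim, slice_position=position)


ds = xr.Dataset(
    dict(
        var_x0=slice_with_attr(array, "x", 0),
        var_y0=slice_with_attr(array, "y", 0),
        var_z0=slice_with_attr(array, "z", 0),
        var_x7=slice_with_attr(array, "x", 7),
    )
)

# attrs belong to each variable, so they differ between data vars.
x7_attrs = ds["var_x7"].attrs
```

One caveat: by default xarray drops attrs in arithmetic, so downstream code would likely need something like `xr.set_options(keep_attrs=True)` to carry the slice position through calculations.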

keewis commented 6 months ago

> If anyone has a different data structure for manipulating multiple DataArrays

I believe you might be able to use DataTree (from xarray-datatree, soon to be in xarray itself) for this.

Sevans711 commented 6 months ago

Thank you, the message from @keewis seems to solve this issue and is what I'm looking for! And, what I'm trying to do here does indeed work with the current implementation of DataTree.

Feel free to mark this issue as closed, unless you think we should keep discussing the idea of modifying Dataset to allow for different coordinates attached to different data vars.