pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

open_mfdataset cannot open multiple netcdf files written by NASA/GEOS MAPL v1.0.0 that contain data on a cubed-sphere grid #3286

Closed by yantosca 5 years ago

yantosca commented 5 years ago

MCVE Code Sample

First download these files:

Then run this code:

import xarray as xr
filelist = ['GCHP.SpeciesConc.20160716_1200z.nc4', 'GCHP.AerosolMass.20160716_1200z.nc4']
ds = xr.open_mfdataset(filelist)
print(ds)

Expected Output

This should load data from both files into a single xarray Dataset object and print its contents.

Problem Description

Instead, this error occurs:

  File "./run_1mo_benchmark.py", line 479, in <module>
    ds = xr.open_mfdataset(filelist)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/backends/api.py", line 719, in open_mfdataset
    ids=ids)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/combine.py", line 553, in _auto_combine
    data_vars=data_vars, coords=coords)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/combine.py", line 475, in _combine_nd
    compat=compat)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/combine.py", line 493, in _auto_combine_all_along_first_dim
    data_vars, coords)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/combine.py", line 514, in _auto_combine_1d
    merged = merge(concatenated, compat=compat)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/merge.py", line 532, in merge
    variables, coord_names, dims = merge_core(dict_like_objects, compat, join)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/merge.py", line 451, in merge_core
    variables = merge_variables(expanded, priority_vars, compat=compat)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/merge.py", line 170, in merge_variables
    merged[name] = unique_variable(name, var_list, compat)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/merge.py", line 90, in unique_variable
    % (name, out, var))
xarray.core.merge.MergeError: conflicting values for variable 'anchor' on objects to be combined:
first value: <xarray.Variable (nf: 6, ncontact: 4)>
dask.array<shape=(6, 4, 4), dtype=int32, chunksize=(6, 4, 4)>
Attributes:
    long_name:  anchor point
second value: <xarray.Variable (nf: 6, ncontact: 4)>
dask.array<shape=(6, 4, 4), dtype=int32, chunksize=(6, 4, 4)>
Attributes:
    long_name:  anchor point

It seems to get hung up on trying to merge the "anchor" variable. As a workaround, if I drop the "anchor" variable from both datasets before calling xr.open_mfdataset, the merge works properly.
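A minimal sketch of that workaround on synthetic stand-in datasets (the variable names and shapes below are illustrative, not read from the real GCHP files; the real `anchor` has shape (6, 4, 4) with a repeated `ncontact` dimension, which the 2-D stand-in avoids):

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for the two GCHP files: each carries its own science
# variable plus a shared "anchor" variable that trips up the merge.
ds1 = xr.Dataset({
    "SpeciesConc": ("lev", np.arange(3.0)),
    "anchor": (("nf", "ncontact"), np.zeros((6, 4), dtype=np.int32)),
})
ds2 = xr.Dataset({
    "AerosolMass": ("lev", np.arange(3.0)),
    "anchor": (("nf", "ncontact"), np.zeros((6, 4), dtype=np.int32)),
})

# Drop the problem variable from each dataset, then merge the rest.
# (In older xarray, e.g. 0.12, the method is spelled ds.drop rather
# than ds.drop_vars.)
merged = xr.merge([ds1.drop_vars("anchor"), ds2.drop_vars("anchor")])
print(sorted(merged.data_vars))  # ['AerosolMass', 'SpeciesConc']
```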

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.12.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.2
libnetcdf: 4.6.1

xarray: 0.12.1
pandas: 0.25.1
numpy: 1.16.4
scipy: 1.3.1
netCDF4: 1.4.2
pydap: None
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.3.0
distributed: 2.3.2
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 41.2.0
pip: 19.2.2
conda: None
pytest: 4.2.0
IPython: 7.7.0
sphinx: 2.1.2
dcherian commented 5 years ago

Is anchor equal in the two files or does it need to be concatenated?

yantosca commented 5 years ago

They appear to be the same:

import xarray as xr
ds1 = xr.open_dataset('GCHP.SpeciesConc.20160716_1200z.nc4')
ds2 = xr.open_dataset('GCHP.AerosolMass.20160716_1200z.nc4')
print(ds1['anchor'].values - ds2['anchor'].values)

Which gives:

[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]]
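A more direct equality check that never touches xarray's broadcasting machinery is to compare the raw numpy arrays (a sketch on zero-filled stand-ins shaped like the real variable; in practice the two arrays would be `ds1['anchor'].values` and `ds2['anchor'].values`):

```python
import numpy as np

# Stand-ins shaped like the 'anchor' variable, (6, 4, 4).
a = np.zeros((6, 4, 4), dtype=np.int32)
b = np.zeros((6, 4, 4), dtype=np.int32)

# np.array_equal compares shape and contents on the raw arrays, so it
# works even when the arrays come from duplicate-dimension variables.
print(np.array_equal(a, b))  # True
```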

I am not 100% sure what this "anchor" variable represents. It was added when NASA updated MAPL to v1.0.0, and it seems to have something to do with the cubed-sphere coordinates. But for the purposes of plotting and analyzing the data we don't need it.

If it helps: I also get an error if I try to subtract the DataArrays from each other directly, instead of subtracting the numpy ndarray values:

import xarray as xr
ds1 = xr.open_dataset('GCHP.SpeciesConc.20160716_1200z.nc4')
ds2 = xr.open_dataset('GCHP.AerosolMass.20160716_1200z.nc4')
dr = ds1['anchor'] - ds2['anchor']
print(dr)

Which gives:

Traceback (most recent call last):
  File "./run_1mo_benchmark.py", line 481, in <module>
    dr = ds1['anchor'] - ds2['anchor']
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo   /lib/python3.6/site-packages/xarray/core/dataarray.py", line 2009, in func
    if not reflexive
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/variable.py", line 1767, in func
    self_data, other_data, dims = _broadcast_compat_data(self, other)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/variable.py", line 2043, in _broadcast_compat_data
    new_self, new_other = _broadcast_compat_variables(self, other)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/variable.py", line 2018, in _broadcast_compat_variables
    dims = tuple(_unified_dims(variables))
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/variable.py", line 2001, in _unified_dims
    'dimensions: %r' % list(var_dims))
ValueError: broadcasting cannot handle duplicate dimensions: ['nf', 'ncontact', 'ncontact']

If two arrays are identical like this, is there a way to manually tell open_mfdataset not to broadcast them, but to use the same values?

dcherian commented 5 years ago

It looks like anchor has a repeated dimension name. This is not well supported in xarray. See https://github.com/pydata/xarray/issues/1378. If you don't need it, then I think it's best to drop it.
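One way to spot such variables up front is to scan each variable's dimension tuple for repeats (a sketch; the helper name is mine, not part of xarray's API):

```python
import numpy as np
import xarray as xr

def vars_with_duplicate_dims(ds):
    """Return names of variables whose dimension tuple repeats a name,
    e.g. ('nf', 'ncontact', 'ncontact') for GCHP's 'anchor' variable."""
    return [name for name, v in ds.variables.items()
            if len(set(v.dims)) != len(v.dims)]

# A dataset whose variables all use distinct dimension names passes cleanly:
ds = xr.Dataset({"temp": (("x", "y"), np.zeros((2, 3)))})
print(vars_with_duplicate_dims(ds))  # []
```

Any names this returns are candidates for dropping before combining the files.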

yantosca commented 5 years ago

Thanks again. I will implement a workaround to drop it (probably a wrapper function that calls open_mfdataset). Good to know.

jhamman commented 5 years ago

You should be able to do that with the drop_variables kwarg to open_mfdataset. If that doesn't work, the preprocess kwarg is the Swiss Army Knife here: it lets you pass in a custom function that will be applied to each dataset before the datasets are combined.

dcherian commented 5 years ago

You can use the drop_variables kwarg. This is passed down to open_dataset. For more general manipulation, you can use the preprocess argument.
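For example (a sketch; `drop_anchor` is a hypothetical helper name for the preprocess route, and `filelist` is the list from the original report):

```python
import xarray as xr

def drop_anchor(ds):
    """preprocess hook: applied to each file's dataset before combining."""
    # errors="ignore" keeps this safe for files that lack an 'anchor' variable.
    return ds.drop_vars("anchor", errors="ignore")

# Either form avoids the MergeError:
# ds = xr.open_mfdataset(filelist, drop_variables="anchor")
# ds = xr.open_mfdataset(filelist, preprocess=drop_anchor)
```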

dcherian commented 5 years ago

@jhamman Hahaha.

yantosca commented 5 years ago

Thanks, I'll check it out. Wasn't aware of drop_variables.

crusaderky commented 5 years ago

Closing as a duplicate of #1378