
Unable to load multiple WRF NetCDF files into Dask array on pangeo #5023

Closed: porterdf closed this issue 3 years ago

porterdf commented 3 years ago

(Sorry if this is not the correct place for this; the Pangeo repo is obviously another option.)

Working with @jkingslake, our immediate goal is to load many WRF history files (3-hourly in this case), currently stored in our public GCS bucket, and write them out to Zarr for public/open use. We can get all of this working on a local machine, but on us-central1-b open_mfdataset returns the following error: ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation. I suspect this is related to the non-standard, CF-noncompliant dimensions in WRF output (each model initialization has multiple forecast periods, with time units like 'minutes since 2019-12-31 00:00:00').

As this gist shows, opening a single WRF file works as expected. Is this a limitation of open_mfdataset? Alternatively, looping through each file and using xr.concat along dim='Time' works locally, but not on us-central1-b.
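
For context, a minimal sketch of the kind of open_mfdataset call that raises this error; the bucket path and engine are taken from code later in this thread, so treat them as illustrative:

import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(project='ldeo-glaciology')
paths = fs.glob('gs://ldeo-glaciology/AMPS/WRF_24/domain_02/*.nc')
files = [fs.open('gs://' + p, mode='rb') for p in paths]

# With the default combine='by_coords', xarray needs a dimension coordinate to
# order the datasets; these WRF history files only carry XTIME as a
# non-dimension coordinate on Time, so xarray raises the ValueError above.
ds = xr.open_mfdataset(files, engine='h5netcdf', parallel=True)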

xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:21:18) 
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.112+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.2
pandas: 1.2.1
numpy: 1.20.0
scipy: 1.6.0
netCDF4: 1.5.5.1
pydap: installed
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: 2.6.1
cftime: 1.4.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.0
cfgrib: 0.9.8.5
iris: None
bottleneck: 1.3.2
dask: 2021.01.1
distributed: 2021.01.1
matplotlib: 3.3.4
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20210108
pip: 20.3.4
conda: None
pytest: None
IPython: 7.20.0
sphinx: 3.4.3
shoyer commented 3 years ago

I suspect there is at least one netCDF file with inconsistent metadata, e.g., without a Time dimension. If you can find and fix that dataset (or otherwise deal with it in whatever special way is required), then that would resolve the issue. In my experience, looping through files (rather than using open_mfdataset) is definitely helpful in this regard because you can verify that each file has the expected metadata.
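
A minimal sketch of that kind of per-file check, assuming the same gcsfs setup and h5netcdf engine used elsewhere in this thread (the expected names Time and XTIME come from the WRF output discussed here):

import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(project='ldeo-glaciology')
for path in fs.glob('gs://ldeo-glaciology/AMPS/WRF_24/domain_02/*.nc'):
    with fs.open('gs://' + path, mode='rb') as f:
        ds = xr.open_dataset(f, engine='h5netcdf')
        # Report any file missing the expected Time dimension or XTIME coordinate.
        if 'Time' not in ds.dims or 'XTIME' not in ds.coords:
            print(f'Inconsistent metadata in {path}: dims={dict(ds.dims)}')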

The only reason I can imagine this behavior differing on GCP versus your workstation is if you are using inconsistent package versions.

Note: In general, for multi-file netCDF -> Zarr workflows you might check out pangeo-forge: https://github.com/pangeo-forge/pangeo-forge

porterdf commented 3 years ago

Thanks for the great suggestion @shoyer - looping through the netCDF files is working well with Dask using the following code:

import xarray as xr
import gcsfs
from tqdm.autonotebook import tqdm
xr.set_options(display_style="html");

# List the WRF history files in the GCS bucket.
fs = gcsfs.GCSFileSystem(project='ldeo-glaciology', mode='r', cache_timeout=0)
NCs = fs.glob('gs://ldeo-glaciology/AMPS/WRF_24/domain_02/*.nc')

# Open the first file lazily as the starting dataset.
openfile = fs.open('gs://' + NCs[0], mode='rb')
ds = xr.open_dataset(openfile, engine='h5netcdf', chunks={'Time': -1})

# Open each remaining file and concatenate along the Time dimension.
for i in tqdm(range(1, 8)):
    openfile = fs.open('gs://' + NCs[i], mode='rb')
    temp = xr.open_dataset(openfile, engine='h5netcdf', chunks={'Time': -1})
    ds = xr.concat([ds, temp], 'Time')

However, I am still confused why open_mfdataset was not parsing the Time dimension - the concatenated Dataset from the looping method above appears to have a time coordinate with dtype datetime64[ns].

>> ds.coords['XTIME'].compute()

<xarray.DataArray 'XTIME' (Time: 8)>
array(['2019-01-01T03:00:00.000000000', '2019-01-01T06:00:00.000000000',
       '2019-01-01T09:00:00.000000000', '2019-01-01T12:00:00.000000000',
       '2019-01-01T15:00:00.000000000', '2019-01-01T18:00:00.000000000',
       '2019-01-01T21:00:00.000000000', '2019-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
dcherian commented 3 years ago

Does it work if you pass concat_dim="XTIME"?

porterdf commented 3 years ago

Thanks @dcherian

>> ds = xr.open_mfdataset(NCs_urls, engine='netcdf4', 
                       parallel=True, 
                       concat_dim='XTIME',
                      )

ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation

So it doesn't work, but perhaps that's not surprising given that 'XTIME' is a coordinate while 'Time' is the dimension (one of WRF's quirks related to staggered grids and moving nests).

>> print(ds.coords)

Coordinates:
    XLAT     (Time, south_north, west_east) float32 dask.array<chunksize=(8, 1035, 675), meta=np.ndarray>
    XLONG    (Time, south_north, west_east) float32 dask.array<chunksize=(8, 1035, 675), meta=np.ndarray>
    XTIME    (Time) datetime64[ns] dask.array<chunksize=(8,), meta=np.ndarray>
    XLAT_U   (Time, south_north, west_east_stag) float32 dask.array<chunksize=(8, 1035, 676), meta=np.ndarray>
    XLONG_U  (Time, south_north, west_east_stag) float32 dask.array<chunksize=(8, 1035, 676), meta=np.ndarray>
    XLAT_V   (Time, south_north_stag, west_east) float32 dask.array<chunksize=(8, 1036, 675), meta=np.ndarray>
    XLONG_V  (Time, south_north_stag, west_east) float32 dask.array<chunksize=(8, 1036, 675), meta=np.ndarray>

As such, I'm following the documentation and adding a preprocess function that applies ds.swap_dims({'Time': 'XTIME'}), which works as expected (see the sketch below).
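
A minimal sketch of that approach, assuming the same file list (NCs_urls) used above; the helper name swap_time_dim is illustrative:

import xarray as xr

def swap_time_dim(ds):
    # Promote the datetime64 XTIME coordinate to be the dimension coordinate
    # used to order and concatenate the datasets.
    return ds.swap_dims({'Time': 'XTIME'})

ds = xr.open_mfdataset(NCs_urls,
                       engine='netcdf4',
                       parallel=True,
                       preprocess=swap_time_dim,
                       combine='by_coords',
                       )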

Thanks for everyone's help! Shall I close this (as it was never actually an issue)?

dcherian commented 3 years ago

Great. Thanks for following up @porterdf