porterdf closed this issue 3 years ago
I suspect there is at least one netCDF file with inconsistent metadata, e.g., without a `Time` dimension. If you can find and fix that dataset (or otherwise deal with it in whatever special way is required), that should resolve the issue. In my experience, looping through files (rather than using `open_mfdataset`) is definitely helpful in this regard, because you can verify that each file has the expected metadata.
The only reason I can imagine this behavior differing between GCP and your workstation is if you are using inconsistent package versions.
Note: In general, for multi-file netCDF -> Zarr workflows you might check out pangeo-forge: https://github.com/pangeo-forge/pangeo-forge
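The per-file metadata check suggested above can be sketched like this. This is a toy example: the in-memory datasets stand in for `xr.open_dataset` calls on the individual files, and the variable name `T2` is illustrative; only the `Time`/`XTIME` names mirror the thread.

```python
import numpy as np
import xarray as xr

# Stand-ins for individual WRF history files (illustrative only; in the
# real workflow each element would come from xr.open_dataset on one file).
datasets = [
    xr.Dataset({"T2": (("Time",), np.arange(3.0))},
               coords={"XTIME": ("Time", np.arange(3))}),
    # a "bad" file missing the expected Time dimension entirely
    xr.Dataset({"T2": (("time",), np.arange(3.0))}),
]

# Verify each dataset has the expected metadata before concatenating.
bad = [i for i, ds in enumerate(datasets)
       if "Time" not in ds.dims or "XTIME" not in ds.coords]
print("indices with inconsistent metadata:", bad)  # -> [1]
```

Running this kind of loop over the real files pinpoints exactly which one breaks `open_mfdataset`.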
Thanks for the great suggestion @shoyer - looping through the netCDF files works well with Dask using the following code:

```python
import xarray as xr
import gcsfs
from tqdm.autonotebook import tqdm

xr.set_options(display_style="html")

fs = gcsfs.GCSFileSystem(project='ldeo-glaciology', mode='r', cache_timeout=0)
NCs = fs.glob('gs://ldeo-glaciology/AMPS/WRF_24/domain_02/*.nc')

# open the first file, then concatenate the rest along 'Time'
url = 'gs://' + NCs[0]
openfile = fs.open(url, mode='rb')
ds = xr.open_dataset(openfile, engine='h5netcdf', chunks={'Time': -1})
for i in tqdm(range(1, 8)):
    url = 'gs://' + NCs[i]
    openfile = fs.open(url, mode='rb')
    temp = xr.open_dataset(openfile, engine='h5netcdf', chunks={'Time': -1})
    ds = xr.concat([ds, temp], 'Time')
```
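As a side note, growing `ds` inside the loop re-concatenates on every iteration; collecting the pieces in a list and calling `xr.concat` once is a bit cheaper and more idiomatic. A minimal sketch, with synthetic in-memory datasets standing in for the opened files (variable name `T2` is illustrative):

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for xr.open_dataset(fs.open(url), ...) on each file.
pieces = [
    xr.Dataset({"T2": (("Time",), np.full(2, float(i)))},
               coords={"XTIME": ("Time", np.arange(2 * i, 2 * i + 2))})
    for i in range(4)
]

# Concatenate once along Time instead of rebuilding the combined
# dataset on every loop iteration.
ds = xr.concat(pieces, dim="Time")
print(ds.sizes["Time"])  # 8
```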
However, I am still confused why `open_mfdataset` was not parsing the `Time` dimension - the Dataset concatenated with the looping method above appears to have a time coordinate with a proper datetime64[ns] dtype:
```python
>>> ds.coords['XTIME'].compute()
<xarray.DataArray 'XTIME' (Time: 8)>
array(['2019-01-01T03:00:00.000000000', '2019-01-01T06:00:00.000000000',
       '2019-01-01T09:00:00.000000000', '2019-01-01T12:00:00.000000000',
       '2019-01-01T15:00:00.000000000', '2019-01-01T18:00:00.000000000',
       '2019-01-01T21:00:00.000000000', '2019-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
```
Does it work if you pass `concat_dim="XTIME"`?
Thanks @dcherian

```python
>>> ds = xr.open_mfdataset(NCs_urls, engine='netcdf4',
                           parallel=True,
                           concat_dim='XTIME',
                           )
ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation
```
So it doesn't work, but perhaps that's not surprising given that 'XTIME' is a coordinate while 'Time' is the dimension (one of WRF's quirks related to staggered grids and moving nests).
```python
>>> print(ds.coords)
Coordinates:
    XLAT     (Time, south_north, west_east)       float32 dask.array<chunksize=(8, 1035, 675), meta=np.ndarray>
    XLONG    (Time, south_north, west_east)       float32 dask.array<chunksize=(8, 1035, 675), meta=np.ndarray>
    XTIME    (Time)                               datetime64[ns] dask.array<chunksize=(8,), meta=np.ndarray>
    XLAT_U   (Time, south_north, west_east_stag)  float32 dask.array<chunksize=(8, 1035, 676), meta=np.ndarray>
    XLONG_U  (Time, south_north, west_east_stag)  float32 dask.array<chunksize=(8, 1035, 676), meta=np.ndarray>
    XLAT_V   (Time, south_north_stag, west_east)  float32 dask.array<chunksize=(8, 1036, 675), meta=np.ndarray>
    XLONG_V  (Time, south_north_stag, west_east)  float32 dask.array<chunksize=(8, 1036, 675), meta=np.ndarray>
```
As such, I'm following the documentation and adding a preprocessor that calls `ds.swap_dims({'Time': 'XTIME'})`, which works as expected.
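For reference, the swap can be sketched on a toy dataset. The data and the variable name `T2` are synthetic; only the `Time`/`XTIME` names mirror the WRF conventions discussed above:

```python
import numpy as np
import xarray as xr

# A toy dataset with WRF-style metadata: 'Time' is the dimension,
# 'XTIME' is the datetime coordinate along it.
ds = xr.Dataset(
    {"T2": (("Time",), np.arange(3.0))},
    coords={"XTIME": ("Time",
                      np.array(["2019-01-01T03", "2019-01-01T06",
                                "2019-01-01T09"], dtype="datetime64[ns]"))},
)

# Make XTIME the dimension coordinate so concatenation can order
# the datasets along it.
swapped = ds.swap_dims({"Time": "XTIME"})
print(list(swapped.dims))  # ['XTIME']
```

In `open_mfdataset` this would go through the `preprocess` argument, e.g. `preprocess=lambda ds: ds.swap_dims({'Time': 'XTIME'})`, so each file is swapped before the datasets are combined.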
Thanks for everyone's help! Shall I close this? (As it was never actually an issue.)
Great. Thanks for following up @porterdf
(Sorry if this is not the correct place for this; obviously the Pangeo repo is another option.)
Working with @jkingslake, our immediate goal is to load many WRF history files (3-hourly in this case), currently stored in our public GCS bucket, and write them out to Zarr for public/open use. We can get all of this working on a local machine, but on us-central1-b `open_mfdataset` returns the following error: `ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation`. I suspect this is related to non-standard, CF-incompliant dimensions in WRF (each model initialization will have multiple forecast periods, i.e. 'minutes since 2019-12-31 00:00:00'). As this gist shows, opening a single WRF file works as expected. Is this a limitation of `open_mfdataset`? Alternatively, looping through each file and using `xr.concat` along dim='Time' works locally, but not on us-central1-b.