pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

'Time' dimensions not consistently saved when using 'to_netcdf' #5552

Open michaelavs opened 3 years ago

michaelavs commented 3 years ago

What happened: When using a for loop to subset 3 datasets, the second dataset keeps a Time dimension of size 1, while the first and third preserve the full Time dimension of size 4. However, when subsetting the datasets individually (i.e. not in a for loop), the Time dimension is preserved for each file. I'm not sure if this is a bug or user error, but it could be related to #5549.

What you expected to happen: Expected all three files to be subset the same way, with the same dimensions for each variable.

Minimal Complete Verifiable Example: Python code for the subsetting for loop:

import os
import glob

import xarray as xr

wrf_path = '/Users/misi1684/wrf_python_tutorial/wrf_tutorial_data'
os.chdir(wrf_path)
wrf_file0 = glob.glob('wrfout*')  # match only the original wrfout files, not previously written subsets

toinclude = ['HGT', 'QVAPOR', 'QRAIN', 'QCLOUD', 'P', 'PB', 'PHB', 'PH', 'T', 'U', 'V', 'W']

for i in range(3):
    wrf_file = wrf_file0[i]
    wrf_out = xr.open_dataset(wrf_file, decode_times=False)
    # print(wrf_out)
    # Subset to the listed variables and write the result to a new netCDF file;
    # to_netcdf returns None, so its return value is not kept
    wrf_out[toinclude].to_netcdf('wrf_output_subset' + str(i) + '.nc')

Anything else we need to know?: In case you are not familiar with wrfout files, they are ordinary netCDF files. I generally use netCDF4's Dataset to read these files, but netCDF4 does not have the same subsetting capabilities as xarray. If you need more information or printouts from these datasets, please let me know!
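For reference, a quick way to verify what was actually written (a sketch, assuming the wrf_output_subset*.nc files produced by the loop above are in the working directory):

import glob
import xarray as xr

# Report the size of the Time dimension in each subset file
for path in sorted(glob.glob('wrf_output_subset*.nc')):
    with xr.open_dataset(path, decode_times=False) as ds:
        print(path, 'Time size:', ds.sizes.get('Time'))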

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 06:39:48) [Clang 11.1.0]
python-bits: 64
OS: Darwin
OS-release: 19.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.3.0
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: None
pytest: None
IPython: 7.24.0
sphinx: None
dcherian commented 3 years ago

Hello,

It is hard to help without a reproducible example. Can you make one? Some reprs would help too... are different variables associated with different time dimensions?
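For example, a quick check along these lines (a sketch, assuming one of the datasets is open as ds) would show which dimensions each variable carries:

# Hypothetical check: list the dimensions attached to each variable
for name, var in ds.variables.items():
    print(name, var.dims)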

michaelavs commented 3 years ago

Hi @dcherian, here is the full code, with a version that works and a version that triggers the bug. If you are interested in running this code locally, the data files can be found here under 'wrf_tutorial_data'. If you do clone that repository, make sure to use a recursive clone (`git clone --recursive`) rather than a standard `git clone`, because of the datafiles directory.

import os
import glob
import numpy as np

import matplotlib.pyplot as plt
from netCDF4 import Dataset
import xarray as xr
from wrf import (getvar, ALL_TIMES)

wrf_path = '/Users/misi1684/wrf_python_tutorial/wrf_tutorial_data'
os.chdir(wrf_path)
wrf_file0 = glob.glob('wrfout*')

# Read in the data sets to be subset then export them as new nc files
toinclude = ['HGT', 'QVAPOR', 'QRAIN', 'QCLOUD', 'P', 'PB', 'PHB', 'PH', 'T', 'U', 'V', 'W']
for i in range(3):
    # print(wrf_file0[i])
    wrf_file = wrf_file0[i]
    wrf_out = xr.open_dataset(wrf_file, decode_times=False)
    # print(wrf_out) # Shows that every original file has a Time dimension of size 4
    # Subset to the listed variables and write the result to a new netCDF file
    wrf_out[toinclude].to_netcdf('wrf_output_subset' + str(i) + '.nc')

WRF_FILES = glob.glob('wrf_output*')
# print(WRF_FILES)
_WRF_FILES = [os.path.abspath(os.path.expanduser(
    os.path.join(wrf_path, f))) for f in WRF_FILES]

def multiple_wrf_files():
    global _WRF_FILES
    return _WRF_FILES

file_paths = multiple_wrf_files()
wrf_files = [Dataset(f) for f in file_paths] # Three files read in
# print(wrf_files[0]) # prints the data from just the first subset file (wrf_output_subset0.nc)
# print(wrf_files) # prints the data from all three subset files

slp = [getvar(wrf_files, 'slp', timeidx=i, units='mb') for i in range(0,3)]
# print(slp[2]) # Prints the data for the last time step; valid timeidx values are 0, 1, 2

# Show the inconsistent Time dimension across the subset files
wrf_0 = wrf_files[0]
wrf_1 = wrf_files[1]
wrf_2 = wrf_files[2]
print("Subset0", wrf_0.dimensions['Time'])
print("Subset1",wrf_1.dimensions['Time'])
print("Subset2",wrf_2.dimensions['Time']) # size is 1 while other two have size 4

# The issue comes when using 'timeidx=ALL_TIMES', which is the same as 'timeidx=None'
# The default for this function is timeidx=0, which also works, as seen above

# slp = getvar(wrf_files, 'slp', timeidx=ALL_TIMES, units='mb')

I do want to say that my suspicion about what the problem is has changed slightly. In the tutorial from 2021, I showed attendees how to use xarray to subset single wrfout files in case they only want the bare minimum variables needed from the file (to keep file sizes down). The user who contacted me about this issue was using the subsetting code with multiple files, which then returned the error "ValueError: coordinate 'Time' is a DataArray dimension, but it has shape () rather than expected shape 4 matching the dimension size" when running the getvar function with ALL_TIMES.

Now, when I subset one file, the getvar function is fine; when I use getvar on multiple un-subset wrfout files with ALL_TIMES, the function is also fine. But for some reason, when multiple files are subset, that error is thrown, and I can't figure out why. Again, I'm not sure if this is user error or a bug in the code. Something may be happening with the metadata when multiple files are read in, which causes the subset file to drop its Time dimension, but I really can't place where along the line the true issue is happening.
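A diagnostic that may help localize where the dimension is lost (a sketch, assuming the original wrfout files and the subset outputs are both in the working directory):

import glob
import xarray as xr

# Pair each original file with its subset counterpart; sorted() makes the
# pairing deterministic, since glob.glob returns files in arbitrary order
originals = sorted(glob.glob('wrfout*'))
subsets = sorted(glob.glob('wrf_output_subset*.nc'))
for orig, sub in zip(originals, subsets):
    with xr.open_dataset(orig, decode_times=False) as a, \
         xr.open_dataset(sub, decode_times=False) as b:
        print(orig, a.sizes.get('Time'), '->', sub, b.sizes.get('Time'))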