pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Using xr.to_netcdf after xr.concat introduces max/min constraints based on first file #8135

Closed by MRibberink 12 months ago

MRibberink commented 12 months ago

What happened?

I'm using xarray to concatenate several netCDF files in time (ERA5 mean sea level pressure data that has to be downloaded in separate months). Because I have a lot of them, I save the concatenated result to netCDF after building it, so I can later run my analysis in a loop rather than handling each file by hand. Unfortunately, saving seems to clamp the whole timeseries to a min/max derived from the first file; the clamping is not there when the data is plotted before being passed through to_netcdf.

Data can be found here: https://www.dropbox.com/sh/xb1c2018ey83poc/AABDpQnjbWcGdtq_MwJeEEc-a?dl=0

The code below generates several diagnostic images; the one that shows the problem most clearly is the minimum-pressure plot: [image: min_pres_error]

In that plot, the blue line is just the concatenated original datasets, and the orange line is the same data saved with to_netcdf and then re-opened. The first file ends at the end of August.

The same problem causes strange artefacts in the sea level pressure fields themselves, as shown in the other plots the script creates. The first is how the field should look (just concatenated together), and the second is what it actually looks like after to_netcdf. [images: Concat and to_netcdf fields]

I also included a cross-section through the feature off the coast of Canada to show what it does when it encounters values below the minimum: values just below the minimum come out at the level of the maximum defined earlier. [image: cross-section]

I found a weird workaround: the problem is fixed by adding 0 to the dataset before passing it to to_netcdf. This still shouldn't be happening, though (and I have NO idea why that fixes it).

What did you expect to happen?

I expected to_netcdf to save the dataset accurately as it is, rather than imposing a max/min based on the first file; in essence, to reproduce the minimum-pressure line that was plotted before saving and reloading the data.

Minimal Complete Verifiable Example

import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

crs = ccrs.PlateCarree()

# Import and concat data
Lor1 = xr.open_dataset('data_62_2022.nc')
Lor2 = xr.open_dataset('data_63_2022.nc')

Lor3 = xr.concat([Lor1, Lor2], dim='time')

# Adding this line is a workaround and makes it work for some reason
# Lor3['msl'] = Lor3['msl'] + 0.0

# Save and re-import data
Lor3.to_netcdf('test/lor_4.nc')
Lor4 = xr.open_dataset('test/lor_4.nc')

# Diagnostic plots

# Minimum pressure
fig, ax = plt.subplots()
Lor3.msl.min('longitude').min('latitude').plot(label='concat')
Lor4.msl.min('longitude').min('latitude').plot(label='to_netcdf')
plt.legend()
plt.title('Minimum pressure')

# MSLP field, concat (what it should look like)
fig1 = plt.figure(dpi=150)
ax1 = plt.axes(projection=crs, frameon=True)
ax1.coastlines(resolution='50m')
Lor3.msl.isel(time=250).plot(ax=ax1)
ax1.set_title('Concat')

# MSLP field, to_netcdf (what it actually looks like)
fig2 = plt.figure(dpi=150)
ax2 = plt.axes(projection=crs, frameon=True)
ax2.coastlines(resolution='50m')
Lor4.msl.isel(time=250).plot(ax=ax2)
ax2.set_title('to_netcdf')

# Cross-section through the MSLP field
fig3, ax3 = plt.subplots()
Lor3.msl.isel(time=250).sel(latitude=42).plot(label='concat')
Lor4.msl.isel(time=250).sel(latitude=42).plot(ls='--', label='to_netcdf')
ax3.axhline(Lor1.msl.max(), color='k', ls='--', label='max file 1')
ax3.axhline(Lor1.msl.min(), color='k', ls=':', label='min file 1')
plt.legend()

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:14:58) [MSC v.1929 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 11, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('English_Canada', '1252')
libhdf5: 1.14.1
libnetcdf: 4.9.2

xarray: 2023.6.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.0
netCDF4: 1.6.4
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.13.3
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.6.0
distributed: 2023.6.0
matplotlib: 3.7.0
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.4.0
cupy: None
pint: 0.22
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 23.0
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.12.2
sphinx: 4.4.0
welcome[bot] commented 12 months ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

kmuehlbauer commented 12 months ago

@MRibberink Thanks for the report.

The most likely cause is different CF-encoding attributes on that variable (possibly packed data, i.e. scale_factor/add_offset).

You can check this with:

print(Lor1.encoding, Lor1["msl"].encoding)
print(Lor2.encoding, Lor2["msl"].encoding)

During concatenation only the encoding dict of the first object survives (as you already observed), which leads to a wrong/different encoding when writing to netCDF.

If differing cf-coding attributes are the issue, you have several options, for example dropping the inherited encoding so the data is written as plain floats, or writing with an explicitly recomputed encoding; a sketch of both is shown below.
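A minimal sketch of both options, reusing the file names from the MCVE above (the output file names and the recomputed packing parameters are illustrative assumptions, not from the original report):

import xarray as xr

Lor1 = xr.open_dataset('data_62_2022.nc')
Lor2 = xr.open_dataset('data_63_2022.nc')
Lor3 = xr.concat([Lor1, Lor2], dim='time')

# Option 1: drop the packed-data encoding inherited from the first file,
# so to_netcdf writes plain floating point values (bigger file, no clipping).
Lor3['msl'].encoding = {}
Lor3.to_netcdf('test/lor_float.nc')

# Option 2: keep int16 packing, but recompute scale_factor/add_offset so the
# packed range covers the whole concatenated series instead of only file 1.
vmin = float(Lor3['msl'].min())
vmax = float(Lor3['msl'].max())
scale_factor = (vmax - vmin) / (2**16 - 3)    # map the data onto [-32766, 32767]
add_offset = vmin + 32766 * scale_factor      # keep -32767 free for _FillValue
Lor3.to_netcdf(
    'test/lor_packed.nc',
    encoding={'msl': {'dtype': 'int16',
                      'scale_factor': scale_factor,
                      'add_offset': add_offset,
                      '_FillValue': -32767}},
)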

kmuehlbauer commented 12 months ago

It looks like my first guess was correct: we have packed data for the msl variable. Thanks for providing the data @MRibberink.

short msl(time, latitude, longitude) ;
    msl:scale_factor = 0.0755888254772405 ;
    msl:add_offset = 100995.993455587 ;
    msl:_FillValue = -32767s ;
    msl:missing_value = -32767s ;
    msl:units = "Pa" ;
    msl:long_name = "Mean sea level pressure" ;
    msl:standard_name = "air_pressure_at_mean_sea_level" ;
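With this encoding the data on disk are int16, decoded as packed * scale_factor + add_offset, so the first file's packing can only represent a fixed window of pressures. A small sketch of that window, using the numbers from the CDL above:

scale_factor = 0.0755888254772405
add_offset = 100995.993455587
lo = -32766 * scale_factor + add_offset    # treat -32766 as the smallest packed value, since -32767 is the _FillValue
hi = 32767 * scale_factor + add_offset
print(f'representable msl range: {lo:.1f} Pa to {hi:.1f} Pa')

Values in the second file that fall outside this window cannot be stored faithfully with the first file's encoding, which matches the clipping seen in the plots.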
kmuehlbauer commented 12 months ago

To answer your question: adding 0.0 to the variable turns it into a float and clears the encoding dict. This in turn writes floating-point data to netCDF, which may result in a somewhat larger file.
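A quick way to see this effect with the MCVE variables (a sketch; the exact printed contents will vary):

print(Lor3['msl'].encoding)      # still has dtype=int16, scale_factor, add_offset from file 1
Lor3['msl'] = Lor3['msl'] + 0.0  # the result of the arithmetic does not inherit the encoding...
print(Lor3['msl'].encoding)      # ...so this is now {}, and to_netcdf writes plain floats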

Please see also the discussion on future of encoding-dict here: https://github.com/pydata/xarray/issues/6323

MRibberink commented 12 months ago

Wonderful. Thank you so much!

kmuehlbauer commented 12 months ago

@MRibberink You are very welcome. So assuming you've got the answers you need, we can close this. Please do not hesitate to open follow-up issues whenever necessary.