icenet-ai / icenet-roadmap

Central repository for the Turing-British Antarctic Survey IceNet Project
MIT License

ERA5 oddness in TAS #30

Closed by JimCircadian 2 years ago

JimCircadian commented 2 years ago

Somehow the Antarctic TAS dataset ended up with extraneous variables not present elsewhere. This could be a side effect of shifting the downloader to the toolbox, but the problem wasn't present outside the 1990-1999 range.

The following fixed the issue, but we should check and ensure the other datasets don't need stripping.

Process to remove the lambert variable and rename t2m to tas:

('./data/era5/sh/tas/1999/1999_01_01.nc', './data/era5/sh/tas/1999/1999_12_31.nc')
>>> xr.open_dataset("./data/era5/sh/tas/1999/1999_01_01.nc")
<xarray.Dataset>
Dimensions:                       (time: 1, yc: 432, xc: 432)
Coordinates:
  * time                          (time) datetime64[ns] 1999-01-01
  * yc                            (yc) float64 5.388e+06 ... -5.388e+06
  * xc                            (xc) float64 -5.388e+06 ... 5.388e+06
Data variables:
    t2m                           (time, yc, xc) float32 ...
    lambert_azimuthal_equal_area  int32 ...
Attributes:
    Conventions:  CF-1.7
>>> xr.open_dataset("./data/era5/sh/tas/2000/2000_01_01.nc")
<xarray.Dataset>
Dimensions:  (time: 1, yc: 432, xc: 432)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01
  * yc       (yc) float64 5.388e+06 5.362e+06 ... -5.362e+06 -5.388e+06
  * xc       (xc) float64 -5.388e+06 -5.362e+06 ... 5.362e+06 5.388e+06
Data variables:
    tas      (time, yc, xc) float32 ...
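To follow up on "should check and ensure the datasets don't need stripping", a quick sanity-check sketch (my suggestion, not part of the original fix; `find_unstripped` is a hypothetical helper assuming the same `./data/era5/sh/tas/<year>/<date>.nc` layout) that scans the archive for files still carrying the grid-mapping variable or the old name:

```python
import glob
import os

import xarray as xr


def find_unstripped(base_path="./data/era5/sh/tas/"):
    """Return paths of daily files that still carry the grid-mapping
    variable or the old t2m name."""
    bad = []
    for path in sorted(glob.glob(os.path.join(base_path, "*", "*.nc"))):
        with xr.open_dataset(path) as ds:
            if "lambert_azimuthal_equal_area" in ds.variables \
                    or "t2m" in ds.data_vars:
                bad.append(path)
    return bad


for path in find_unstripped():
    print(path)
```

An empty result would mean the whole archive is consistent with the post-1999 files.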

import os

import pandas as pd
import xarray as xr

base_path = "./data/era5/sh/tas/"

# Move the affected daily files aside so the fixed files can be
# written back to the original paths.
files = []
for dt in pd.date_range("1990-01-01", "1999-12-31"):
    path = os.path.join(base_path, str(dt.year), dt.strftime("%Y_%m_%d.nc"))
    temp_path = os.path.join(base_path, str(dt.year), dt.strftime("temp.%Y_%m_%d.nc"))
    print("{} -> {}".format(path, temp_path))
    os.rename(path, temp_path)
    files.append(temp_path)

# Open the decade as one dataset, drop the grid-mapping variable and
# rename t2m to match the rest of the archive.
ds = xr.open_mfdataset(files, concat_dim="time", combine="nested", parallel=True)
ds = ds.drop_vars(["lambert_azimuthal_equal_area"])
ds = ds.rename_vars(dict(t2m="tas"))

# Write each day back out to its original path.
for dt in ds.time.values:
    day = pd.to_datetime(dt)
    daily_path = os.path.join(base_path, str(day.year), day.strftime("%Y_%m_%d.nc"))
    print(daily_path)
    ds.sel(time=slice(day, day)).to_netcdf(daily_path)

# Clean up the renamed originals.
ds.close()
for file in files:
    print(file)
    os.unlink(file)
tom-andersson commented 2 years ago

Weird that there's this inconsistency between 1999 and 2000...

FYI a hack I used to just open the first ERA5 data variable in the NetCDF dataset is the following:

with xr.open_dataset(fpath) as ds:
    da = next(iter(ds.data_vars.values()))

In practice I've found this means the lambert_... var always gets ignored, and you don't need to know the name of the data_var of interest - just the fpath. It's definitely VERY dodgy though!

JimCircadian commented 2 years ago

No worries @tom-andersson, all reprocessed and updated. I need to verify that this is an artifact from switching between API and toolbox. I was just surprised it didn't fall over during regridding...

JimCircadian commented 2 years ago

Gist here for doing the whole lot

JimCircadian commented 2 years ago

Looking at the datasets as videos, this appears to have been rectified. There was definitely an issue with regrid processing that has also been solved recently, so closing this. Future reproducible issues will be contained in the icenet2 repository for this project.