Closed tessjacobson closed 5 months ago
Hi @tessjacobson. Thanks for using xMIP and reporting this issue. Apologies for the long wait on this.
I just ran this on the LEAP-Pangeo hub and got this:
I confirmed that this is introduced by `combined_preprocessing`:

```python
import xarray as xr
from xmip.preprocessing import combined_preprocessing

z_kwargs = {'consolidated': True, 'decode_times': False}  # as defined later in the thread
ds = xr.open_dataset(
    'gs://cmip6/CMIP6/CMIP/HAMMOZ-Consortium/MPI-ESM-1-2-HAM/historical/r3i1p1f1/Omon/tos/gn/v20191218/',
    engine='zarr',
    chunks={},
    **z_kwargs,
)
combined_preprocessing(ds)
```
The error message is quite hard to read, but I think I have a solution:
```python
ds_fixed = ds.copy()
for var in ds_fixed.variables:
    unit = ds_fixed[var].attrs.get('units', None)
    if isinstance(unit, int):
        print(f"{var} {unit}")
        del ds_fixed[var].attrs['units']
combined_preprocessing(ds_fixed)
```
This works! It turns out that the original dataset had a unit of `1` (the integer, not the string), which does not make a lot of sense.
This should be easily fixable. I'll consult with the pint crowd @TomNicholas @keewis: what is the best way of attack here? Is this something that I should/could change in the unit registry, or do you think it is better to just delete all integer unit attributes as above?
I think dimensionless numbers in pint are just supposed to be represented with a unit of `''`. I guess you probably could create some new definition in the registry, but if you're already changing dodgy metadata in the CMIP data, then maybe it's simplest to just change this dodgy metadata too?
Yeah, you can put `""` or `"dimensionless"` there. I believe the string `"1"` also works (but if you have an integer it will fail). Note that `"1"` appears to be explicitly mentioned in the CF conventions as valid. So if you want to fix the units, just run `str` on it?
That seems like a nice alternative to ripping it out! Thanks
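The `str`-coercion idea above can be sketched with a plain `dict` standing in for a variable's `.attrs` (the helper name `stringify_units` is hypothetical, not part of xmip or cf-xarray):

```python
def stringify_units(attrs):
    """Return a copy of an attrs dict with a non-string 'units' value cast to str.

    CF conventions accept the string "1" for dimensionless quantities,
    so casting the integer 1 to "1" keeps the metadata valid.
    """
    fixed = dict(attrs)
    if "units" in fixed and not isinstance(fixed["units"], str):
        fixed["units"] = str(fixed["units"])
    return fixed

print(stringify_units({"units": 1, "long_name": "sea surface temperature"}))
# → {'units': '1', 'long_name': 'sea surface temperature'}
```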
I just added #331, but that did not fix the issue above. I think I misdiagnosed this. It seems to be the `time_bounds` dimension that is upsetting pint.
```python
from xmip.preprocessing import combined_preprocessing
import intake
import dask

url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
query = dict(
    activity_id="CMIP",
    experiment_id="historical",
    variable_id=["tos"],
    table_id="Omon",
    source_id=['MPI-ESM-1-2-HAM'],
    member_id='r3i1p1f1',
    grid_label='gn',
)
cat_mon = col.search(**query)
z_kwargs = {'consolidated': True, 'decode_times': False}
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dset_dict = cat_mon.to_dataset_dict(zarr_kwargs=z_kwargs)
```
```python
from xmip.preprocessing import correct_units

ds = list(dset_dict.values())[0]  # the single dataset returned by the query above
ds_test = ds.drop(['time', 'time_bnds'])
correct_units(ds_test)
ds_test
```
works as intended!!! So it is the undecoded time units that cause the failure.
As a quick fix for @tessjacobson: could you just decode the times? Or was there a specific reason not to do that?
```python
from xmip.preprocessing import combined_preprocessing
import intake
import dask

url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
query = dict(
    activity_id="CMIP",
    experiment_id="historical",
    variable_id=["tos"],
    table_id="Omon",
    source_id=['MPI-ESM-1-2-HAM'],
    member_id='r3i1p1f1',
    grid_label='gn',
)
cat_mon = col.search(**query)
z_kwargs = {'consolidated': True, 'decode_times': True}
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dset_dict = cat_mon.to_dataset_dict(zarr_kwargs=z_kwargs, preprocess=combined_preprocessing)
```
works as intended.
A more high-level question for @keewis and @TomNicholas: the units upsetting pint seem to be

```
time - units: hours since 1850-01-16 12:00:00.000000 (<class 'str'>)
time_bnds - units: days since 1850-01-01 (<class 'str'>)
```

Is there a way pint-xarray could/should detect encoded time dimensions and leave them alone?
Another question for the whole group: Is #331 still generally useful? Or should I close this and wait until we actually find a wrongly typed unit in the wild?
> pint-xarray could/should detect encoded time dimensions and leave them alone?

I don't understand. How is `pint-xarray` seeing these units if `decode_times=True`? Xarray should be handling them.
> Or should I close this and wait until we actually find a wrongly typed unit in the wild?

I would happily add this to `cf_xarray.units` (https://github.com/xarray-contrib/cf-xarray/blob/main/cf_xarray/units.py).
> I don't understand. How is pint-xarray seeing these units if decode_times=True? Xarray should be handling them.

It is not. The issue here is that @tessjacobson explicitly set `decode_times=False` (which is often useful for debugging etc., and ideally should not break the preprocessing). But since xarray still 'knows' about the time dimension, I feel we should be able to just leave them as is.
@keewis is there a way to have `cf_xarray`'s unit registry ignore these time units if they exist?
I'm not sure. We'd need to be able to tell `pint` to return `None` for these, as that will tell `pint-xarray` not to do anything with that particular variable. However, since time is directly supported in `xarray` (with a unit string of `"{units} since {date}"`) I think it would be justified to do this directly in `pint-xarray` (which I would expect to be a lot easier).
See also #279
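The kind of detection described above could look something like this (a minimal sketch using only the stdlib; the helper name `is_datetime_units` and the regex are assumptions, not pint-xarray's actual implementation):

```python
import re

# CF datetime units have the form "<units> since <date>",
# e.g. "days since 1850-01-01"; pint should skip these rather
# than try to parse them as physical units.
DATETIME_UNITS = re.compile(r"^\s*\w+\s+since\s+", re.IGNORECASE)

def is_datetime_units(units):
    """Return True if a units attribute looks like an encoded CF time unit."""
    return isinstance(units, str) and bool(DATETIME_UNITS.match(units))

print(is_datetime_units("days since 1850-01-01"))            # → True
print(is_datetime_units("hours since 1850-01-16 12:00:00"))  # → True
print(is_datetime_units("degC"))                             # → False
```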
> Is #331 still generally useful? Or should I close this and wait until we actually find a wrongly typed unit in the wild?

Actually, it might be better to move this to `cf-xarray`'s preprocessors? I forgot where we had that discussion, but I do remember hearing about units of `1` before, so this appears to be somewhat common. Edit: that was in xarray-contrib/cf-xarray#238
That would cast everything to a `str` before applying anything else. Unfortunately, `pint` does the weird thing of inserting the `"%"` → `" percent"` transformation in the first place somewhere in the constructor, so we can't just insert it into the existing list of preprocessors in `cf-xarray`. However, this works:
```python
import pint

ureg = pint.UnitRegistry(preprocessors=[...])
ureg.preprocessors.insert(0, str)
ureg.parse_units(1)
```
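Why index 0 matters can be shown without pint at all (a pure-Python sketch; `run_preprocessors` is a stand-in for what the registry does internally, not pint's API):

```python
# Preprocessors run in list order before parsing, so str must sit at
# index 0 to coerce non-string units (e.g. the integer 1) before any
# string-only transforms like "%" -> " percent" run.
def run_preprocessors(value, preprocessors):
    for pre in preprocessors:
        value = pre(value)
    return value  # a real registry would parse the resulting string

percent_fix = lambda s: s.replace("%", " percent")  # string-only transform
chain = [percent_fix]
chain.insert(0, str)  # mirrors ureg.preprocessors.insert(0, str)

print(run_preprocessors(1, chain))    # → "1"
print(run_preprocessors("%", chain))  # → " percent"
```

With `str` appended at the end instead, the integer `1` would reach `percent_fix` first and raise an `AttributeError`, which is the failure mode being worked around.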
Just to clarify, I think the `1` unit was actually not a problem at all. I think I just misdiagnosed this in the beginning. So this really just is a problem with the encoded time dimensions, AFAICT.
The datetime unit issue should be fixed by the PR on `pint-xarray`, and I will make sure to issue a bugfix release very soon (I apparently neglected the project a bit regarding that). For the `1` unit, see the PR I just opened on `cf-xarray`.
Amazing. Thank you so much @keewis.
I just tested this on the LEAP-Pangeo hub from the `main` branches of cf-xarray and pint-xarray

```
pip install git+https://github.com/xarray-contrib/cf-xarray.git git+https://github.com/xarray-contrib/pint-xarray.git
```

and this ran without error:
```python
from xmip.preprocessing import combined_preprocessing
import intake
import dask

url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
query = dict(
    activity_id="CMIP",
    experiment_id="historical",
    variable_id=["tos"],
    table_id="Omon",
    source_id=['MPI-ESM-1-2-HAM'],
    member_id='r3i1p1f1',
    grid_label='gn',
)
cat_mon = col.search(**query)
z_kwargs = {'consolidated': True, 'decode_times': False}
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dset_dict = cat_mon.to_dataset_dict(zarr_kwargs=z_kwargs, preprocess=combined_preprocessing)
```
I will close this issue now. Please feel free to open again if this should not work for you @tessjacobson.
you might also want to close #279
It took me a while, but the fix in `pint-xarray` just hit PyPI.
Awesome. Thanks @keewis
I'm trying to preprocess SST data in all the historical CMIP6 runs and running into an issue with `combined_preprocessing`. This happens with any of the models/members but is shown below for a single model/member. It seems to be happening in the `correct_units` step. Using v0.21 of `pint` and v0.7.1 of `xmip`.