jbusecke / xMIP

Analysis ready CMIP6 data in python the easy way with pangeo tools.
https://cmip6-preprocessing.readthedocs.io/en/latest/?badge=latest
Apache License 2.0

Revise the logic for model specific grid information #122

Open jbusecke opened 3 years ago

jbusecke commented 3 years ago

Following up on #105, I think it is worth discussing in more detail whether we can improve the way this package deals with grid setup information that can differ for each model (and is necessary for the xgcm grid setup in the grids module).

What is currently done

create_full_grid requires a dictionary that encodes the information passed as coords to xgcm.Grid. With the default value None, it loads a yaml file and checks whether the current model (source_id) has an entry there. If it does not, the function raises an error, which can cause problems for users (see #105).
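To make the current behaviour concrete, here is a minimal sketch of the lookup-or-fail logic described above. It is not the package's code: `GRID_SPECS`, `lookup_grid_spec`, the model names, and the dictionary schema are all hypothetical stand-ins for the parsed yaml file.

```python
# Hypothetical stand-in for the parsed yaml file: source_id -> axis shift info.
# The model names and the schema are illustrative, not taken from the package.
GRID_SPECS = {
    "MODEL-A": {"X": "left", "Y": "left"},
    "MODEL-B": {"X": "right", "Y": "right"},
}


def lookup_grid_spec(source_id, grid_specs=None):
    """Return the grid spec for `source_id`, raising if none is recorded."""
    if grid_specs is None:
        # Mirrors the default `None` behaviour: fall back to the default table.
        grid_specs = GRID_SPECS
    try:
        return grid_specs[source_id]
    except KeyError:
        # This is the failure mode users run into for unlisted models.
        raise ValueError(
            f"No grid configuration found for source_id {source_id!r}"
        ) from None
```

Any model missing from the table fails outright, no matter how reasonable a guess might be.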

Problems

This approach relies on manual input of the grid configuration. It would be nice if this information was encoded in the metadata conventions somehow, but I am quite convinced that the metadata in any of the CMIP6 output is not sufficient to infer this. Additionally, some of the outputs have been shipped with incorrect lon/lat values (example), which makes inferring the grid positions (as I have done until now with this notebook) problematic.

Possible Solutions

a) A short term fix could be to insert reasonable defaults here if the yaml file does not have an entry.

Question: What do folks here think about that? Can we default to e.g. an A-grid, and raise a warning? That would trade off usability against accuracy. Personally I think this is acceptable (if appropriate warnings are included).

b) A longer term fix would be to study all of the models and actually confirm the grid setup! Long term it seems that this should be part of a broader outreach effort to modelling centers in order to get more detailed grid metrics.

I think that we can work on both in parallel! a) should be a fairly easy start, but b) is a substantial long-term effort (still worth it in my opinion).
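Option a) could be sketched roughly as follows. This is only an illustration under assumptions: `A_GRID_DEFAULT`, `grid_spec_with_fallback`, and the dictionary schema are hypothetical, and the A-grid (all variables at the same position) is represented here simply as "no shift" on either axis.

```python
import warnings

# Hypothetical default: an A-grid, where all variables share one position,
# expressed here as "no shift" on either axis (schema is illustrative).
A_GRID_DEFAULT = {"X": "center", "Y": "center"}


def grid_spec_with_fallback(source_id, grid_specs, default=A_GRID_DEFAULT):
    """Return the recorded spec, or warn and return a copy of `default`."""
    if source_id in grid_specs:
        return grid_specs[source_id]
    # Unknown model: warn loudly instead of raising, then use the default.
    warnings.warn(
        f"No grid configuration for {source_id!r}; assuming {default}. "
        "Staggered-grid results may be inaccurate.",
        UserWarning,
        stacklevel=2,
    )
    return dict(default)
```

The warning keeps the accuracy trade-off visible: the call succeeds for unlisted models, but users are told that the grid layout was assumed rather than confirmed.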

jbusecke commented 3 years ago

Thinking further about this, I would also like to discuss the possibility of encoding the actual position of each variable somehow.

In particular this

some of the outputs have been shipped with incorrect lon/lat values

makes me believe that my attempt to detect grid positions of each variable (see here) might lead to undetected errors. It is also quite slow!

Since this is likely a longer-term effort, we should also think about inserting this information in the preprocessing stage via additional metadata (SGRID comes to mind, see https://github.com/xgcm/xgcm/issues/109 and https://github.com/xarray-contrib/cf-xarray/issues/220) cc @dcherian. I wonder if this sort of additional metadata would be considered a 'fix', ideally applied via the daops database (https://github.com/roocs/daops/issues/55#issuecomment-782037323) cc @agstephens. That might slim down the grid-specific logic here a lot and relegate it to either cf-xarray or xgcm, which I would welcome.
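As a rough illustration of carrying the position as metadata instead of detecting it from lon/lat values: in the sketch below, plain dicts stand in for dataset variables and their attributes. The attribute name `grid_position`, its values, and the positions assigned to each variable are made up for the example and do not follow SGRID or any existing convention.

```python
# Plain dicts stand in for dataset variables and their attributes.
# `grid_position` is a hypothetical attribute written during preprocessing;
# the positions assigned here are made up for illustration.
variables = {
    "thetao": {"grid_position": "center"},
    "uo": {"grid_position": "x_edge"},
    "vo": {},  # no position recorded
}


def position_of(name, variables):
    """Read the recorded position; return None when it was never set."""
    return variables[name].get("grid_position")


def group_by_position(variables):
    """Group variable names by recorded position, in insertion order."""
    groups = {}
    for name in variables:
        groups.setdefault(position_of(name, variables), []).append(name)
    return groups
```

Reading an attribute is a cheap lookup and does not depend on the coordinate values being correct. Variables without a recorded position end up under `None`, so they can be flagged rather than guessed.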

Very eager for your thoughts.