ua-snap / cmip6-utils

Pipelines and utilities for working with CMIP6 data

Standardize .nc attributes #26

Closed Joshdpaul closed 6 months ago

Joshdpaul commented 7 months ago

This PR closes #20 and closes #18

New lookup tables have been added to indicators/luts.py containing long-form names and descriptions for indicators, models, and scenarios. This info is used in a new build_attrs() function in indicators.py, which creates standardized global and variable/coordinate attribute dictionaries. Minimum and maximum values for latitude, longitude, and years are pulled directly from the dataset, dataset units are used to determine the indicator fill value, and the remaining values are hardcoded in luts.py.
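To make the idea concrete, here is a minimal sketch of how such a function could assemble the dictionaries. It is not the code from this PR: the lookup table names, their keys, the function signature, and the units-to-fill-value rule are all assumptions made for illustration, and only the attribute values are taken from the example output in this description.

```python
from datetime import datetime

# Hypothetical stand-ins for the lookup tables in luts.py; the real table
# names and key layout are not shown in this PR description.
INDICATORS = {
    "dw": {
        "title": "Yearly Number of Deep Winter Days (-30C threshold)",
        "long_name": "yearly_deep_winter_days_-30C",
        "description": (
            "Number of Deep Winter Days, calculated over a yearly frequency "
            "with a daily minimum temperature threshold of -30C using "
            "xclim.indices.tn_days_below()."
        ),
    }
}
MODELS = {
    "GFDL-ESM4": {
        "institution": "NOAA-GFDL",
        "institution_name": (
            "National Oceanic and Atmospheric Administration, "
            "Geophysical Fluid Dynamics Laboratory"
        ),
    }
}
SCENARIOS = {"ssp585": {"ssp": "ssp5", "forcing_level": "8.5"}}

# Assumed rule: day counts get an integer fill value, anything else gets NaN.
FILL_VALUE_BY_UNITS = {"d": "-9999"}


def build_attrs(indicator, model, scenario, units, lats, lons, years):
    """Return (global_attrs, variable_attrs) built from LUTs and data ranges."""
    start_year, end_year = str(min(years)), str(max(years))
    lut = INDICATORS[indicator]

    global_attrs = {
        "title": f"{lut['title']}, {start_year}-{end_year}: {model}-{scenario}",
        "creation_date": datetime.now().strftime("%m/%d/%Y, %H:%M:%S"),
    }
    variable_attrs = {
        # Min/max values come from the data itself, not from the LUTs.
        "lat": {
            "name": "latitude",
            "units": "degrees north",
            "fill_value": "NaN",
            "lat_max": str(max(lats)),
            "lat_min": str(min(lats)),
        },
        "lon": {
            "name": "longitude",
            "units": "degrees east",
            "fill_value": "NaN",
            "lon_max": str(max(lons)),
            "lon_min": str(min(lons)),
        },
        "year": {"start_year": start_year, "end_year": end_year},
        "scenario": {"id": scenario, **SCENARIOS[scenario]},
        "model": {"model": model, **MODELS[model]},
        indicator: {
            "long_name": lut["long_name"],
            "units": units,
            "fill_value": FILL_VALUE_BY_UNITS.get(units, "NaN"),
            "description": lut["description"],
        },
    }
    return global_attrs, variable_attrs
```

Returning plain dictionaries keeps the attribute-building step independent of how the dataset is stored, so the same output can be checked before anything is written to disk.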

Another new function, find_and_replace_attrs(), overwrites the original attribute dictionaries in the individual indicator dataset before writing to disk. This function isn't just a blind overwrite: we check that every element in the list of variables/coordinates from the original dataset is also found in the new attribute dictionary. The exception is height, which is removed wherever it is encountered.
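The check described above can be sketched with plain dictionaries. This is an illustration only, not the function from this PR: the signature is hypothetical, the real function works on the dataset rather than on dictionaries, and raising ValueError on a missing entry is an assumption about how a failed check is handled.

```python
def find_and_replace_attrs(original_attrs, new_attrs, drop=("height",)):
    """Return standardized attributes for every variable in the original.

    original_attrs and new_attrs both map variable/coordinate names to
    attribute dictionaries. Names listed in drop are removed outright.
    """
    # Drop unwanted variables first so they are excluded from the check.
    kept = [name for name in original_attrs if name not in drop]

    # Not a blind overwrite: every remaining original name must have a
    # standardized replacement.
    missing = [name for name in kept if name not in new_attrs]
    if missing:
        raise ValueError(f"No standardized attributes for: {missing}")

    # Copy each dictionary so later edits do not mutate the inputs.
    return {name: dict(new_attrs[name]) for name in kept}
```

Failing loudly on an unmatched name means a newly added variable cannot silently keep its original, non-standard attributes.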

TO TEST:

Run the prefect flow to compute some indicators as usual, being sure to select the fix_attrs branch. Use any variety of indicator/model/scenario you wish. Then check the attributes of your outputs, and make sure the values are reasonable, there aren't any typos or errors in the descriptive text, etc. Please note any suggestions or improvements, keeping in mind a user who is generally unfamiliar with this data or received this file with little to no additional information. What else can we include? Feel free to ticket any ideas!

I used the following two methods to view the attributes, but you can check them any way you like.
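From Python, one way to do this is sketched below, assuming the output file is opened with xarray. The helper name collect_attrs and the file path are hypothetical; the helper only relies on the attrs and variables properties that xarray datasets expose.

```python
def collect_attrs(ds):
    """Return (global_attrs, {variable_name: attrs}) for an xarray-like dataset."""
    global_attrs = dict(ds.attrs)
    variable_attrs = {name: dict(var.attrs) for name, var in ds.variables.items()}
    return global_attrs, variable_attrs


# Usage with xarray (the path is a placeholder):
#
#   import xarray as xr
#
#   with xr.open_dataset("path/to/indicator.nc") as ds:
#       global_attrs, variable_attrs = collect_attrs(ds)
#       print(global_attrs)
#       for attrs in variable_attrs.values():
#           print(attrs)
```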

which should yield:

{'title': 'Yearly Number of Deep Winter Days (-30C threshold), 2015-2100: GFDL-ESM4-ssp585',
 'author': 'Scenarios Network for Alaska and Arctic Planning (SNAP), International Arctic Research Center, University of Alaska Fairbanks',
 'creation_date': '02/20/2024, 13:37:37',
 'email': 'uaf-snap-data-tools@alaska.edu',
 'website': 'https://uaf-snap.org/',
 'references': 'A list of references TBD!'}

and

{'name': 'latitude', 'units': 'degrees north', 'fill_value': 'NaN', 'lat_max': '90.0', 'lat_min': '50.41884816753927'}
{'name': 'longitude', 'units': 'degrees east', 'fill_value': 'NaN', 'lon_max': '358.75', 'lon_min': '0.0'}
{'start_year': '2015', 'end_year': '2100'}
{'id': 'ssp585', 'ssp': 'ssp5', 'forcing_level': '8.5'}
{'model': 'GFDL-ESM4', 'institution': 'NOAA-GFDL', 'institution_name': 'National Oceanic and Atmospheric Administration, Geophysical Fluid Dynamics Laboratory'}
{'long_name': 'yearly_deep_winter_days_-30C', 'units': 'd', 'fill_value': '-9999', 'description': 'Number of Deep Winter Days, calculated over a yearly frequency with a daily minimum temperature threshold of -30C using xclim.indices.tn_days_below().'}

or in a notebook cell you could just execute ds and view the attributes in the dataset display preview thingy:

[image: screenshot of the dataset preview in a notebook]

which should yield something like:

(cmip6-utils) [jdpaul3@chinook04 ~]$ ncdump -h /import/beegfs/CMIP6/jdpaul3/scratch/output/GFDL-ESM4/ssp585/dw/dw_GFDL-ESM4_ssp585_indicator.nc
netcdf dw_GFDL-ESM4_ssp585_indicator {
dimensions:
        lat = 43 ;
        lon = 288 ;
        year = 86 ;
        scenario = 1 ;
        model = 1 ;
variables:
        double lat(lat) ;
                lat:_FillValue = NaN ;
                lat:name = "latitude" ;
                lat:units = "degrees north" ;
                lat:fill_value = "NaN" ;
                lat:lat_max = "90.0" ;
                lat:lat_min = "50.41884816753927" ;
        double lon(lon) ;
                lon:_FillValue = NaN ;
                lon:name = "longitude" ;
                lon:units = "degrees east" ;
                lon:fill_value = "NaN" ;
                lon:lon_max = "358.75" ;
                lon:lon_min = "0.0" ;
        int64 year(year) ;
                year:start_year = "2015" ;
                year:end_year = "2100" ;
        string scenario(scenario) ;
                scenario:id = "ssp585" ;
                scenario:ssp = "ssp5" ;
                scenario:forcing_level = "8.5" ;
        string model(model) ;
                model:model = "GFDL-ESM4" ;
                model:institution = "NOAA-GFDL" ;
                model:institution_name = "National Oceanic and Atmospheric Administration, Geophysical Fluid Dynamics Laboratory" ;
        int64 dw(scenario, model, year, lat, lon) ;
                dw:long_name = "yearly_deep_winter_days_-30C" ;
                dw:units = "d" ;
                dw:fill_value = "-9999" ;
                dw:description = "Number of Deep Winter Days, calculated over a yearly frequency with a daily minimum temperature threshold of -30C using xclim.indices.tn_days_below()." ;

// global attributes:
                :title = "Yearly Number of Deep Winter Days (-30C threshold), 2015-2100: GFDL-ESM4-ssp585" ;
                :author = "Scenarios Network for Alaska and Arctic Planning (SNAP), International Arctic Research Center, University of Alaska Fairbanks" ;
                :creation_date = "02/20/2024, 13:37:37" ;
                :email = "uaf-snap-data-tools@alaska.edu" ;
                :website = "https://uaf-snap.org/" ;
                :references = "A list of references TBD!" ;
}

Note here that ncdump shows an additional _FillValue = NaN for the lat and lon coordinates, which I assume is being pulled from the actual data. The second fill_value = "NaN" is the one I provide in the standardized attribute dictionary. This is redundant when using ncdump, but since xarray does not display NA values in the same way, it might be OK to keep as is. What are your thoughts on that?