pacificclimate / data-prep-actions

Data Preparation actions - record of ephemera used to prepare data for PCIC data portals and other tools

Formatting for time-invariant watershed data #21

Closed corviday closed 4 years ago

corviday commented 5 years ago

This isn't ready to merge yet, but I'd like @rod-glover to check over the metadata at this point. This YAML file is read by update_metadata to adjust the metadata for a dataset that contains elevation data for the Pacific Northwest. Some of the attribute values were derived from the PCIC standards, but others were derived from more speculative documents or from previous datasets.

It is treated as gridded observations with no time axis.

After running the update script, the metadata for the file is the following:

netcdf elevation {
dimensions:
    lon = 1088 ;
    lat = 512 ;
variables:
    double lon(lon) ;
        lon:standard_name = "longitude" ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
        lon:axis = "X" ;
    double lat(lat) ;
        lat:standard_name = "latitude" ;
        lat:long_name = "latitude" ;
        lat:units = "degrees_north" ;
        lat:axis = "Y" ;
    float elev(lat, lon) ;
        elev:long_name = "Average elevation of grid cell" ;
        elev:units = "m" ;
        elev:_FillValue = NaNf ;
        elev:missing_value = NaNf ;
        elev:cell_methods = "area: mean" ;

// global attributes:
        :CDI = "Climate Data Interface version ?? (http://mpimet.mpg.de/cdi)" ;
        :Conventions = "CF 1.7" ;
        :history =  [elided for length]
        :institution = "Pacific Climate Impacts Consortium" ;
        :title = "VICGL soil parameter file - uncalibrated" ;
        :creation_date = "2018-01-08-T11:05:31Z" ;
        :contact = [Markus' contact info removed]
        :domain = "nwna" ;
        :modeling_realm = "land" ;
        :product = "gridded observations" ;
        :frequency = "fx" ;
        :project_id = "other" ;
        :table_id = "na" ;
        :forcing_type = "na" ;
        :forcing_domain = "na" ;
        :configuration_id = "VICGL_1_0.0 and VICGL_1_0.1" ;
        :method = "Variable Infiltration Capacity Model - Glacier" ;
        :method_id = "VIC-GL" ;
        :version = "VICGL 1.0" ;
        :resolution = "0.0625 decimal-degrees" ;
        :type = "gridded parameters" ;
        :NCO = "\"4.6.0\"" ;
        :CDO = "Climate Data Operators version 1.9.3 (http://mpimet.mpg.de/cdo)" ;
        :institute_id = "PCIC" ;
        :experiment_id = "historical" ;
        :model_id = "base" ;
        :run = "run1" ;
}
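
A minimal sketch of how the "no time axis" treatment could be sanity-checked from the dump above. It assumes the dimensions and global attributes have already been read into plain dicts; the function name `is_time_invariant` is hypothetical and is not part of update_metadata or nchelpers.

```python
def is_time_invariant(dimensions, global_attrs):
    """Return True if there is no time dimension and frequency is 'fx'.

    dimensions:   dict of dimension name -> size, e.g. {"lon": 1088, "lat": 512}
    global_attrs: dict of global attribute name -> value
    """
    # Assumption: a time axis, if present, is a dimension literally named "time".
    has_time_dim = any(name.lower() == "time" for name in dimensions)
    return not has_time_dim and global_attrs.get("frequency") == "fx"


# Values copied from the dump above.
dimensions = {"lon": 1088, "lat": 512}
global_attrs = {"frequency": "fx", "product": "gridded observations"}
print(is_time_invariant(dimensions, global_attrs))
```
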
corviday commented 5 years ago

Thanks for taking a look at this!

> I'm assuming that the values for domain and modeling_realm make sense to the scientists.

modeling_realm is one the scientists gave, though I'm not sure if there's a controlled vocabulary for it I need to compare with. domain is nwna = North West North America; I saw it on another hydrology dataset that covered the same area and decided it applied to this dataset too. If you have any better suggestions, I'd be happy to hear them!

> I'm curious about model_id; what does base signify? Why isn't this value null or the empty string? Ditto run. model_id and run suggest that this is model output, yet product == gridded observations, which suggests these are observations. This is on the face of it a bit confusing. Do model_id and run describe a gridding procedure that is applied to station observations (or something like that)?

model_id and run are supplied because modelmeta requires them, but this dataset is gridded observations and does not actually have a model or run. We've used these values before for some RVIC data that took gridded observations as an input, as mentioned under the Existing Similar Cases header here.

I agree it's kind of a mess. Any suggestions?

rod-glover commented 5 years ago

Hmm. In that document you linked to (I'd forgotten I'd written it!), there is an analysis of how to progress towards something more or less consistent with our current metadata schema for model outputs, but modified for gridded observations. It looks from both the document and the metadata you give above that we went ahead with Alternative A (project_id = 'other', product = 'gridded observations') and the follow-on decisions and consequences documented in Analysis, Alternative A. (And if I remember rightly, there was a PR on nchelpers that accommodated this.) So far, so good.

What I see in the present PR that is inconsistent with the documented suggestion/decision is that model_id doesn't have a very helpful value, or this value needs to be documented somewhere. I'd be interested to know what base signifies to the scientists and why they chose it. The document suggests using the name of (presumably) a gridding program/procedure, e.g., TPS_NWNA_v1 (which I'd guess means "Thin Plate Spline, Northwest North America, ver. 1"). If they can use a program name like that, it would be more helpful in the long run I think.

As to the attribute run, I'm not sure what to think. It's not discussed in the document. I would guess that its purpose might be to point to a specific configuration of the gridding program (assuming I am right about this being the context). In that case, it might help to have some adjunct attribute, human readable, that indicates where to find out what its value means. If it's just a placeholder, then I'd omit it altogether unless it makes nchelpers or the indexer burp ... which it might. As a value run1 is pretty innocuous, but it's also pretty empty without context.
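
To make the concern about uninformative values concrete, here is a rough sketch of a check that reports which required attributes are missing or hold placeholder-like values. Both `REQUIRED_ATTRS` and `PLACEHOLDERS` are assumptions made for illustration: the thread only says that modelmeta requires model_id and run, and the list of "placeholder" strings is a guess, not a PCIC vocabulary.

```python
# Hypothetical: the full set of attributes modelmeta requires is not given here.
REQUIRED_ATTRS = ("model_id", "run")
# Assumption: values that carry little meaning without extra documentation.
PLACEHOLDERS = {"", "na", "base", "run1"}


def placeholder_attrs(global_attrs):
    """Return {name: value} for required attributes that are missing
    (value None) or whose value looks like a placeholder."""
    flagged = {}
    for name in REQUIRED_ATTRS:
        value = global_attrs.get(name)
        if value is None or value.strip().lower() in PLACEHOLDERS:
            flagged[name] = value
    return flagged


# Values from the metadata dump earlier in this thread.
print(placeholder_attrs({"model_id": "base", "run": "run1"}))
```

A flagged attribute is not necessarily wrong; the result is meant as a list of values to document or revisit, not as a validation failure.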

Does that help any?

Finally, potentially feeding into this, there is a standard (perhaps still being refined) called obs4MIPs, which likely addresses at least some of these questions. The "obs" part of its name stands for "observations", as opposed to model outputs. Lots of people have the same issues to sort out. I was looking at obs4MIPs not long after drafting that document, but various interruptions happened and I didn't get very far with it.

corviday commented 5 years ago

I asked Markus, and he indicated that there actually is a model (whatever that means in this case) for the elevation data, so I will use that.

Should I make "run" something like "na" ?

rod-glover commented 5 years ago

Yeah, there are models and models.

run = 'na' seems like a good idea to me.

corviday commented 5 years ago

Great, thanks! I think we've worked out everything I need to get these into the system.