nansencenter / nansat

Scientist friendly Python toolbox for processing 2D satellite Earth observation data.
http://nansat.readthedocs.io
GNU General Public License v3.0

Standardize metadata and its validation #120

Closed mortenwh closed 8 years ago

mortenwh commented 9 years ago

Nansat should have some standard methods that return required metadata, like:

Following that, generic tests should be made to make sure this metadata is actually added by the mappers.

asumak commented 9 years ago

The metadata key name was changed from 'start_date' to 'start_time' in mapper_asar.py in 2c7097a. 'start_date' and 'stop_date' are used in sadcat. If we switch to 'start_time' and 'stop_time', we need to change the files in sadcat. Is this change necessary?

mortenwh commented 9 years ago

I think we agreed on time instead of date. It shouldn't be a problem to change sadcat, but Anton can probably answer that.


akorosov commented 9 years ago

We should probably use existing conventions instead. E.g. UNIDATA suggests (http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/metadata/DataDiscoveryAttConvention.html) using time_coverage_start, time_coverage_end, time_coverage_duration and time_coverage_resolution.

aleksandervines commented 9 years ago

The time_coverage_start attribute in many NetCDF files produced for normap has the form "YYYY-MM-DDZ". The "Z" should not really be there, as the time zone is irrelevant when only a date is given and no time. As a result, dateutil.parser.parse fails (in mapper_generic) because this string is not an allowed date string.

I don't think the creation of the Nansat object should fail because of this. This particular example file contains a time dimension, which could be read instead if setting the time from the metadata fails.

Also, the VRT._set_time method sets the 'time' variable in the metadata for the bands, but it does not say anywhere what this time is supposed to be. Start time? End time? Average time? All times? From the code I can see that it is fetched from start-time-ish variables. But it is, as mentioned in #143, "a bit awkward", and the whole time-metadata thing should be reconsidered imo.

n = Nansat("/WebData/normap.nersc.no/arctic12km_seaice/arctic12km_seaice_20100801_20100831.nc")

=>Arctic Sea Ice Concentration<=
Traceback (most recent call last):
  File "", line 1, in
  File "/mnt/10.11.12.231/Home/alevin/nansat-v0.6.6/lib/python2.7/site-packages/nansat-0.7_dev.0-py2.7-linux-x86_64.egg/nansat/nansat.py", line 168, in __init__
    self.vrt = self._get_mapper(mapperName, _kwargs)
  File "/mnt/10.11.12.231/Home/alevin/nansat-v0.6.6/lib/python2.7/site-packages/nansat-0.7_dev.0-py2.7-linux-x86_64.egg/nansat/nansat.py", line 1744, in _get_mapper
    _kwargs)
  File "/mnt/10.11.12.231/Home/alevin/nansat-v0.6.6/lib/python2.7/site-packages/nansat-0.7_dev.0-py2.7-linux-x86_64.egg/nansat/mappers/mapper_generic.py", line 245, in __init__
    self._set_time(parse(gdalMetadata['time_coverage_start']))
  File "build/bdist.linux-x86_64/egg/dateutil/parser.py", line 1008, in parse
  File "build/bdist.linux-x86_64/egg/dateutil/parser.py", line 395, in parse
ValueError: Unknown string format
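
One possible way to keep the Nansat object from failing on such a string is sketched below. It is only an illustration, not what mapper_generic does: the function name parse_time_coverage and the list of accepted formats are assumptions, and it uses the standard library instead of dateutil. It returns None when nothing matches, so a caller could fall back to the time dimension of the file.

```python
from datetime import datetime

# Formats to try, most specific first. The date-only "Z" variant is the
# form reported in this comment; the others are assumptions about what
# else may show up in time_coverage_start.
_TIME_FORMATS = (
    '%Y-%m-%dT%H:%M:%SZ',
    '%Y-%m-%dT%H:%M:%S',
    '%Y-%m-%dZ',
    '%Y-%m-%d',
)


def parse_time_coverage(value):
    """Return a datetime for value, or None if no known format matches."""
    for fmt in _TIME_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None
```

With this, parse_time_coverage('2010-08-01Z') yields midnight on 1 August 2010, and an unrecognised string yields None instead of raising.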

mortenwh commented 9 years ago

We need to have a second look at what metadata is required. We should define which standards to follow and implement them similarly to the wkv.xml file. Perhaps the well-known-variables and additional standards could be stored in a thesaurus-module, or we choose to use xml files?

mortenwh commented 9 years ago

http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/FeatureDatasets/Overview.html

akorosov commented 9 years ago

I agree. We definitely need standardization. @mortenwh, what do you mean by thesaurus-module?

mortenwh commented 9 years ago

Basically a dictionary that defines the standards used but let's discuss when you're back :)
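
A minimal sketch of what such a dictionary could look like, purely as an illustration. The names THESAURUS and validate, the accepted time format, and the assignment of each attribute to a standard are all assumptions made for this example, not an existing Nansat API.

```python
from datetime import datetime


def _is_iso_time(value):
    """True if value looks like YYYY-MM-DDTHH:MM:SS (illustrative format only)."""
    try:
        datetime.strptime(value, '%Y-%m-%dT%H:%M:%S')
    except (TypeError, ValueError):
        return False
    return True


def _is_nonempty_string(value):
    return isinstance(value, str) and value.strip() != ''


# Hypothetical thesaurus: attribute name -> (standard it comes from, validator)
THESAURUS = {
    'time_coverage_start': ('UNIDATA', _is_iso_time),
    'time_coverage_end': ('UNIDATA', _is_iso_time),
    'platform': ('GCMD', _is_nonempty_string),
    'instrument': ('GCMD', _is_nonempty_string),
}


def validate(name, value):
    """Raise ValueError if a standardized attribute has an invalid value.

    Attributes that are not listed in the thesaurus are accepted unchanged.
    """
    if name not in THESAURUS:
        return
    standard, is_valid = THESAURUS[name]
    if not is_valid(value):
        raise ValueError('%s=%r does not follow %s' % (name, value, standard))
```

Keeping the standards in one dictionary would let mappers and tests share the same definitions; whether that lives in a module or in xml files is the open question here.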


aleksandervines commented 9 years ago

We could validate the metadata before it is added to the VRT dataset by subclassing the GDAL dataset (self.dataset) and overriding the SetMetadataItem function so that it also validates against the thesaurus.

Something like this:

from osgeo import gdal
import nansenmetadata.thesaurus  # the proposed thesaurus module (hypothetical)

# Sketch only: gdal.Open() returns plain gdal.Dataset objects, so this
# subclass would still have to be wired in where the dataset is created.
class NansatDataset(gdal.Dataset):
    def SetMetadataItem(self, name, value, domain=''):
        # Validate against the thesaurus before storing the item
        nansenmetadata.thesaurus.validate(name, value)
        return super(NansatDataset, self).SetMetadataItem(name, value, domain)

akorosov commented 8 years ago

Most of the mappers set time_coverage_start, time_coverage_end, platform and instrument according to the UNIDATA and GCMD standards. The end2endtests check that these attributes are in the metadata and correspond to the nersc-metadata controlled vocabulary. I consider this ticket closed by 48348d3. For more specific issues, a new ticket should be created.
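
For readers who want a feel for this kind of check, here is a rough sketch. It is not the actual end2endtests code: the function name find_metadata_problems is made up, and the vocabulary entries are placeholders, not the real nersc-metadata lists.

```python
# Attributes named in this ticket as the ones the mappers should set
REQUIRED_KEYS = (
    'time_coverage_start',
    'time_coverage_end',
    'platform',
    'instrument',
)

# Placeholder vocabularies for illustration only
VOCABULARY = {
    'platform': {'ENVISAT'},
    'instrument': {'ASAR'},
}


def find_metadata_problems(metadata):
    """Return a list of problems found in a mapper's metadata dictionary."""
    problems = []
    # Every required attribute must be present
    for key in REQUIRED_KEYS:
        if key not in metadata:
            problems.append('missing: %s' % key)
    # Attributes with a controlled vocabulary must use one of its entries
    for key, allowed in VOCABULARY.items():
        if key in metadata and metadata[key] not in allowed:
            problems.append('not in vocabulary: %s=%s' % (key, metadata[key]))
    return problems
```

A test would then assert that the returned list is empty for the metadata of each mapper's test file.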