ioos / glider-dac

The IOOS Glider DAC site/scripts/tools
http://gliders.ioos.us/providers/

Process for modifying metadata on existing data sets #238

Open kerfoot opened 1 year ago

kerfoot commented 1 year ago

Metadata on existing data sets is modified with the use of the extra_atts.json file. The file is created in the following directory:

$PRIV_ERDDAP_ROOT/PROVIDER/DATASET_ID

Here's an example of an extra_atts.json file to change the NC_GLOBAL:id and temperature:standard_name:

{
    "_global_attrs": {
        "id": "mote-genie_20170901T175729Z"
    },
    "temperature": {
        "standard_name": "sea_water_temperature"
    }
}
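As a rough illustration of the step above, here is a minimal Python sketch that writes such a file into the deployment directory. The helper name `write_extra_atts` and its arguments are hypothetical and not part of the glider-dac code base; the only things taken from this issue are the `extra_atts.json` file name and the `$PRIV_ERDDAP_ROOT/PROVIDER/DATASET_ID` layout.

```python
import json
import os
from pathlib import Path


def write_extra_atts(root, provider, dataset_id, overrides):
    """Write overrides to <root>/<provider>/<dataset_id>/extra_atts.json.

    Hypothetical helper: `root` stands in for $PRIV_ERDDAP_ROOT.
    """
    # Every top-level key must map to an object of attribute name/value pairs.
    for name, attrs in overrides.items():
        if not isinstance(attrs, dict):
            raise ValueError(f"{name!r} must map to an object of attributes")

    deployment_dir = Path(root) / provider / dataset_id
    deployment_dir.mkdir(parents=True, exist_ok=True)
    target = deployment_dir / "extra_atts.json"

    # Write to a temporary name first, then rename, so that a process
    # reading the directory never sees a half-written file.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(overrides, indent=4) + "\n")
    os.replace(tmp, target)
    return target
```

Calling it with the dictionary shown above as `overrides` would produce the same JSON on disk; whether the DAC then picks the file up is exactly what the questions below ask.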

A few questions:

  1. Once this file has been added with the appropriate changes, when/how are those changes made to the data set and reflected in the ERDDAP dataset and/or archived aggregated files?
  2. Is a compliance check run on the data set once it has been updated? If so, when does this happen?
  3. For previously NCEI archived data sets, how is NCEI notified of the changes so that they can update the accession?

This question is related to the following issues:

  1. issue 160
  2. issue 159

benjwadams commented 1 year ago

This will be going into the documentation wiki:

Modifying dataset aggregation attributes with JSON

A file named extra_atts.json may be included in a deployment directory to
modify metadata. The top level keys are used to refer to variables, with the exception
of the _global_attrs key, which changes dataset global attributes.

Here is an example where the global attribute institution has been changed,
along with the attributes valid_min and valid_max in the variable longitude:

{
  "_global_attrs": {
    "institution": "Oregon State University"
  },
  "longitude": {
    "valid_min": -180,
    "valid_max": 180
  }
}
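To make the key semantics concrete, here is a minimal Python sketch of how such a file could be merged into existing metadata. It is not the DAC's aggregation code: the function name `apply_extra_atts` and the plain-dictionary representation of attributes are assumptions made for illustration, as is the choice to raise an error for a variable that does not exist in the data set.

```python
import copy


def apply_extra_atts(global_attrs, variable_attrs, extra_atts):
    """Return updated copies of the global and per-variable attributes.

    Hypothetical helper. `variable_attrs` maps variable name -> attribute dict.
    """
    new_global = dict(global_attrs)
    new_vars = copy.deepcopy(variable_attrs)

    for key, attrs in extra_atts.items():
        if key == "_global_attrs":
            # The special key updates the data set's global attributes.
            new_global.update(attrs)
        elif key in new_vars:
            # Any other top-level key names a variable.
            new_vars[key].update(attrs)
        else:
            # Assumption: an unknown variable name is treated as an error.
            raise KeyError(f"no variable named {key!r} in data set")

    return new_global, new_vars
```

Applied to the example above, this would replace `institution` in the global attributes and set `valid_min` and `valid_max` on `longitude`, leaving every other attribute untouched.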

TODO: Add functionality in Gliders Providers app to allow providers to modify metadata. Determine suitable representation in database for this functionality.

kerfoot commented 1 year ago

@mdgrossi

Also need a way to alert the DAC administration team when metadata is updated for data sets previously archived by NCEI in order to signal that the NCEI accession record should be updated:

  1. The data set needs to be re-aggregated with the new metadata
  2. NCEI needs to be notified that the record has changed, if they are not already doing this

I have created an API endpoint that allows users to search for accession records by dataset_id, user, or glider. The API will be part of the new status page application.

mdgrossi commented 1 year ago

As things currently stand, all that should be needed for NCEI is a new manifest (i.e., md5) file to accompany the updated data file. That new manifest would be picked up by NCEI and the filename compared to what we already have in the archive. If the data filename is identical to something already in the archive, the appropriate accession will be updated. (If it's a new filename that is not already archived, it will be picked up and archived as usual.)
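For reference, regenerating the checksum for an updated data file could look roughly like the Python sketch below. The manifest layout is an assumption: it writes a sibling file named `<data file>.md5` containing one line in the style of `md5sum` output, which may not match what the NCEI pickup actually expects. The function name `write_md5_manifest` is hypothetical.

```python
import hashlib
from pathlib import Path


def write_md5_manifest(data_file, chunk_size=1 << 20):
    """Write '<hexdigest>  <filename>' to <data_file>.md5 and return its path.

    Hypothetical helper; the manifest format is an assumption.
    """
    data_file = Path(data_file)
    digest = hashlib.md5()

    # Read in chunks so large aggregated files are not loaded into memory at once.
    with data_file.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)

    manifest = data_file.with_name(data_file.name + ".md5")
    manifest.write_text(f"{digest.hexdigest()}  {data_file.name}\n")
    return manifest
```

Because the data filename stays the same and only the checksum changes, this matches the case described above in which an existing accession is updated.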

That said, we should definitely test this to make sure it works as expected. I suggest trying one of the data sets with a misspelled institution. Even though these aren't archived yet, they're in the "staging" area, so the pickup process should be the same. Just let NCEI know when you do this so that we can keep an eye on it to make sure everything works as it ought to.

kerfoot commented 1 year ago

@benjwadams To clarify: should the extra_atts.json file be placed in the ftp submission directory or in the pub erddap location?

kerfoot commented 6 months ago

As discussed at the 2024-02-01 technical tag-up, I ran into an issue with a previously archived data set (g652-20230901T1200) in which the data provider used the incorrect WMO ID.

My understanding is that the snippet for a data set marked as Complete is not regenerated when metadata is modified by submitting an extra_atts.json file. I would like to discuss implementing a watchdog that monitors all data sets marked as Complete. If an extra_atts.json file is uploaded by a data provider, the script would ideally do the following:

  1. Notice the new extra_atts.json file
  2. Place the data set in a queue that would recreate the ERDDAP element
  3. Add the element to the ERDDAP datasets.xml
  4. Wait for a major LoadDatasets event to reload the metadata
  5. Create an NCEI archival package and the associated md5 sum for NCEI pickup

This will allow for the modification of data set metadata and NCEI archiving under an existing accession number.
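
As a starting point for discussion, the first two steps of the list above (noticing a new or changed extra_atts.json and queueing the data set) could be sketched in Python as below. Everything here is an assumption rather than existing DAC code: the function name `find_changed_extra_atts`, the `ROOT/PROVIDER/DATASET_ID` directory layout, and the JSON state file used to remember modification times between runs. The sketch also does not check whether a data set is marked as Complete; that lookup would have to come from the DAC database.

```python
import json
from pathlib import Path


def find_changed_extra_atts(root, state_file):
    """Return deployment directories whose extra_atts.json is new or modified.

    Hypothetical helper. `state_file` stores the modification time seen on
    the previous run, keyed by deployment directory.
    """
    state_path = Path(state_file)
    seen = json.loads(state_path.read_text()) if state_path.exists() else {}

    changed = []
    # Assumed layout: <root>/<provider>/<dataset_id>/extra_atts.json
    for atts_file in sorted(Path(root).glob("*/*/extra_atts.json")):
        key = str(atts_file.parent)
        mtime = atts_file.stat().st_mtime_ns
        if seen.get(key) != mtime:
            changed.append(atts_file.parent)
            seen[key] = mtime

    # Persist what was seen so the next run only reports further changes.
    state_path.write_text(json.dumps(seen))
    return changed
```

Run periodically (for example from cron), the returned directories would be the queue for regenerating the ERDDAP element and, after the next major LoadDatasets, for building the NCEI archival package.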