Proposed series of steps for processing errata on an automated basis for the ESGF AWS cloud:
1. Determine the best time frame for automatically checking, reporting on, and acting on new retractions on the ESGF network. I propose once a week, since the uploading and removing of datasets has slowed considerably in recent months.
2. Diagnose retracted data. A prototype tool has been created that checks the status of a dataset based on its PID and on each individual file's "tracking_id" value. Afterwards, the ESGF search API (metadata tag: retracted=true) can be used to determine the retracted dataset IDs, which are then cross-compared with the PID values found earlier.
3. Write the datasets flagged as needing retraction into an overall "Errata Report" file. This Errata Report can then be shared with the community. The report can also mention any Zarr data that needs to be retracted as a result.
4. Remove flagged datasets that are determined not to have a replacement version on ESGF. Include these dataset IDs (PIDs can also be used here) in the report.
5. Replace flagged datasets that are determined to have a replacement version on ESGF with that new version. Include these dataset IDs (or PIDs) in the report.
6. The ESGF search API can be very useful for aggregating the datasets described in steps 4 and 5.
Other note: The main errata page (errata.es-doc.org) can also be used in this process to provide a description and severity of an erratum issued for the dataset IDs (or PIDs) mentioned above.
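
The cross-comparison and the remove/replace decision described in the steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the prototype tool itself: the index node URL, the function names, the dataset IDs, and the `latest_by_master` lookup are all hypothetical, and the sketch keys on dataset IDs where a real run could equally key on PIDs. It only builds the search URL and plans the actions; fetching the results and touching the cloud copies are left out.

```python
from urllib.parse import urlencode

# Assumed index node; substitute whichever ESGF search endpoint is in use.
SEARCH_URL = "https://esgf-node.llnl.gov/esg-search/search"


def retracted_query_url(offset=0, limit=100, base=SEARCH_URL):
    """Build a search URL for one page of retracted dataset records."""
    params = {
        "type": "Dataset",
        "retracted": "true",  # the metadata tag named in the diagnosis step
        "format": "application/solr+json",
        "fields": "instance_id",
        "offset": offset,
        "limit": limit,
    }
    return base + "?" + urlencode(params)


def master_id(dataset_id):
    """Strip the trailing version segment (e.g. '.v20190101')."""
    return dataset_id.rsplit(".", 1)[0]


def plan_actions(held_ids, retracted_ids, latest_by_master):
    """Split held datasets that are retracted into removals and replacements.

    latest_by_master is a hypothetical lookup from a version-less ID to the
    newest non-retracted dataset ID on ESGF, gathered by a separate query.
    """
    flagged = sorted(set(held_ids) & set(retracted_ids))
    remove, replace = [], {}
    for ds in flagged:
        newer = latest_by_master.get(master_id(ds))
        if newer and newer != ds:
            replace[ds] = newer   # a replacement version exists
        else:
            remove.append(ds)     # no replacement version found
    return remove, replace


def format_report(remove, replace):
    """Render a plain-text Errata Report listing both groups."""
    lines = ["Errata Report", "", "Removed (no replacement on ESGF):"]
    lines += [f"  {ds}" for ds in remove]
    lines += ["", "Replaced with a new version:"]
    lines += [f"  {old} -> {new}" for old, new in sorted(replace.items())]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up IDs for illustration only.
    held = ["P.a.tas.v1", "P.b.pr.v1", "P.c.ua.v1"]
    retracted = ["P.a.tas.v1", "P.b.pr.v1", "P.z.zz.v1"]
    latest = {"P.a.tas": "P.a.tas.v2"}
    remove, replace = plan_actions(held, retracted, latest)
    print(retracted_query_url(limit=50))
    print(format_report(remove, replace))
```

Keeping the planning step separate from any deletion or upload means the weekly job can write and share the report first, and act on it only after review if that is preferred.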