nanoos-pnw / NCEI-archiving

Code, documentation and issue tracking for NANOOS NCEI archiving
Apache License 2.0

NCEI Archiving Checks #4

Open MathewBiddle opened 7 years ago

MathewBiddle commented 7 years ago

Starting a new topic to get your feedback on what kinds of checks and notifications you would like for the procedure. As of now, we will be checking and validating the following items (in conjunction with basic validation that the files open and can be read, that file names match the convention, that the HTTP site still functions, etc.):

  1. CF standard name validates.
  2. Package follows the BagIt convention (contains bag-info.txt, bagit.txt, manifest-sha256.txt, and tagmanifest-sha256.txt, as well as a data/ directory).
  3. Package validates against its manifest files (both manifest-sha256.txt and tagmanifest-sha256.txt).
  4. instrument:long_name exists.
  5. institution and/or creator_institution global attribute exists.
  6. project global attribute exists.
  7. platform:long_name exists and/or external-identifier from bag-info.txt exists.
  8. sea_name global attribute exists.
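Items 2 and 3 above amount to a layout check plus a checksum verification. A minimal sketch of that pair of checks, assuming SHA-256 manifests in the standard whitespace-separated BagIt format (in practice the Library of Congress bagit-python library's `Bag.validate()` does this more thoroughly):

```python
import hashlib
from pathlib import Path

REQUIRED_TAG_FILES = ["bag-info.txt", "bagit.txt",
                      "manifest-sha256.txt", "tagmanifest-sha256.txt"]

def check_bag(bag_dir):
    """Check BagIt layout and verify payload hashes against manifest-sha256.txt."""
    bag = Path(bag_dir)
    errors = []
    # Item 2: required tag files and the data/ payload directory exist.
    for name in REQUIRED_TAG_FILES:
        if not (bag / name).is_file():
            errors.append(f"missing {name}")
    if not (bag / "data").is_dir():
        errors.append("missing data/ directory")
    # Item 3: every manifest entry matches the checksum of the file on disk.
    manifest = bag / "manifest-sha256.txt"
    if manifest.is_file():
        for line in manifest.read_text().splitlines():
            expected, relpath = line.split(maxsplit=1)
            target = bag / relpath
            if not target.is_file():
                errors.append(f"{relpath} listed in manifest but not found")
            elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
                errors.append(f"checksum mismatch for {relpath}")
    return errors
```

A complete check would also verify tagmanifest-sha256.txt the same way and flag payload files that are on disk but absent from the manifest.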

I don't want to make the checks so robust that we are continuously monitoring the process. But I'd like to make sure we are getting everything we expected.

Let me know what your thoughts are.

emiliom commented 7 years ago

Thanks. Many of those items are ones we (@cseaton) intend to address in the next data file upload; they are listed in #1 (I've updated the list of issues and TO-DOs to reflect things that have come up). Others are things we've discussed and expect (e.g., 2 and 3).

Then there are others that are more generic validations we haven't specifically talked about (5, 6, 8). I've been thinking about some of these lately. Do you validate the entries themselves (not just their existence) against published vocabularies, say for ACDD GCMD keywords, platform and instrument, and for NCEI attributes in the platform and instrument variables? And do you validate the formats of other ACDD metadata (e.g., geospatial and time attributes)? Personally, I think that would be helpful. It would also be helpful to get from NCEI, later (January?), a list of what tests you've run and how, and if possible the results per station or file.
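The existence-versus-valid-entry distinction raised here can be sketched in a few lines. This is not NCEI's actual procedure; the attribute names come from the checklist above, the controlled list is illustrative, and in practice the attributes would be read from a netCDF file (e.g. via netCDF4's `ncattrs()`):

```python
# Required global attributes, per the checklist in this thread (simplified:
# the real check treats institution/creator_institution as and/or).
REQUIRED_GLOBALS = ["institution", "project", "sea_name"]

# Illustrative controlled list; a real check would use NCEI's Sea Names table.
KNOWN_SEA_NAMES = {"Coastal Waters of Washington/Oregon", "North Pacific Ocean"}

def check_global_attrs(attrs):
    """Check a dict of global attributes for existence AND valid entries."""
    problems = []
    for name in REQUIRED_GLOBALS:
        if name not in attrs or not str(attrs[name]).strip():
            problems.append(f"global attribute '{name}' missing or empty")
    # Validate the entry itself, not just its existence.
    sea = attrs.get("sea_name")
    if sea is not None and sea not in KNOWN_SEA_NAMES:
        problems.append(f"sea_name '{sea}' not in controlled vocabulary")
    return problems
```

The same pattern extends to GCMD keyword lists and CF standard names: existence checks are cheap, while entry validation needs a maintained copy of each vocabulary.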

Do you/NCEI run the IOOS Compliance Checker with the NCEI plugin, to get an official, standardized assessment record? Independently, Charles and I should probably run it later on all the files we submit, for our own record and future assessment.

> I don't want to make the checks so robust that we are continuously monitoring the process.

I assume we'd make a distinction between tests you will run on our next submission, vs tests you run automatically on all future, operational submissions? Do you run automatic tests on operational submissions?

MathewBiddle commented 7 years ago

So, when I say 'validate', I mean we check your term against an internally managed table and confirm that we have mapped it to the appropriate metadata element. In some cases we check formats; in other cases we might just take your term and copy it into the metadata record verbatim.
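As a toy illustration of that table-lookup step (the table contents and target element names below are made up for the example, not NCEI's real mappings):

```python
# Hypothetical internally managed table: submitted attribute -> metadata
# element, plus whether the value is copied verbatim or needs format checks.
TERM_TABLE = {
    "sea_name": {"element": "geographicCoverage", "copy_verbatim": False},
    "project":  {"element": "projectName",        "copy_verbatim": True},
}

def map_term(attribute, value):
    """Map a submitted term to its metadata element via the internal table."""
    entry = TERM_TABLE.get(attribute)
    if entry is None:
        raise KeyError(f"'{attribute}' has no mapping in the internal table")
    # Verbatim terms are copied straight into the metadata record; the rest
    # would get extra format/vocabulary checks first (omitted in this sketch).
    return entry["element"], value
```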

As for checking the files against the IOOS Compliance Checker, we don't do that. Through the interaction we've been having, I've used the Compliance Checker to give you feedback on what metadata needs to be modified/added/deleted, but we will not be checking each file once we start the ingest process. It is assumed that whatever your backend process is, it will generate the files in compliance with the established format we've agreed upon (valid netCDF files following current conventions), as documented in the ATRAC record.

Right now, I am working through all the data files to see what I can gather consistently to appropriately document the packages you will be submitting. Once I've wrapped my head around, and documented, all of the nuances, I will send my procedure to be implemented. So, I'm not sure how to answer your last question. Yes, there will be a distinction between the questions I've been bringing up and what our automated process will send out. Once we've implemented the procedure, all of the checks will be automated.

MathewBiddle commented 7 years ago

Just in case someone stumbles on this thread. The recommendations I'm providing here are primarily for the NANOOS archival process at NCEI. While some of the information might be useful and applicable to other data sets, these are not blanket statements for all of NCEI's archival procedures.

emiliom commented 7 years ago

Thanks!! And regarding this:

> So, I'm not sure how to answer your last question. Yes, there will be a distinction between the questions I've been bringing up and what our automated process will send out. Once we've implemented the procedure, all of the checks will be automated.

No worries. Your expectation that our automated process will reliably reproduce the make-up of the files you've extensively assessed at the outset is a reasonable one. But it does make me think that, eventually, it would be good for us to include some sort of auto-check/verification of each file before it is made available to NCEI on the regular, operational schedule, just to raise flags on obvious problems. For Charles's and my own future reference, auto tests I can think of include:
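One possible shape for such a pre-release gate is a short list of cheap checks run per file, flagging (not fixing) anything suspicious. The specific checks below are placeholder assumptions, not an agreed-upon list; a real "file opens" check would use netCDF4:

```python
from pathlib import Path

def file_opens(path):
    # Stand-in for "the netCDF file opens": require a readable, non-empty
    # file. A real check would attempt netCDF4.Dataset(path).
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0

def name_matches_convention(path):
    # Stand-in for the agreed file-name convention (assumed .nc suffix here).
    return Path(path).suffix == ".nc"

CHECKS = [file_opens, name_matches_convention]

def preflight(path):
    """Return the names of checks that failed for one file."""
    return [chk.__name__ for chk in CHECKS if not chk(path)]
```

Files with a non-empty `preflight()` result would be held back from the operational upload and reported, rather than silently passed along to NCEI.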

emiliom commented 7 years ago

> Just in case someone stumbles on this thread. The recommendations I'm providing here are primarily for the NANOOS archival process at NCEI. While some of the information might be useful and applicable to other data sets, these are not blanket statements for all of NCEI's archival procedures.

Heh. I understand. I'll raise the prominence of that blanket statement by adding it to the repository README file.