ioos / ioos-atn-data

Code, documentation and issue tracking for ATN NCEI archiving
https://ioos.github.io/ioos-atn-data/

Using ERDDAP to build an archival package for NCEI to pickup #28

Open · MathewBiddle opened this issue 2 years ago

MathewBiddle commented 2 years ago

@iamchrisser I wanted to pull the information I found out of an email and into something we can summarize. Feel free to add your experiences in this ticket too.

There's probably some way we can use this functionality to accomplish what we're after with the ATN automation. Maybe there could be something at NCEI that monitors http://erddap.ioos.us/erddap/files/ for new directories/changes to checksums in directories.

MathewBiddle commented 2 years ago

The problem we run into with using ERDDAP to submit data to NCEI is file fixity. Every time NCEI goes to an ERDDAP endpoint and downloads a file, the resulting file has a different checksum, even if the contents are the same. It's been stated that this is because some of the metadata ERDDAP includes in the file is generated dynamically, so it changes on every download and produces a different checksum.
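
For illustration, downloading the same .nc response twice and comparing checksums shows the issue; a minimal bash sketch (the dataset URL is just a placeholder):

# Download the same ERDDAP .nc response twice; per the note above, the checksums
# will differ even though the data are identical. The dataset URL is a placeholder.
URL="https://erddap.ioos.us/erddap/tabledap/someDatasetID.nc"
curl -s -o first.nc  "$URL"
curl -s -o second.nc "$URL"
sha256sum first.nc second.nc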

The ArchiveADataset tool resolves the fixity problem; however, it does not resolve the transfer-mechanism problem: how will NCEI pick up a package that has been generated by that tool?

These are questions to explore.

MathewBiddle commented 2 years ago

@BobSimons, FYI.

BobSimons commented 2 years ago

I'll add:

If the dataset is available via ERDDAP's "files" system and the files, once made available, don't change (e.g., yearly, monthly, or daily files that never change), then having NCEI scan the "files" directory for new files and download them is a good solution.
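
A minimal sketch of such a pickup job, assuming the directory listing can be fetched as .csv (ERDDAP offers directory listings in several formats) and using hypothetical dataset, state-file, and destination paths:

#!/bin/bash
# Watch an ERDDAP "files" directory and download any file not seen before.
# FILES_URL, SEEN, and DEST are placeholders; the .csv listing format and its
# header row(s) should be confirmed against the target ERDDAP server.
FILES_URL="https://erddap.ioos.us/erddap/files/someDatasetID/"
SEEN=/data/ncei/seen.txt
DEST=/data/ncei/incoming
mkdir -p "$DEST"; touch "$SEEN"

# First CSV column is the file name; skip the header and any subdirectories.
curl -s "${FILES_URL}.csv" | awk -F, 'NR>1 {print $1}' | grep -v '/$' | while read -r name; do
  if ! grep -qxF "$name" "$SEEN"; then
    curl -s -o "$DEST/$name" "${FILES_URL}${name}"
    echo "$name" >> "$SEEN"
  fi
done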

ArchiveADataset presumes that the ERDDAP admin knows when the dataset's data for a given time range is changing or has finished changing. Once a chunk of data (e.g., the data for a given year or month) has finished changing, the admin can run ArchiveADataset and send the resulting file to NCEI. The presumption is that NCEI knows less (or nothing) about the dataset and whether the data for a given time range is still changing, and also wouldn't know if data that was thought to be unchanging later changed.

How will NCEI pick up the ArchiveADataset result? That's for the data provider and NCEI to work out. One option is to put the archives in a directory and create an EDDTableFromFileNames dataset, which makes the archive files publicly accessible. You could then tell NCEI to scan that directory and pick up any new files.

If these options are insufficient and some other feature should be added to ERDDAP to facilitate archiving to NCEI, please let me know.

relphj commented 2 years ago

The problem I have seen in the past is that when ERDDAP provides a file for download, it updates the ":history" attribute to record when the file was "created" and provided for download. The same thing happens when ArchiveADataset is called to build an archive package out of a set of data files. Thus, every time the file is packaged it is different.

Bob's solution works if and only if the admin knows exactly when a dataset has changed and only calls ArchiveADataset when the dataset data and/or metadata have been updated. This makes it hard to automate.

Ideally, in my view, the system should provide a way for a set of packages to be generated automatically (using ArchiveADataset), but those packages would only be updated when actual data and/or metadata have changed, and possibly only once the admin has indicated those data are ready for archival.

For example, let's say a dataset "A" is created. Once the admin is satisfied that "A" is ready for archival, they would set the "ready for archival" flag on the dataset. The automatic archival package generation routine would fire each day, notice that the "ready for archival" flag is set, and, if a package was not already in the WAF for "A", generate a new package (using ArchiveADataset).

After that, unless changes were made to the data and/or metadata, the package would remain unchanged. But if changes were made to the data and/or metadata, the automatic archival package generation routine would notice that "A" had been changed more recently than the package in the WAF, and would trigger the building of an updated package to the WAF. In this way, the admin would not have to remember to trigger the generation of a package manually.
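
The core check ("A" changed more recently than the package in the WAF) could be as simple as a timestamp comparison, assuming the admin has filesystem access to the dataset's source files; a rough sketch with hypothetical paths:

# Rebuild the package if it is missing or older than any source file.
# SRC_DIR and PKG are hypothetical paths for a dataset "A".
SRC_DIR=/data/datasets/A
PKG=/data/waf/A.tar.gz
if [ ! -f "$PKG" ] || [ -n "$(find "$SRC_DIR" -newer "$PKG" -print -quit)" ]; then
  echo "dataset A is new or has changed since the last package; run ArchiveADataset"
fi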

BobSimons commented 2 years ago

ERDDAP only puts the request URL and the date of the request in the "history" attribute of .nc files (where there are attributes). Other file types (e.g., .jsonl, .csv) don't have attributes, so they don't have a "history" attribute. So there is a way to determine whether the data for a given request (e.g., a given time period) has changed: make a non-.nc request and see if the response is different from the previous response to that request. Thus, you could automate the creation of a new archive package based on whether, e.g., the .jsonl response has changed.
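
A minimal bash sketch of that check, using a .csv request (a json-lines response type would work the same way); the dataset URL and state-file path are placeholders:

# Hash a non-.nc response and compare it to the hash from the previous run.
# URL and STATE are placeholders for the real dataset and a local state file.
URL="https://erddap.ioos.us/erddap/tabledap/someDatasetID.csv"
STATE=/data/state/someDatasetID.sha256
NEW=$(curl -s "$URL" | sha256sum | awk '{print $1}')
if [ ! -f "$STATE" ] || [ "$NEW" != "$(cat "$STATE")" ]; then
  echo "data changed -- time to build a new archive package"
  echo "$NEW" > "$STATE"
fi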

I hope that helps.

MathewBiddle commented 2 years ago

started a Code Sprint project page here https://ioos.github.io/ioos-code-sprint/2022/topics/06-using-erddap-for-ncei-archive.html

MathewBiddle commented 2 years ago

Thanks @BobSimons and @relphj. Unfortunately, the agreed-upon specification for most of the datasets to be archived through this pathway is netCDF, as NCEI needs all the associated metadata to build the archive metadata records.

I'm wondering: is it possible to run ArchiveADataset non-interactively, by listing all the answers on the command line?

I can envision the data provider building a system similar to @relphj's recommendation. The flow would be:

  1. Create/edit a configuration file that lists the dataset IDs in the host ERDDAP to be archived at NCEI (i.e., setting the "ready for archival" flag).
     i. The data provider would have to manage which ones are new/updated.
     ii. Question: How would the ERDDAP admin know when a previously shared dataset has been updated? They would need to manage that piece somehow.
  2. A script (run at some frequency TBD by the provider) uses the config file to run ArchiveADataset for each dataset. This assumes you can run it by listing all the answers on the command line, e.g. $ ArchiveADataset.sh BagIt tar.gz [contact] [datasetID] all .nc SHA-256
  3. The resulting BagIt package is put in the appropriate WAF for NCEI to pick up (see the sketch after this list).
     i. It would be nice to use ERDDAP's files system to share the packages, but that clutters up the ERDDAP with duplicate data. Is it possible to have a package available via the files system, but not available through the rest of ERDDAP's services?
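
A rough bash sketch of that flow, assuming the non-interactive ArchiveADataset invocation shown later in this thread works; the config-file path, WAF directory, and ArchiveADataset output location are all assumptions to adjust for a real install:

#!/bin/bash
# Loop over "ready for archival" dataset IDs, build a BagIt package for each,
# and copy the results into the WAF. Paths below are placeholders.
set -euo pipefail

ERDDAP_WEBINF=/usr/local/tomcat/webapps/erddap/WEB-INF   # where ArchiveADataset.sh lives
WAF_DIR=/data/waf/atn                                    # WAF that NCEI scans
CONFIG=/data/config/datasets.txt                         # one datasetID per line

while read -r dsid; do
  [ -z "$dsid" ] && continue
  (cd "$ERDDAP_WEBINF" && \
    bash ArchiveADataset.sh -verbose BagIt tar.gz default "$dsid" default "" "" .nc SHA-256)
  # ArchiveADataset writes its output under ERDDAP's bigParentDirectory; the exact
  # path below is an assumption -- adjust it, and also copy any checksum sidecar file.
  cp /erddapData/ArchiveADataset/"$dsid"*.tar.gz* "$WAF_DIR"/
done < "$CONFIG"
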
BobSimons commented 2 years ago

Regarding 1ii) I again offer this solution: The admin can run a script which requests a non-.nc version of the data (e.g., .jsonl) and calculates the md5 (or sha256 or...) of that data file. Whenever that md5 changes from the previous value, the dataset has changed and is ready to be archived.

Regarding 3i) Yes. Make an EDDTableFromFileNames dataset which points to all the files in a directory (and subdirectories if needed). Any files the administrator puts in that directory which match the dataset's file name regex will be available via ERDDAP's files system.

MathewBiddle commented 2 years ago

Regarding 1ii) I again offer this solution: The admin can run a script which requests a non-.nc version of the data (e.g., .jsonl) and calculates the md5 (or sha256 or...) of that data file. Whenever that md5 changes from the previous value, the dataset has changed and is ready to be archived.

AHH, so you propose the data provider makes some intermediary csv/json (non-.nc) file to check for changes. Thank you for the clarification.

Here's an example in Windows PowerShell of calculating the hash of an ERDDAP .csv endpoint:

C:\Users> $wc = [System.Net.WebClient]::new()
C:\Users> Get-FileHash -InputStream ($wc.OpenRead("http://erddap.ioos.us/erddap/tabledap/raw_asset_inventory.csv"))

Algorithm       Hash                                                                   Path
---------       ----                                                                   ----
SHA256          44CD532AD8B1381557DE5252E88428FA1574FA3B04411B974666131E31808174
MathewBiddle commented 2 years ago

jsonl is the preferred format.

There has been a request for ArchiveADataset to support specifying an external directory; whatever is in that external directory would be included in the BagIt file.

Add an optional yes/no flag to include the ISO metadata record from ERDDAP.
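
Until such a flag exists, the ISO record ERDDAP already generates could be fetched and staged in that external directory before packaging; a one-line sketch (the URL pattern follows ERDDAP's usual metadata layout and should be confirmed, and the datasetID is a placeholder):

# Fetch the ISO 19115 record ERDDAP generates for a dataset so it can be staged
# alongside the data before the BagIt package is built. Placeholder datasetID.
curl -s -o someDatasetID_iso19115.xml \
  "https://erddap.ioos.us/erddap/metadata/iso19115/xml/someDatasetID_iso19115.xml"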

MathewBiddle commented 2 years ago

Run as a one-liner:

docker run --rm -it \
  -v "$(pwd)/datasets:/datasets" \
  -v "$(pwd)/logs:/erddapData/logs" \
  -v "$(pwd)/erddap/content:/usr/local/tomcat/content/erddap" \
  -v "$(pwd)/erddap/data:/erddapData" \
  axiom/docker-erddap:latest \
  bash -c "cd webapps/erddap/WEB-INF/ && bash ArchiveADataset.sh -verbose BagIt tar.gz default raw_asset_inventory default \"\" \"\" .nc SHA-256"

MathewBiddle commented 2 years ago

Java file to work on: https://github.com/BobSimons/erddap/blob/master/WEB-INF/classes/gov/noaa/pfel/erddap/ArchiveADataset.java

TODO:

MathewBiddle commented 2 years ago

xref:

MathewBiddle commented 1 month ago

@iamchrisser Did ATN end up using this pathway to generate the files for submission to NCEI?