Need to think through the "manual" archival process (e.g. the non-forward-stream case). I suppose this is essentially submitting jobs to the archive service, but is that done via CNM? An API? A prerequisite for our archive process is that the data must be cataloged in the U-DS first.
I think the main concern is getting enough data to generate the CNM to send to the archive. I'm not sure what the CNM messages we send after a successful (or unsuccessful) catalog operation look like; I'd assume they're missing some required information:
All of that might be in the STAC catalog, or we can enable extensions to capture it.
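For concreteness, here's a rough sketch (as a Python dict) of the DAAC-bound CNM payload we'd eventually need to assemble. Field names follow the public CNM schema as I understand it; the size/checksum entries are the sort of required fields I'd expect to be missing from our current catalog messages, and all values are placeholders:

```python
# Rough sketch of a DAAC-bound CNM payload. Field names per the public
# CNM schema as I understand it; every value below is a placeholder.
daac_cnm = {
    "version": "1.6.1",                     # assumed schema version
    "provider": "unity",                    # placeholder provider id
    "submissionTime": "2024-01-01T00:00:00Z",
    "identifier": "archiver-generated-id",  # see identifier discussion below
    "collection": "DAAC_L1B_COLLECTION",    # the DAAC-side product type
    "product": {
        "name": "granule-0001",
        "dataVersion": "1",
        "files": [{
            "type": "data",
            "name": "granule-0001.nc",
            "uri": "s3://bucket/path/granule-0001.nc",
            "size": 12345,                  # likely missing today
            "checksumType": "md5",          # likely missing today
            "checksum": "0123456789abcdef0123456789abcdef",
        }],
    },
}
```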
William and I looked this over at the UDS tag-up and we have some initial thoughts / questions.
From some recent MAAP meetings it seems there's some change afoot to provide guidance on what gets DAAC'd and what does not.
I think that VEDA will also start to shape NASA policy around what's catalogued locally in an SDS and what goes into a DAAC. Integration with VEDA (which is basically STAC) is something we should look at.
(note: VEDA is not a data archive)
@ngachung
- What if the project only archives a subset of the files (data and metadata) for a granule at the DAAC? Do we want to clean up all files (data, metadata, other metadata, log, etc.) or keep the files that were not archived at the DAAC?
This is explicitly why we are separating the archival of files from the storage of files. The archive configuration needs to be the thing that knows what is sent to an archive and what is not, if only a subset is archived. For example, maybe the STAC entries are updated to include "archive" as an asset role. Let's keep it simple at first, but keep that in mind for the archive configs.
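As a sketch of what that could look like on a STAC item: `roles` is a standard STAC asset field, but using an "archive" role value is the proposal here, not an existing convention.

```python
# Sketch of STAC item assets with archival flagged via an "archive" role.
# "roles" is standard STAC; the "archive" value is the proposal here.
stac_item_assets = {
    "data": {
        "href": "s3://bucket/path/granule-0001.nc",
        "roles": ["data", "archive"],      # sent to the DAAC
    },
    "metadata": {
        "href": "s3://bucket/path/granule-0001.cmr.xml",
        "roles": ["metadata", "archive"],  # sent to the DAAC
    },
    "log": {
        "href": "s3://bucket/path/granule-0001.log",
        "roles": ["metadata"],             # kept in U-DS only
    },
}
```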
https://github.com/unity-sds/unity-project-management/issues/46
Design this and get feedback from the U-DS team ahead of 24.3 PI Planning.
Above is the design (so far) for the archive service components.
Note 1: the components are logical. If we can re-use things that currently exist (e.g. databases), that is fine, but I think they'll probably need a separation of concerns.
Note 2: this also assumes the autocatalog component is in place, though minimal changes would be needed if it's not in place.
After a successful catalog operation, the Cumulus instance will send an SNS response message to any SNS topic listed in the Cumulus collection configuration. For archivable products, the archive service SNS topic will be added to the Cumulus collection configuration (how?).
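Something like the fragment below, assuming the topic list lives in the collection's `meta` block (Cumulus collections do have a free-form `meta` field, but the `cnmResponseTopics` key is hypothetical; where exactly this belongs is part of the "how?"):

```python
# Hypothetical Cumulus collection config fragment. "cnmResponseTopics"
# is an assumed key, not an existing Cumulus field.
collection_config = {
    "name": "MDPS_L1B_PRODUCT",  # placeholder collection name
    "version": "1",
    "meta": {
        "cnmResponseTopics": [
            "arn:aws:sns:us-west-2:123456789012:archive-service-topic",
        ],
    },
}
```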
The archive service will have an archiver function that processes CNM messages from the topic/queue. Its job is to map the fields in the CNM it receives to a DAAC CNM message. It must also map the MDPS product type to the DAAC product type; this mapping needs to be provided by the project (how?). The archiver might need access to the product or STAC catalogs to properly create the CNM.
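A minimal sketch of that mapping, assuming the project supplies the product-type table as simple configuration (all names here are placeholders):

```python
# Sketch of the archiver's field mapping. PRODUCT_TYPE_MAP stands in for
# the project-provided MDPS -> DAAC product-type table; how the project
# delivers that table is still an open question.
PRODUCT_TYPE_MAP = {
    "MDPS_L1B_PRODUCT": "DAAC_L1B_COLLECTION",  # placeholder entry
}

def build_daac_cnm(catalog_cnm: dict, identifier: str) -> dict:
    """Map an inbound catalog CNM to a DAAC-bound CNM (field names per
    the CNM schema as I understand it)."""
    return {
        "version": catalog_cnm.get("version", "1.6.1"),
        "provider": "unity",  # placeholder provider id
        "submissionTime": catalog_cnm["submissionTime"],
        "identifier": identifier,  # generated by the archiver, see below
        "collection": PRODUCT_TYPE_MAP[catalog_cnm["collection"]],
        "product": {
            "name": catalog_cnm["product"]["name"],
            "dataVersion": catalog_cnm["product"]["dataVersion"],
            # Only forward the files flagged for archival (e.g. via the
            # "archive" asset role discussed above).
            "files": [
                f for f in catalog_cnm["product"]["files"]
                if f.get("type") in ("data", "metadata")
            ],
        },
    }
```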
The archiver needs to generate an identifier and use that in the CNM sent to the DAAC. It will store this identifier along with the granule/product information (e.g. URI, project, venue, anything else of value). It will then send the CNM to the DAAC.
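Sketching that with a UUID identifier, a DynamoDB-style tracking table, and an SNS topic as the DAAC-facing endpoint (table/topic names are placeholders; `build_daac_cnm` is the mapping sketch above):

```python
import json
import uuid

import boto3

sns = boto3.client("sns")
table = boto3.resource("dynamodb").Table("archive-tracking")  # placeholder name

def submit_to_daac(catalog_cnm: dict, daac_topic_arn: str) -> str:
    """Generate an identifier, record the submission, and send the CNM."""
    identifier = str(uuid.uuid4())
    daac_cnm = build_daac_cnm(catalog_cnm, identifier)
    # Persist what we'll need to reconcile the DAAC's response later.
    table.put_item(Item={
        "identifier": identifier,
        "granule": catalog_cnm["product"]["name"],
        "uri": catalog_cnm["product"]["files"][0]["uri"],
        "project": "unity",  # placeholder
        "venue": "dev",      # placeholder
        "status": "SUBMITTED",
    })
    sns.publish(TopicArn=daac_topic_arn, Message=json.dumps(daac_cnm))
    return identifier
```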
The archive service will have an SNS/queue setup to receive messages from one or more DAACs. Upon receipt, the Archive Status function uses the identifier embedded in the CNM response to determine the status of an archive job (success, error). It will also handle de-duplication of returned CNM messages if necessary. This function updates the database as well as the data catalog with the archive information.
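A sketch of the status handler, assuming the DAAC sends back a CNM-R style response with a `response.status` field; a conditional write on the tracking table doubles as de-duplication (names are placeholders again):

```python
import json

import boto3

table = boto3.resource("dynamodb").Table("archive-tracking")  # placeholder name

def handle_daac_response(sqs_record: dict) -> None:
    """Process one CNM response message from the DAAC queue."""
    cnm_r = json.loads(sqs_record["body"])
    identifier = cnm_r["identifier"]
    status = cnm_r["response"]["status"]  # e.g. "SUCCESS" or "FAILURE"
    try:
        # The conditional update doubles as de-duplication: a message for
        # an already-finalized identifier is ignored.
        table.update_item(
            Key={"identifier": identifier},
            UpdateExpression="SET #s = :s",
            ConditionExpression="#s = :submitted",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":s": status, ":submitted": "SUBMITTED"},
        )
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        return  # duplicate or late message; already handled
    # Also push the archive status back into the data catalog here
    # (e.g. a hypothetical update_stac_item(identifier, status)).
```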
Open questions: