Design Work for Unity data archive services

mike-gangl commented 7 months ago

https://github.com/unity-sds/unity-project-management/issues/46

Design this and get feedback from U-DS team ahead of 24.3 PI Planning.

Above is the design (so far) for the archive service components.

Note 1: The components are logical. If we can re-use things that currently exist (e.g. databases) that is fine, but I think they'll probably need a separation of concerns.

Note 2: This also assumes the autocatalog component is in place, though minimal changes would be needed if it's not in place.

After a successful catalog operation, the Cumulus instance will send an SNS response message to any SNS topic listed in the Cumulus collection configuration. For archivable products, the 'archive service' SNS topic will be added to the Cumulus collection configuration (how?).

The archive service will have an archiver function that processes CNM messages from the topic/queue. Its job is to map the fields in the CNM it receives to a DAAC CNM message. It must also map from the MDPS product type to the DAAC product type. This information needs to be provided by the project (how?). The archiver might need access to the product or STAC catalogs to properly create the CNM.

The archiver needs to generate an identifier and use that in the CNM sent to the DAAC. It will store this identifier along with the granule/product information (e.g. URI, project, venue, anything else of value). It will then send the CNM to the DAAC.
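A rough sketch of the archiver step described above, to make the mapping concrete. Everything named here is an assumption for illustration: `PRODUCT_TYPE_MAP`, `build_daac_cnm`, the record fields, and the example collection names are hypothetical, and the CNM field names (`collection`, `identifier`, `product.files`, etc.) follow my reading of the CNM schema and should be checked against what Cumulus actually emits.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical project-supplied mapping: MDPS product type -> DAAC product type.
PRODUCT_TYPE_MAP = {
    "urn:nasa:unity:example:L1B": {"collection": "EXAMPLE_L1B", "data_version": "1"},
}


def build_daac_cnm(uds_cnm, type_map, provider, project, venue):
    """Map a U-DS CNM to a DAAC-bound CNM; return (daac_cnm, archive_record)."""
    source_type = uds_cnm["collection"]
    if source_type not in type_map:
        # Not configured for archive: surface it instead of guessing a mapping.
        raise KeyError(f"no DAAC mapping configured for {source_type}")
    target = type_map[source_type]
    product = uds_cnm["product"]

    # Identifier generated by the archiver; the DAAC response should echo it.
    identifier = str(uuid.uuid4())

    daac_cnm = {
        "version": uds_cnm.get("version", "1.6.0"),  # assumed default version
        "provider": provider,
        "collection": target["collection"],
        "identifier": identifier,
        "submissionTime": datetime.now(timezone.utc).isoformat(),
        "product": {
            "name": product["name"],
            "dataVersion": target["data_version"],
            "files": [dict(f) for f in product["files"]],
        },
    }

    # What the archive service would store to track this job.
    archive_record = {
        "identifier": identifier,
        "status": "in_progress",
        "source_collection": source_type,
        "daac_collection": target["collection"],
        "uris": [f["uri"] for f in product["files"]],
        "project": project,
        "venue": venue,
    }
    return daac_cnm, archive_record
```

The record would be written to the archive database before the CNM is published, so a response can never arrive for an identifier we don't know about.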

The archive service will have an SNS/Queue setup to retrieve messages from one or more DAACs. Upon receipt, the Archive Status function uses the identifier embedded in the CNM to determine the status of an archive job (success, error). It will also handle de-duplication of returned CNM messages if necessary. This function updates the database as well as the data catalog with the archive information.
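A minimal sketch of what the Archive Status function could look like, assuming the DAAC response CNM carries the original `identifier` plus a `response` object with a `status` of `SUCCESS` or `FAILURE` (again my reading of the CNM response format, to be verified). The function name `apply_daac_response` is hypothetical, a plain dict stands in for the archive database, and the de-duplication rule (ignore any message for a job that already reached a final state) is just one possible choice.

```python
FINAL_STATES = ("success", "error")


def apply_daac_response(archive_db, response_cnm):
    """Update the archive record for a DAAC response; return the outcome."""
    identifier = response_cnm["identifier"]
    record = archive_db.get(identifier)
    if record is None:
        # Response for a job we never submitted (or a lost record).
        return "unknown"
    if record["status"] in FINAL_STATES:
        # De-duplication: this job already has a final status, so skip it.
        return "duplicate"

    response = response_cnm.get("response", {})
    if response.get("status", "").upper() == "SUCCESS":
        record["status"] = "success"
    else:
        record["status"] = "error"
        record["error"] = response.get("errorMessage")
    return record["status"]
```

The data catalog update would hang off the same function once a final status is written; it's left out here since we haven't decided where that information lives.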

Open questions:

Do we update the Cumulus database schema to host archive information (e.g. archive status: in progress, success, error, not-for-archive)? Or should it be a link to a record in the archive service?
Eventually we might want to "clean up" files in the project venue once archived and pull them back when needed. How would we enable that? Would we update the U-DS data catalog with the granule locations?
What's the error scenario? If a file is not configured correctly to send to a DAAC (e.g. missing required metadata) or if a file fails ingest at the DAAC, it's easy enough to flag that in the archive DB, but how do we notify the operator? This same question probably arises in the automatic cataloging scenario when an error occurs.

mike-gangl commented 6 months ago

Need to think about the "manual" process for archival (e.g. non-forward stream). I suppose this is essentially submitting jobs to the archive service, but is that done via CNM? An API? A prerequisite for our archive process is that the data must be cataloged in the U-DS first.

mike-gangl commented 6 months ago

I think the main concern is getting enough data to generate the CNM to send to the archive. Not sure what the CNM messages we are sending after a successful (or unsuccessful) catalog look like. I'd assume they're missing some required information:

DAAC Collection
file checksums
file sizes
data version
Mapping the files in the U-DS to the files (and their subtypes) needed at the DAAC.

All of that might be in the STAC catalog, or we can enable extensions to capture it.
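To illustrate pulling that information out of the STAC catalog, here is a sketch that reads checksums, sizes, and data version from a STAC item and reports what is missing. It assumes the items use the STAC File Info extension (`file:checksum`, `file:size` on assets) and the Version extension (`version` in properties); whether our catalog populates those is exactly the open question. The function name `cnm_files_from_stac` is hypothetical.

```python
def cnm_files_from_stac(item):
    """Return (files, data_version, missing) gathered from a STAC item dict."""
    files = []
    missing = []
    for key, asset in item.get("assets", {}).items():
        entry = {
            "name": key,
            "uri": asset.get("href"),
            "checksum": asset.get("file:checksum"),  # File Info extension
            "size": asset.get("file:size"),          # File Info extension
        }
        for field in ("uri", "checksum", "size"):
            if entry[field] is None:
                missing.append(f"{key}.{field}")
        files.append(entry)

    # Version extension; absent means we cannot fill the CNM data version.
    data_version = item.get("properties", {}).get("version")
    if data_version is None:
        missing.append("properties.version")
    return files, data_version, missing
```

A non-empty `missing` list is the "not configured correctly to send to a DAAC" case from the open questions, so it would be recorded as an archive error rather than sent on.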

ngachung commented 6 months ago

William and I looked this over at UDS tag up and we have some initial thoughts / questions.

What if the project only archives a subset of the files (data and metadata) for a granule at the DAAC? Do we want to clean up all files (data, metadata, other metadata, log, etc.) or save the files that were not archived at DAAC?
For forward stream, user can update the collection via API to turn on DAAC delivery and provide required information such as DAAC collection, DAAC SNS.
For non-forward stream, user can search for UDS granules they want to send to DAAC and invoke an archive API with the same search parameters.

rtapella commented 6 months ago

From some recent MAAP meetings it seems there's some change afoot to provide guidance on what gets DAAC'd and what does not.

I think that VEDA will also start to shape some NASA policy around what's catalogued local to an SDS and what goes into a DAAC. Integration with VEDA (which is basically STAC) is something we should look at.

rtapella commented 6 months ago

(note: VEDA is not a data archive)

mike-gangl commented 6 months ago

@ngachung

What if the project only archives a subset of the files (data and metadata) for a granule at the DAAC? Do we want to clean up all files (data, metadata, other metadata, log, etc.) or save the files that were not archived at DAAC?

This is explicitly why we are separating the archive of files from the storage of files. The configuration for archival needs to be the thing that knows what is sent to an archive and what is not, if they are limited. For example, maybe the STAC entries are updated to include "archive" as an asset role. Let's be simple at first, but keep that in mind for the archive configs.
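As a sketch of the asset-role idea: STAC assets already carry a `roles` list, so selecting what goes to the DAAC could be a simple filter. The `archive` role value and the function name `split_by_archive_role` are assumptions for illustration, not anything we've agreed on.

```python
def split_by_archive_role(item, role="archive"):
    """Split a STAC item's assets into (to_archive, storage_only) dicts."""
    to_archive = {}
    storage_only = {}
    for key, asset in item.get("assets", {}).items():
        if role in asset.get("roles", []):
            to_archive[key] = asset
        else:
            storage_only[key] = asset
    return to_archive, storage_only
```

Keeping the non-archived assets in a separate result makes the clean-up question explicit: only the first group has a copy at the DAAC, so only that group is safe to remove from the venue.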

Design Work for Unity data archive services #172