unity-sds / unity-project-management

Container repo for project management (projects, epics, etc)
Apache License 2.0

Automated Data Cataloging #141

Open mike-gangl opened 3 months ago

mike-gangl commented 3 months ago

Automated Data Cataloging

To make the system more user friendly, we should trigger data cataloging automatically when the result of a stage-out operation (the successful_features.json file) is stored in S3. This should be controlled by an optional parameter (defaulting to enabled) created during the S3 bucket deployment from the marketplace (e.g. "disable auto-cataloging of files").

This removes the need to force a user to add a 'catalog' task to a CWL workflow in order to persist files in the Unity catalog. Doing this accomplishes 3 things:

  1. Removes the annoyance of having to add a cataloging task to the workflow for each application package the user wants to catalog results for
  2. Removes the need for CWL experience/expertise to write the above update
  3. Makes our CWL a bit more portable, as it no longer includes a workflow step reliant on the Unity system (though stage-in and stage-out remain).
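One way the opt-out could be wired up at bucket deployment time is an S3 event notification that is simply omitted when the "disable auto-cataloging" flag is set. The sketch below is an assumption, not the actual marketplace implementation; the topic ARN, prefix, and function names are illustrative. The pure builder is separated from the boto3 call so the logic is easy to test.

```python
def build_notification_config(sns_topic_arn: str, disabled: bool = False) -> dict:
    """Build the S3 notification configuration for stage-out result files.

    An empty configuration means no events are published, i.e. auto-cataloging
    is effectively disabled for the bucket.
    """
    if disabled:
        return {}
    return {
        "TopicConfigurations": [{
            "TopicArn": sns_topic_arn,
            "Events": ["s3:ObjectCreated:*"],
            # Only publish events for stage-out results, not every object.
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "upload_results/"},
                {"Name": "suffix", "Value": ".json"},
            ]}},
        }]
    }
```

Applying it would then be a single call during deployment, e.g. `boto3.client("s3").put_bucket_notification_configuration(Bucket=bucket, NotificationConfiguration=build_notification_config(topic_arn, disabled=flag))`.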

Acceptance Criteria

Acceptance criteria required to implement the epic

Why: Separate data catalog logic from SPS, Workflows. These are specific to "Unity" and wouldn't be run outside of the Unity Context.

Work Tickets

Link to work tickets required to implement the epic

Dependencies

Other epics or outside tickets required for this to work

Associated Risks

links to risk issues associated with this epic

wphyojpl commented 2 months ago
  1. The upload process results in 2 files: successful_features.json and failed_features.json
    1. The user will retry the failed features.
  2. successful_features.json, a FeatureCollection containing the actual granule items, is uploaded to the S3 bucket.
    1. It can be replaced with a catalog.json that holds links to the per-granule STAC files on S3: {"rel": "item", "href": "s3://bucket/some_granules/test_file02.nc.stac.json", "type": "application/json"}
      1. This works because each granule item is identical to the actual STAC metadata file.
    2. It may not work for transient files (L1B) that do not have STAC metadata files.
  3. Uploads should go to a dedicated prefix such as "s3://bucket/upload_results" in S3.
    1. The S3 bucket should emit events when a new file is added under the "upload_results" prefix or folder.
    2. This reduces noise in the events. The entire bucket could be published with a .json extension filter, but that would include the individual STAC metadata files.
    3. catalog.json can be made unique by adding a timestamp to the filename, e.g. catalog.131275543.json
  4. There will be a pipeline: S3 Events -> SNS -> SQS -> Lambda
  5. If the Lambda fails, the message should be passed back to SQS for a retry after 5 minutes, up to 3 times.
  6. Lambda workflow:
    1. The Lambda will check whether all files are available in S3.
      1. If not, it will send the message back to SQS on the assumption that the files will be available soon, although this should not happen 99% of the time.
    2. The Lambda will retrieve the collection ID from the STAC metadata file.
    3. The Lambda will assume that all files from the same catalog.json belong to the same collection, and won't validate this.
    4. The Lambda will check whether the collection exists and will attempt to create it if it does not.
      1. The information needed for a collection is listed below.
        1. This needs further discussion.
      2. If collection creation fails, it will send the message back to SQS and notify the admin via SNS.
    5. The Lambda will submit a CNM request via SNS.
  7. The Granules-to-ES Lambda can write the CNM response to an S3 bucket if needed.
# Information needed for a collection (see 6.4.1), built with the
# UnityCollectionStac builder:
dapa_collection = UnityCollectionStac() \
    .with_id(temp_collection_id) \
    .with_graule_id_regex("^abcd.1234.efgh.test_file.*$") \
    .with_granule_id_extraction_regex("(^abcd.1234.efgh.test_file.*)(\\.data\\.stac\\.json|\\.nc\\.cas|\\.cmr\\.xml)") \
    .with_title(f"{self.granule_id}.data.stac.json") \
    .with_process('stac') \
    .with_provider('unity') \
    .add_file_type(f"{self.granule_id}.data.stac.json", "^abcd.1234.efgh.test_file.*\\.data.stac.json$", 'unknown_bucket', 'application/json', 'root') \
    .add_file_type(f"{self.granule_id}.nc", "^abcd.1234.efgh.test_file.*\\.nc$", 'protected', 'data', 'item') \
    .add_file_type(f"{self.granule_id}.nc.cas", "^abcd.1234.efgh.test_file.*\\.nc.cas$", 'protected', 'metadata', 'item') \
    .add_file_type(f"{self.granule_id}.nc.cmr.xml", "^abcd.1234.efgh.test_file.*\\.nc.cmr.xml$", 'protected', 'metadata', 'item') \
    .add_file_type(f"{self.granule_id}.nc.stac.json", "^abcd.1234.efgh.test_file.*\\.nc.stac.json$", 'protected', 'metadata', 'item')
ngachung commented 2 months ago

Use Case: App Pack Gen no longer has to call catalog after every stage out call

rtapella commented 2 months ago

Is there logging at each step to message errors back up to the mission Operator?

mike-gangl commented 1 month ago

@rtapella that's a concern for sure. What will alert users to an error in the cataloging if it happens? We have some options here, but I wonder what it should look like.

mike-gangl commented 1 month ago

I think the "archive" service is very similar, to be honest. Cumulus has a dashboard with this information, but it's not what we'd want to expose to the other users, I don't think.

rtapella commented 1 month ago

I think it should get pushed up to the Airflow logs as part of processing.

rtapella commented 1 month ago

"successful_features.json is uploaded to S3 Bucket"... where does failed_features.json go @wphyojpl ?

wphyojpl commented 1 month ago

@rtapella

I think the existing logic still remains: it will be stored locally. It is up to the user to fix the problem and upload the files again.

rtapella commented 1 month ago

"Locally" meaning what? In the Unity DS, wherever stage-out writes to?

wphyojpl commented 1 month ago

Yeah, it is stored on the server where the stage-out script is run.

rtapella commented 1 month ago

Is there some sort of error message associated with each failed item?

wphyojpl commented 1 month ago

Yes. The exception messages are included so that the user can fix them.