frankinspace opened 11 months ago
Here is a draft architecture for tracking the status of which granules have been ingested into Cumulus and which have been inserted into the Hydrocron database.
I think we need to discuss: how to handle granules that are missing from Hydrocron, e.g. re-ingesting them via the `load_granule` Lambda or sending a list in an email to the OPS team.

torimcd commented:

Another thing to consider is that the relationship between granules and records in Hydrocron is 1:many, since there are hundreds of river reaches in each reach granule and thousands of nodes in each node granule. Every record in Hydrocron is just one reach or node.
It is possible for some writes to succeed and others to fail when processing a granule, so we may also need to check that all the features get written. The number of features in each granule varies, but it should be constant over subsequent passes, so we could hardcode the expected feature count for each pass somewhere and then check that the same number of records exists for those pass IDs? Or we could log the number of features in the shapefile when it is first opened and check against that number when querying for the granule name?
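A minimal sketch of the second option (log the count when the shapefile is first opened, then compare against what actually landed in the database), assuming `fiona` for reading the shapefile and `boto3` for DynamoDB. The table name, index name, and attribute names are placeholders, not Hydrocron's actual schema:

```python
import fiona
import boto3
from boto3.dynamodb.conditions import Key

def count_shapefile_features(shapefile_path: str) -> int:
    """Count features in the granule shapefile when it is first opened."""
    with fiona.open(shapefile_path) as collection:
        return len(collection)

def count_loaded_features(granule_ur: str, table_name: str = "hydrocron-track-status") -> int:
    """Count records already written for this granule (hypothetical GSI on granule name)."""
    table = boto3.resource("dynamodb").Table(table_name)
    response = table.query(
        IndexName="granule_ur-index",  # assumed index, not the real schema
        KeyConditionExpression=Key("granule_ur").eq(granule_ur),
        Select="COUNT",
    )
    # Large result sets would need pagination; omitted for the sketch.
    return response["Count"]

def granule_fully_loaded(shapefile_path: str, granule_ur: str) -> bool:
    return count_shapefile_features(shapefile_path) == count_loaded_features(granule_ur)
```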
@torimcd - That is a great point! So we need to track that all features in a granule have been written to the Hydrocron database.
We can create a map of pass identifiers and associate them with reach and node identifiers. Then we can check whether the number of features in a granule matches the number of records stored in the Hydrocron database for the specific cycle_id and pass_id present in the granule filename.
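As a sketch of pulling the cycle and pass identifiers out of a granule filename: the pattern below is an assumption based on SWOT river granule names of the form `SWOT_L2_HR_RiverSP_Reach_<cycle>_<pass>_...`, not a verified spec.

```python
import re

# Assumed filename convention; verify against the actual collection before use.
GRANULE_PATTERN = re.compile(
    r"SWOT_L2_HR_RiverSP_(?P<feature_type>Reach|Node)_"
    r"(?P<cycle_id>\d{3})_(?P<pass_id>\d{3})_"
)

def parse_cycle_pass(granule_ur: str) -> tuple[str, str]:
    """Extract (cycle_id, pass_id) from a granule name."""
    match = GRANULE_PATTERN.search(granule_ur)
    if match is None:
        raise ValueError(f"Unrecognized granule name: {granule_ur}")
    return match["cycle_id"], match["pass_id"]
```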
We can also keep track of missing features and modify the `load_granule` module to accept a granule shapefile and feature ID and load data only for that feature ID. Or we can just submit the entire granule for a rewrite to the Hydrocron database (assuming that we want to take some action with the missing granules and/or features).
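A hypothetical sketch of what that `load_granule` change could look like; this is not the module's current signature, and the field names (`reach_id`/`node_id`) are assumptions:

```python
import fiona

def load_granule(shapefile_path: str, table, feature_id: str | None = None) -> int:
    """Load all features from a granule shapefile, or only the one matching feature_id."""
    loaded = 0
    with fiona.open(shapefile_path) as collection:
        for feature in collection:
            props = feature["properties"]
            fid = str(props.get("reach_id") or props.get("node_id"))
            if feature_id is not None and fid != feature_id:
                continue  # rewrite only the missing feature
            # Simplified write; real code would convert floats to Decimal for DynamoDB.
            table.put_item(Item={"feature_id": fid, **props})
            loaded += 1
    return loaded
```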
Just noticed we did log a distinct ticket for step 4, the delete feature: https://github.com/podaac/hydrocron/issues/140
Notes on next steps:
Here is an updated architecture based on the next steps from the tag-up and a small proof of concept I completed.
I believe we can query CMR by a revision-date range rather than paging through the entire SWOT river collection. To do this we can save the revision_date timestamp in a separate DynamoDB table.
We can retrieve the most recent revision_date from the table each time the track-ingest workflow runs and use that as the starting date for a CMR query.
I am letting the proof of concept run over many hours to see how this might work. The proof of concept saves the revision_date to a JSON file mimicking items in a DynamoDB table.
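A minimal sketch of the bookkeeping the proof of concept does, with a JSON file standing in for the DynamoDB table; the item shape and the first-run default date are assumptions:

```python
import json
from pathlib import Path

STATE_FILE = Path("track_status_poc.json")  # stand-in for the DynamoDB table

def save_revision_date(granule_ur: str, revision_date: str) -> None:
    """Append an item mimicking a DynamoDB record."""
    items = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else []
    items.append({"granule_ur": granule_ur, "revision_date": revision_date})
    STATE_FILE.write_text(json.dumps(items, indent=2))

def latest_revision_date() -> str:
    """Return the max revision_date seen so far; used as the next CMR query start."""
    items = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else []
    if not items:
        # First run: fall back to an assumed default (SWOT launch date).
        return "2022-12-16T00:00:00Z"
    return max(item["revision_date"] for item in items)
```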
I think I have worked out the logic around querying CMR for a range and not returning the entire SWOT collection. The idea is to:
1) Get a revision start date by querying the track-status table, which tracks which Hydrocron granules (by granule name) have a status of "ingested" or "to_ingest", along with each granule's revision date and its feature count (number of identifiers). The revision start date is determined by querying the revision dates in the database and returning the max date, i.e. the last revision date stored from previous CMR queries.
2) Query CMR with the revision start date and an end datetime of either the current time or the current time minus some hours, to prevent a race condition with in-flight Hydrocron granule ingests (the CNM Lambda); see the sketch after this list.
3) Query Hydrocron and return granules with a "to_ingest" status, or granules whose revision date falls in the same range as the step 2 CMR query (as this will pull in any granules that may have changed).
4) Compare the step 2 CMR results with the step 3 Hydrocron results and create a list of granules that have not been ingested (i.e. that do not exist in the track-status database).
5) Count the number of features in the "to_ingest" list gathered for each granule returned in step 3 to determine whether the granule has been fully ingested; if it has, set its status to "ingested". If it has not been fully ingested, add it to the list from step 4.
6) Create a CNM message and publish it to the appropriate SNS topic to kick off Hydrocron ingestion for each granule in the list created in step 4 and amended in step 5.
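Here is a sketch of the step 2 CMR query, using CMR's `revision_date` range parameter. The short name, buffer size, and single-page shortcut are assumptions; a real run would page through results with CMR's search-after mechanism.

```python
from datetime import datetime, timedelta, timezone
import requests

CMR_GRANULE_URL = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"

def query_cmr_revisions(revision_start: str,
                        short_name: str = "SWOT_L2_HR_RiverSP_2.0",
                        buffer_hours: int = 2) -> list[dict]:
    """Return granule metadata revised between revision_start and now minus a buffer.

    The buffer guards against the race with in-flight CNM ingests (step 2).
    """
    revision_end = (datetime.now(timezone.utc) - timedelta(hours=buffer_hours)) \
        .strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {
        "short_name": short_name,
        "revision_date": f"{revision_start},{revision_end}",
        "page_size": 2000,
    }
    response = requests.get(CMR_GRANULE_URL, params=params, timeout=30)
    response.raise_for_status()
    # Result sets larger than one page would need CMR-Search-After paging.
    return response.json()["items"]
```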
I think with this logic in place we can proceed with defining the full architecture. @frankinspace and @torimcd - let me know what you think!
May need to add tracking of the file checksum in order to avoid re-ingesting granules when only metadata has been changed (which causes a new revision in CMR).
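A sketch of that checksum guard, assuming the data file checksum is exposed in the granule's UMM-G record under `DataGranule.ArchiveAndDistributionInformation` (field names may differ):

```python
def data_file_checksum(umm_granule: dict) -> str | None:
    """Pull the first data file checksum out of a UMM-G record, if present."""
    for info in umm_granule.get("DataGranule", {}).get("ArchiveAndDistributionInformation", []):
        checksum = info.get("Checksum")
        if checksum:
            return checksum.get("Value")
    return None

def needs_reingest(umm_granule: dict, stored_checksum: str | None) -> bool:
    """A new revision with an unchanged checksum is a metadata-only change: skip it."""
    return data_file_checksum(umm_granule) != stored_checksum
```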
Needs to take into account the fact that river/node granules are in a distinct collection from the prior lake collection.
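That could look something like running the same workflow once per collection; the short names below are assumptions, and the helpers are the ones sketched earlier in this thread:

```python
# River reach/node granules live in a different CMR collection than lake granules,
# so the track-ingest workflow loops over both. Short names are assumptions.
TRACKED_COLLECTIONS = [
    "SWOT_L2_HR_RiverSP_2.0",  # river reach and node granules
    "SWOT_L2_HR_LakeSP_2.0",   # prior lake granules
]

def run_track_ingest() -> None:
    for short_name in TRACKED_COLLECTIONS:
        revision_start = latest_revision_date()  # step 1; per-collection in practice
        revised = query_cmr_revisions(revision_start, short_name=short_name)  # step 2
        # steps 3-6: compare against track-status, check feature counts,
        # and publish CNM messages for anything missing or incomplete
```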
Updated logic to use the file checksum to avoid re-ingesting granules that have already been ingested:
Running for the first time and populating track-status.
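A sketch of that first run, reusing the helpers above: with an empty track-status table there is no stored revision date, so bootstrap from an assumed mission-start date and mark everything "to_ingest". The item shape is illustrative, not the real schema:

```python
def bootstrap_track_status(table) -> None:
    """Populate an empty track-status table from a full-collection CMR query."""
    for item in query_cmr_revisions("2022-12-16T00:00:00Z"):  # SWOT launch date, assumed default
        umm = item["umm"]
        table.put_item(Item={
            "granule_ur": umm["GranuleUR"],
            "status": "to_ingest",
            "revision_date": item["meta"]["revision-date"],
            "checksum": data_file_checksum(umm),
        })
```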
Need to be able to determine which granules have been loaded into Hydrocron and ensure that every granule ingested into CMR is also loaded in the database.