podaac / hydrocron

API for retrieving time series of SWOT data
https://podaac.github.io/hydrocron/
Apache License 2.0

Track status of granule ingest #71

Open frankinspace opened 11 months ago

frankinspace commented 11 months ago

Need to be able to determine which granules have been loaded into hydrocron and ensure that every granule ingested into CMR is also loaded in the database.

nikki-t commented 8 months ago

[architecture diagram: track-status-granule-ingest-short]

Here is a draft architecture for tracking the status of which granules have been ingested into Cumulus and which have been inserted into the Hydrocron database.

I think we need to discuss:

torimcd commented 8 months ago

Another thing to consider is that the relationship between granules and records in hydrocron is 1:many, since there are hundreds of river reaches in each reach granule, and thousands of nodes in each node granule. Every record in hydrocron is just one reach or node.

It is possible for some writes to succeed and others to fail when processing a granule, so we may need to also check that all the features get written. The number of features in each granule varies, but it should be constant over subsequent passes, so we could do something like have the number of feature ids expected for each pass hardcoded somewhere and then check that there are the same number for those pass ids? Or we could log the number of features in the shapefile when it's first opened and check against that number when querying for the granule name?
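A minimal sketch of the feature-count check described above (function names and ID lists are hypothetical, not Hydrocron's actual API). The expected IDs would come from logging the shapefile contents when the granule is first opened; the stored IDs would come from querying Hydrocron on the granule name:

```python
# Hypothetical sketch: verify that every feature in a granule's shapefile
# was actually written to the Hydrocron database.

def missing_features(expected_ids, stored_ids):
    """Return feature IDs present in the shapefile but absent from Hydrocron."""
    return sorted(set(expected_ids) - set(stored_ids))

def granule_fully_ingested(expected_ids, stored_ids):
    """A granule is fully ingested only when every feature record was written."""
    return len(missing_features(expected_ids, stored_ids)) == 0
```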

nikki-t commented 8 months ago

@torimcd - That is a great point! So we need to track that all features in a granule have been written to the Hydrocron database.

We can create a map of pass identifiers and associate them with reach and node identifiers. Then we can check whether the number of features in a granule matches the number of records stored in the Hydrocron database for the specific cycle_id and pass_id present in the granule filename.

We can also keep track of missing features and modify the load_granule module to accept a granule shapefile and feature ID and load data only for that feature ID. OR we can just submit the entire granule for a rewrite to the Hydrocron database (assuming that we want to take some action with the missing granules and/or features).
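A sketch of the pass-to-feature-count check described above. The expected counts and the filename layout are assumptions for illustration; the real counts would be derived from the pass-to-identifier map, and the real SWOT filename convention may differ:

```python
# Hypothetical sketch: compare the stored record count for a (cycle_id,
# pass_id) pair against the count expected for that pass. The values in
# EXPECTED_FEATURES_PER_PASS are invented placeholders.

EXPECTED_FEATURES_PER_PASS = {("001", "123"): 250, ("001", "124"): 198}

def parse_cycle_pass(granule_name):
    # Assumes a layout like SWOT_L2_HR_RiverSP_Reach_<cycle>_<pass>_...;
    # the actual convention should be confirmed against real granule names.
    parts = granule_name.split("_")
    return parts[5], parts[6]

def count_matches(granule_name, stored_count):
    """True when the Hydrocron record count equals the expected feature count."""
    cycle_id, pass_id = parse_cycle_pass(granule_name)
    return EXPECTED_FEATURES_PER_PASS.get((cycle_id, pass_id)) == stored_count
```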

frankinspace commented 6 months ago

Just noticed we did log a distinct ticket for step 4, the delete feature: https://github.com/podaac/hydrocron/issues/140

nikki-t commented 6 months ago

Notes on next steps:

  1. Understand how we might query CMR only for recently ingested granules without missing any that may have been reprocessed. The goal is to not query CMR for all SWOT granules every time the track ingest workflow executes.
  2. We may want to stand up a second DynamoDB table to keep track of which granules have been validated. This should include a timestamp and a count of feature IDs which is determined for each granule during the track ingest operations.
  3. We should update the architecture to create and deliver a CNM message to the CNM Lambda function for cases where the workflow identifies a missing granule in the Hydrocron database.
  4. Create a proof of concept to maintain state using the CMR revision_date to understand how to formulate a CMR query on newly available granules so that we do not miss any that may have been reprocessed.
nikki-t commented 6 months ago

Here is an updated architecture based on the tag up next steps and a small proof of concept I completed.

[architecture diagram: track-status-granule-ingest-v2]

I believe that we can query by a temporal range that does not include the entire SWOT River collection. Instead we can save the revision_date timestamp in a separate DynamoDB table.

We can retrieve the most recent revision_date from the table each time the track ingest workflow runs and use that as the starting date to a CMR query.

I am letting the proof of concept run over many hours to see how this might work. The proof of concept saves the revision_date to a JSON file mimicking items in a DynamoDB table.
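A small sketch of the proof-of-concept state file described above: persist the most recent revision_date to JSON, mimicking an item in a DynamoDB table, then read it back as the start of the next CMR query window. The file name and key are assumptions:

```python
import json
from pathlib import Path

# Hypothetical state file standing in for a DynamoDB item.
STATE_FILE = Path("revision_state.json")

def save_revision_date(revision_date):
    """Persist the most recent CMR revision_date seen by the workflow."""
    STATE_FILE.write_text(json.dumps({"last_revision_date": revision_date}))

def load_revision_start():
    """Return the stored revision_date, or None on the first run."""
    if not STATE_FILE.exists():
        return None
    return json.loads(STATE_FILE.read_text())["last_revision_date"]
```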

nikki-t commented 5 months ago

I think I have worked out the logic around querying CMR for a range and not returning the entire SWOT collection. The idea is to:

1) Get a revision start date by querying the track-status table, which records which Hydrocron granules (by granule name) have a status of "ingested" or "to_ingest", each granule's revision date, and the granule's feature count (number of identifiers). The revision start date is the maximum revision date in the table, i.e. the last revision date stored from previous CMR queries.

2) Query CMR with the revision start date and an end date of either the current time or the current time minus some hours, to prevent a race condition with active Hydrocron granule ingests (the CNM Lambda).

3) Query Hydrocron and return granules with a to_ingest status or cases where the granule's revision date falls into the same range as the step 2 CMR query (as this will pull in any granules that may have changed).

4) Determine the overlap between the step 2 CMR query and the step 3 Hydrocron query, and create a list of granules that have not been ingested (i.e. do not exist in the track-status database).

5) Count the number of features for each granule in the "to_ingest" list gathered in step 3 to determine whether the granule has been fully ingested; if it has, set its status to "ingested". If it has not been fully ingested, add it to the list from step 4.

6) Create a CNM message and publish it to the appropriate SNS topic to kick off Hydrocron ingestion for the granules in the list created in step 4 and modified in step 5.
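The steps above can be condensed into a sketch like the following. All query functions are hypothetical placeholders for the real CMR, Hydrocron, and track-status calls, and the lag-hours buffer is an assumed value:

```python
from datetime import datetime, timedelta, timezone

# Assumed buffer to avoid racing active CNM Lambda ingests (steps 1-4 above).
LAG_HOURS = 2

def find_granules_to_ingest(query_track_status, query_cmr, query_hydrocron):
    # Step 1: revision start date = max revision date stored so far.
    start = query_track_status()
    # Step 2: end the window before "now" to avoid the race condition.
    end = datetime.now(timezone.utc) - timedelta(hours=LAG_HOURS)
    cmr_granules = set(query_cmr(start, end))
    # Step 3: granules marked to_ingest or revised within the same window.
    hydrocron_granules = set(query_hydrocron(start, end))
    # Step 4: granules CMR knows about that the track-status table does not.
    return cmr_granules - hydrocron_granules
```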

I think with this logic in place we can proceed with defining the full architecture. @frankinspace and @torimcd - let me know what you think!

frankinspace commented 5 months ago

May need to add in tracking the file checksum in order to avoid re-ingest of granules when only metadata has changed (which causes a new revision in CMR).
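A minimal sketch of that checksum guard, assuming the stored checksum comes from the track-status table and the new one from the CMR query (function name is hypothetical):

```python
# Hypothetical check: a new CMR revision only triggers re-ingest when the
# data file's checksum actually changed; metadata-only revisions keep the
# same checksum and are skipped.

def needs_reingest(stored_checksum, cmr_checksum):
    """Re-ingest when the granule is new or its data file changed."""
    return stored_checksum is None or stored_checksum != cmr_checksum
```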

frankinspace commented 5 months ago

Needs to take into account the fact that river/node are in a distinct collection from the prior lake collection.

nikki-t commented 5 months ago

Updated logic to accommodate file checksum to avoid re-ingest of granules already ingested:

  1. Get the revision start time (the most recent date stored in the track-ingest database).
  2. Get the revision end time (the current date and time).
  3. Query CMR on the revision_date parameter with a temporal range from the step 1 start time to the step 2 end time.
     a. Store checksums for each granule.
  4. Query Hydrocron for the granule URs returned by the CMR query (we cannot query on a time range given the current structure of the DynamoDB tables).
  5. Determine the overlap between Hydrocron and CMR. The overlap is determined by step 4 since we cannot query on a time range.
     a. Create a "to_ingest" dictionary with filename as key and CMR checksum as value.
  6. Query track-status for all granules with "to_ingest" status and add them to the "to_ingest" dictionary (step 5).
  7. Count the features of all granules in the "to_ingest" dictionary and determine whether all features have been ingested into Hydrocron.
     a. If the granule has not been fully ingested, keep it in the "to_ingest" dictionary (created in step 5 and modified in step 6).
     b. If the granule has been fully ingested, add it to an "ingested" dictionary (same structure as "to_ingest") and remove it from the "to_ingest" dictionary.
  8. For all granules in the "to_ingest" dictionary, create and publish a CNM message to the CNM Lambda to trigger Hydrocron ingestion.
     a. Report on the granules to be ingested.
     b. Insert the granules into the track-status database with a "to_ingest" status.
  9. Insert or update all granules from the "ingested" dictionary (created in step 7b) in the track-status database with a status of "ingested".
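Steps 7-9 above amount to splitting the "to_ingest" dictionary into fully ingested granules and granules still needing a CNM message. A sketch, with the counting inputs standing in for the real Hydrocron and track-status queries:

```python
# Hypothetical sketch: move fully ingested granules out of the "to_ingest"
# dictionary into an "ingested" dictionary (same filename -> checksum shape).
# expected_counts / actual_counts stand in for the real feature-count queries.

def reconcile(to_ingest, expected_counts, actual_counts):
    """Split to_ingest into (still to_ingest, ingested) by feature count."""
    ingested = {}
    for granule, checksum in list(to_ingest.items()):
        if actual_counts.get(granule, 0) >= expected_counts.get(granule, 0) > 0:
            ingested[granule] = checksum
            del to_ingest[granule]
    return to_ingest, ingested
```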

Running for the first time and populating track-status:

  1. Get all granules from CMR.
  2. Get all granules from Hydrocron.
  3. Determine the overlap; checksums are taken from CMR.
  4. All granules should be inserted into the track-status database with a status of "to_ingest"; then, when the track-status Lambda executes a second time, the features are counted for each granule and the granules are updated in the track-status database with a status of "ingested".
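A sketch of that first-run bootstrap, with the input lists standing in for the full CMR and Hydrocron queries (names and record shape are assumptions, not the actual table schema):

```python
# Hypothetical bootstrap: with no prior state, every granule in the
# CMR/Hydrocron overlap is recorded as "to_ingest" with its CMR checksum;
# feature counts are verified on the next execution of the Lambda.

def bootstrap(all_cmr, all_hydrocron, cmr_checksums):
    """Build the initial track-status items for the first run."""
    overlap = set(all_cmr) & set(all_hydrocron)
    return {g: {"status": "to_ingest", "checksum": cmr_checksums[g]}
            for g in overlap}
```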