[New Feature]: DISP-S1 Support for Validator Tool

riverma commented 3 months ago

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Describe the feature request

Sample logic:

Query for a set of CSLC products available between START_TIME and END_TIME
For each CSLC product that is available, for example with the ID t087_012345_iw2, use the burst_to_frame.json file to locate the frame IDs that correspond to this CSLC product.
- Result: The frame ID is 45.
Next, use the frame_to_burst.json file to identify all the CSLC burst IDs expected for this frame.
- There are 27 bursts associated with this frame, for example: [t087_012340_iw1, …].
Verify whether all the 27 CSLC burst products have been generated or are available.
- If the 27 CSLC bursts have been generated, we expect a corresponding DISP-S1 product with frame: 45 and those bursts to be referenced
- If there is a missing burst, then we expect to have skipped generating the corresponding DISP-S1 product with frame 45.

Some key resources needed:

Access to CMR for CSLC and DISP-S1 queries
The frame_to_burst.json or burst_to_frame.json
Metadata in DISP-S1 product that lists input CSLCs used

riverma commented 3 months ago

@philipjyoon - did I capture the logic correctly? The above would apply for FWD or HIST regardless, assuming enough time has passed.

philipjyoon commented 3 months ago

@riverma There are a few more dimensions to this:

Not all frames are produced using 27 bursts. Instead of burst_to_frame.json and frame_to_burst.json we should use opera-disp-s1-consistent-burst-ids-with-datetimes.json which contains the real burst pattern information. This is the file OPERA PCM uses - OPERA PCM does not use the former two files mentioned.
In addition to Frame ID we need to also group and reason over by acquisition time index. Within a 12+ day window you can end up with more than one set of CSLC bursts that belong to the same Frame ID but belong to a different DISP-S1 product.
We need to account for Compressed CSLC availability before deciding whether or not a CSLC should have been part of existing DISP-S1 products. We can perhaps get around this by making the end date window of the validation to be, say, at least 24 days from the current date. This way some of the lag in the system would have been resolved at the time of validation.

We discussed one more dimension which we hadn't decided whether it was worth the complexity: verifying the K- and M- files used as input files in producing the DISP-S1 products. There are two ways to look at those K- and M- input files:

These are just another types of ancillary files like DEM and Ionosphere files. In this case we can take the precedence set in previous CMR audits and not validate these ancillary files.
K- and M- are uniquely critical in DISP-S1 product quality beyond the ancillary files. In this view, we should reason over and validate the input files listed in the DISP-S1 product metadata.

(to be continued... I'll write out what I think should be the overall logic tomorrow morning)

philipjyoon commented 3 months ago

Sample logic:

Query for a set of CSLC products available between START_TIME and END_TIME
For each CSLC product that is available, we want to group them by Frame ID and then by Acquisition Day Index. I would use a Python dictionary of a dictionary of a list to do this. This function can be used to determine those two: https://github.com/nasa/opera-sds-pcm/blob/develop/data_subscriber/cslc_utils.py#L312
- So you would have something like: { 45: {600: [CLSC1, CSLC2, ...], 612: [CSLC1, ...]}, 48: {...}}
- We could also put every logic in this data structure and make it a dict of a dict of a dict: the last dict would map Burst ID to CSLC Native ID which would simplify Step 3 below a bit.
Next, iterate that data structure per Frame ID per Acquisition Day Index. You will end up with w list of CSLC IDs that may be complete for DISP-S1 triggering. So we evaluate each of these lists:
- For every item in each list, determine the Burst ID, using the function above, and then create a unique hashset of them
  - If we wish to also validate the input files of DISP-S1 products, we would use a dict mapping Burst ID to CSLC Native ID instead of using a hashset here. We would also have to evaluate the production time at time of insertion to make sure that we are tracking the latest CSLC file in case of Burst ID collision.
- Looking up that Frame ID using opera-disp-s1-consistent-burst-ids-with-datetimes.json determine whether the number of bursts find matches the number is required. If so, a corresponding DISP-S1 product should have been created.
- DISP-S1 Native ID from CMR looks something like this OPERA_L3_DISP-S1_IW_F03050_VV_20240709T000000Z_20240814T000000Z_v0.3_20240815T133432Z This is documented here: https://github.com/nasa/opera-sds-pcm/blob/develop/conf/pge_outputs.yaml#L152 The two important fields are the Frame ID and the "sec_time" which I believe is the Sensing Time or the Acquisition Time.
- We can then concoct a native id pattern to find the corresponding DISP-S1 product from CMR. I will be something like OPERA_L3_DISP-S1_IW_F03050_VV_20240709T000000Z* Use that pattern to query CMR to find that product.
  - A tricky part: Note that the acquisition time used here only has a day precision - the time has been stripped away. Each CSLC burst is acquired within tens of seconds of each other so it's possible that some may cross the day boundary. Therefore, if we don't find a DISP-S1 product using the exact day, we should also search +- one day. This is rare but possible.

Comparison Options:

As mentioned previously if we also want to take M dependency into account, we would have to expand this logic accordingly.
Above logic only checks whether the right DISP-S1 has been produced but does not check whether it was produced using all the correct CSLC input files. To perform the latter, we need to obtain the full metadata. This is not available from CMR to my knowledge. We can obtain in two ways:
1. Download the actual DISP-S1 product from ASF DAAC, open it up, and extract out the full metadata. This is costly since these are large files and also we would need to write code to open these files up. The PCM does not have such code base right now.
2. If we are running this validation on a cluster that contains these products, it would be much better to query the GRQ ES for each product metadata. This would be orders of magnitude cheaper, faster, and easier than above option.

Some key resources needed:

Access to CMR for CSLC and DISP-S1 queries
The frame_to_burst.json or burst_to_frame.json
(Possibly) Metadata in DISP-S1 product that lists input CSLCs used

riverma commented 3 months ago

@philipjyoon - thank you so much for writing out these excellent and clear points! Extremely helpful.

I have a few follow-up questions:

We can then concoct a native id pattern to find the corresponding DISP-S1 product from CMR. I will be something like OPERA_L3_DISP-S1_IW_F03050_VV_20240709T000000Z* Use that pattern to query CMR to find that product. A tricky part: Note that the acquisition time used here only has a day precision - the time has been stripped away. Each CSLC burst is acquired within tens of seconds of each other so it's possible that some may cross the day boundary. Therefore, if we don't find a DISP-S1 product using the exact day, we should also search +- one day. This is rare but possible.

Hmm, can't we just use the same strategy we did for DSWx-S1? Namely:

Get a listing of all CSLC products between START and END, call this "LIST A"
Go through the logic above to get a list of DISP-S1 frames (grouped by acquisition time) that have complete CSLC coverage. Call the full list of CSLCs that cover full frames list "LIST B"
Query CMR for all DISP-S1 products between START and END with the same acquisition (sensor) time as the earliest and latest CSLC's from LIST A. Aggregate the list of CSLCs mentioned within the metadata field "InputGranules" from all available DISP-S1 products in this window, call this list of CSLCs "LIST C"
Compare LIST B with LIST C and note any discrepancies:
- If LIST B has more CSLCs than LIST C, then we have incomplete DISP-S1 products
- If LIST C has more CSLCs than LIST B, then we used too many and the wrong CSLCs for processing

Above logic only checks whether the right DISP-S1 has been produced but does not check whether it was produced using all the correct CSLC input files. To perform the latter, we need to obtain the full metadata. This is not available from CMR to my knowledge. We can obtain in two ways:

The logic I mentioned in the above quote would tell us exactly which CSLCs we should have used. Am I missing something? How would we not know this?

This function can be used to determine those two: https://github.com/nasa/opera-sds-pcm/blob/develop/data_subscriber/cslc_utils.py#L312

Do you have a recommendation on how to import your code? I'm assuming we don't have published packages. Currently the auditing tools are within /report

philipjyoon commented 3 months ago

@riverma I did not realize that CMR query also returns InputGranules If that's the case, yes, what you've outlined would work.

You can use the code here as the general guideline to use cslc_util.py https://github.com/nasa/opera-sds-pcm/blob/develop/tests/data_subscriber/test_cslc_util.py You can import it by from data_subscriber import cslc_utils on a deployed system that would already have data_subscriber package installed. If you wish to install this package independently of deploying a cluster, we'd have to do a little bit of research (I think it's possible)

riverma commented 2 months ago

Next steps based on discussions:

[x] - Utilize opera-disp-s1-consistent-burst-ids-with-datetimes.json rather than opera-s1-disp-0.5.0-frame-to-burst.json since the former is a subset of the latter (just the CSLC products we need to match)
[ ] - Beyond a single orbit worth of validation is currently not supported. Is this a use we want to support? If so - we'll need to update the map_cslc_bursts_to_frames to include acquisition date
[ ] - The function def validate_disp_s1(smallest_date, greatest_date, endpoint, df): needs to be updated to reflect testing once ASF.DAAC has successfully ingested DISP-S1 products to UAT. That way we can test the tool in real-world conditions.
[ ] Refactor the codebase to utilize @philipjyoon's PCM DISP-S1 triggering logic utils

philipjyoon commented 1 month ago

This tool must also take into account blackout dates

philipjyoon commented 2 days ago

We've been told by ASF that the input CSLC granule list CANNOT be stored in CMR because it would break. So we will need to implement this logic without that information - as @philipjyoon described in a comment on Aug 15, 2024

nasa / opera-sds-pcm