nasa / opera-sds-pcm

Observational Products for End-Users from Remote Sensing Analysis (OPERA)
Apache License 2.0
[Bug]: CSLC-S1 Granule IDs from query results in Cloud Watch log files are split between lines #854

Closed sjlewis-jpl closed 5 months ago

sjlewis-jpl commented 5 months ago

Checked for duplicates

No - I haven't checked

Describe the bug

When I went to CloudWatch to find the lists of CSLC-S1 granule IDs resulting from a query, I noticed that they had been split between lines, with individual IDs cut in the middle. This will make it very difficult to extract those results into any useful format. See the attached screenshot for an example. Given the sheer number of granules (62,462 in this example), it seems like a CloudWatch character limit is being reached.

Screenshot 2024-05-24 at 3 02 33 PM

What did you expect?

I expected to be able to easily get the CSLC-S1 granule IDs from the log file, for use in verifying the triggering logic. If a line is too long for CloudWatch, I would expect it to be broken up into multiple lines intelligently.

Environment

- Version of this software: v3.1.0-rc.1.0
- INT-POP1

philipjyoon commented 5 months ago

This issue isn't unique to DISP-S1 processing - this code is called for all CMR queries, which apply to all data types.

@sjlewis-jpl Please let me know which of the following two behaviors is desired, or whether both suffice equally:

  1. Write out each granule on its own line
  2. Write out as many granules as possible on one line without clipping a granule ID (or, more generally, without splitting a word)

Both are equally trivial to implement.

philipjyoon commented 5 months ago

FYI, it seems CloudWatch has a per-line log limit of about 250k characters, and each granule output string is about 100 chars. So we can fit a maximum of about 2,500 granules per line if option 2 above is preferred. (The actual number would be determined dynamically in code.)
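A minimal sketch of how the per-line count could be determined dynamically, assuming the roughly 250k-character budget quoted above. The names `MAX_LINE_CHARS`, `max_ids_per_line`, and `chunk_ids` are hypothetical and are not taken from the opera-sds-pcm code; the 100-character `overhead` reserved for a line prefix is also an assumption.

```python
from typing import Iterator, List

# Assumption: per-line character budget based on the ~250k figure quoted in
# this thread, not on the AWS documentation.
MAX_LINE_CHARS = 250_000


def max_ids_per_line(granule_ids: List[str],
                     limit: int = MAX_LINE_CHARS,
                     overhead: int = 100) -> int:
    """Return how many IDs fit on one log line, sized for the longest ID.

    `overhead` reserves room for a line prefix (hypothetical value).
    """
    if not granule_ids:
        return 1
    # Printed as a Python list, each ID costs its length plus two quotes
    # and a ", " separator, i.e. 4 extra characters.
    longest = max(len(g) for g in granule_ids) + 4
    return max(1, (limit - overhead) // longest)


def chunk_ids(granule_ids: List[str], per_line: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `per_line` IDs, never splitting an ID."""
    for start in range(0, len(granule_ids), per_line):
        yield granule_ids[start:start + per_line]
```

Sizing for the longest ID is conservative: a line may hold slightly fewer granules than the limit would allow, but no ID is ever cut in the middle.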

sjlewis-jpl commented 5 months ago

I definitely prefer option 2.

philipjyoon commented 5 months ago

@sjlewis-jpl how do these look? Granules are now grouped up to a maximum number per line; for the CSLC query this maximum is dynamically determined to be 2450. In the second pic, you can see that the end of the line is not being clipped, because the closing ] marks the end of the list.

This was tested by submitting a CSLC query job using Tosca with the parameters: --start-date=2023-12-15T08:17:50Z --chunk-size=2 --k=2 --m=1 --job-queue=opera-job_worker-cslc_data_download --processing-mode=forward --end-date=2024-01-15T08:35:59Z

image

image

sjlewis-jpl commented 5 months ago

I like it! You kept the "QUERY RESULTS" for grep-ability, and I really appreciate seeing "X to Y out of Z" in there too.

sjlewis-jpl commented 5 months ago

One question on the log entries: are the last and first entries in successive lines repeated? The index summary at the beginning suggests they might be, and in the unfolded example in the second pic, it looks like the last granule might also be the first granule in the following line (query result #2450). I think having each input granule listed exactly once, with indices that match, would be most useful. So, 0 to 2449, then 2450 to 4899, etc.

philipjyoon commented 5 months ago

Good catch. Fortunately, the underlying output is correct. The error is only in the "xx out of xx" message, because I naively used the for-loop parameters start and stop (start is inclusive, whereas stop is exclusive). I'll fix this ASAP.

image

philipjyoon commented 5 months ago

The text output has been fixed. It should have been printing 1-based indices instead of 0-based ones, so it will now read: 1-2450, 2451-4900, etc.
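For reference, a small sketch of the labeling fix, assuming the loop iterates over half-open `[start, stop)` slices as described earlier. The function name `format_query_result_lines` and the exact message layout are hypothetical; only the "QUERY RESULTS" prefix and the "X to Y out of Z" wording come from this thread.

```python
from typing import List


def format_query_result_lines(granule_ids: List[str], per_line: int) -> List[str]:
    """Build one log line per group, labeled with 1-based inclusive indices."""
    total = len(granule_ids)
    lines = []
    for start in range(0, total, per_line):
        stop = min(start + per_line, total)  # exclusive slice bound
        # Convert the half-open 0-based [start, stop) range into a
        # 1-based inclusive label: first = start + 1, last = stop.
        first, last = start + 1, stop
        lines.append(
            f"QUERY RESULTS {first} to {last} out of {total}: "
            f"{granule_ids[start:stop]}"
        )
    return lines
```

Because `stop` is exclusive and 0-based, it already equals the 1-based index of the last item in the slice, so only `start` needs the `+ 1` adjustment.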