nasa / opera-sds-pcm

Observational Products for End-Users from Remote Sensing Analysis (OPERA)
Apache License 2.0

[Bug]: Orbit File Script does not retry failed HTTP requests properly #796

Closed collinss-jpl closed 7 months ago

collinss-jpl commented 7 months ago

Checked for duplicates

Yes - I've already checked

Describe the bug

All functions that make HTTP requests in stage_orbit_file.py use the backoff decorator to automatically retry requests after an intermittent failure (such as request throttling by the orbit file server). The function that tests whether an error is unrecoverable is not implemented correctly, so retries do not occur as expected for certain HTTP error codes.

What did you expect?

When an HTTP request fails with error code 401, 429, 500, 503, or 504, the backoff decorator should retry the request every 15 seconds until the request succeeds or 5 minutes have elapsed.
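To make the expected behavior concrete, here is a minimal, standard-library-only sketch of a "give up" predicate and a constant-interval retry loop. It is not the code in stage_orbit_file.py: `HTTPStatusError`, `is_fatal`, and `retry_constant` are hypothetical names standing in for `requests.exceptions.HTTPError`, the script's unrecoverable-error check, and the backoff decorator (whose `giveup` argument takes a predicate of this shape). The deadline handling is simplified compared to the real library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Status codes this issue lists as worth retrying.
RETRYABLE_STATUS_CODES = frozenset({401, 429, 500, 503, 504})


class HTTPStatusError(Exception):
    """Hypothetical stand-in for requests.exceptions.HTTPError."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_fatal(err: HTTPStatusError) -> bool:
    """Return True when the error is unrecoverable and must NOT be retried."""
    return err.status_code not in RETRYABLE_STATUS_CODES


def retry_constant(
    func: Callable[[], T],
    giveup: Callable[[HTTPStatusError], bool],
    interval: float = 15.0,   # seconds between attempts
    max_time: float = 300.0,  # stop retrying after 5 minutes
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, a fatal error occurs, or time runs out."""
    start = time.monotonic()
    while True:
        try:
            return func()
        except HTTPStatusError as err:
            elapsed = time.monotonic() - start
            # Re-raise on fatal errors, or when another wait would pass the deadline.
            if giveup(err) or elapsed + interval > max_time:
                raise
            sleep(interval)
```

With a predicate like this, a 429 response leads to another attempt after 15 seconds, while a 404 is raised immediately. A predicate that returns True for 429 would stop after the first attempt.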

philipjyoon commented 7 months ago

This is the error we saw:

[2024-04-08 22:56:23,534: ERROR/_log_giveup] Giving up download_orbit_file(...) after 1 tries (requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://zipper.dataspace.copernicus.eu/odata/v1/Products(a764fd8f-7012-4d8f-9643-298ce806ff8b)/$value)
[2024-04-08 22:56:23,534: INFO/main] Requesting deletion of open authentication session
Traceback (most recent call last):
  File "/home/ops/verdi/ops/opera-pcm/data_subscriber/daac_data_subscriber.py", line 321, in <module>
    asyncio.run(run(sys.argv))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/ops/verdi/ops/opera-pcm/data_subscriber/daac_data_subscriber.py", line 88, in run
    results["download"] = run_download(args, token, es_conn, netloc, username, password, job_id) # return None
  File "/home/ops/verdi/ops/opera-pcm/data_subscriber/download.py", line 53, in run_download
    download.run_download(args, token, es_conn, netloc, username, password, job_id)
  File "/home/ops/verdi/ops/opera-pcm/data_subscriber/download.py", line 121, in run_download
    self.perform_download(session, es_conn, downloads, args, token, job_id)
  File "/home/ops/verdi/ops/opera-pcm/data_subscriber/asf_download.py", line 97, in perform_download
    self.download_orbit_file(new_dataset_dir, product_filepath, additional_metadata)
  File "/home/ops/verdi/ops/opera-pcm/data_subscriber/asf_download.py", line 136, in download_orbit_file
    stage_orbit_file.main(stage_orbit_file_args)
  File "/home/ops/verdi/ops/opera-pcm/tools/stage_orbit_file.py", line 654, in main
    output_orbit_file_path = download_orbit_file(
  File "/home/ops/verdi/lib/python3.9/site-packages/backoff/_sync.py", line 105, in retry
    ret = target(*args, **kwargs)
  File "/home/ops/verdi/ops/opera-pcm/tools/stage_orbit_file.py", line 577, in download_orbit_file
    response.raise_for_status()
  File "/opt/conda/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://zipper.dataspace.copernicus.eu/odata/v1/Products(a764fd8f-7012-4d8f-9643-298ce806ff8b)/$value

philipjyoon commented 7 months ago

We saw this error during I&T when running historical processing at a 20x+ rate using batch_proc with the following configuration:

{
  "enabled": true,
  "label": "Test_04_08_2024",
  "processing_mode": "historical",
  "include_regions": "cslc-s1_priority_framebased,pse_vnv_request_for_cslc_hist_2024-04-08",
  "exclude_regions": "",
  "temporal": true,
  "data_start_date": "2016-01-01T00:00:00",
  "data_end_date": "2017-01-01T00:00:00",
  "last_attempted_proc_data_date": "1900-01-00T01:00:00",
  "last_successful_proc_data_date": "1900-01-01T01:00:00",
  "last_run_date": "1900-01-01T12:57:01",
  "data_date_incr_mins": 2400,
  "run_interval_mins": 2,
  "job_type": "slcs1a_query",
  "collection_short_name": "SENTINEL-1A_SLC",
  "provider_name": "ASF",
  "job_queue": "opera-job_worker-slc_data_query_hist",
  "download_job_queue": "opera-job_worker-slc_data_download_hist",
  "chunk_size": 1
}