nasa / opera-sds-pcm

Observational Products for End-Users from Remote Sensing Analysis (OPERA)
Apache License 2.0

[Bug]: SoftTimeLimitExceeded & MaxRetryError during load test 0.5X #148

Closed hhlee445 closed 1 year ago

hhlee445 commented 2 years ago

Checked for duplicates

Yes - I've already checked

Describe the bug

During the 0.5x load test on the INT cluster, some jobs failed with SoftTimeLimitExceeded and MaxRetryError.

SoftTimeLimitExceeded

tags: trigger-SCIFLO_L3_DSWx_HLS_S30
status: job-failed
resource: job
index: job_status-current
ID: ec65aeee-ec5c-4287-809d-b2ad4f34d245
payload_id: ec65aeee-ec5c-4287-809d-b2ad4f34d245
timestamp: 2022-07-12T23:01:56.136Z
job: SCIFLO_L3_DSWx_HLS__1.0.0-rc.1.0-HLS.S30.T37SFU.2022022T081241.v2.0_state_config-20220712T215330.59494Z
node: 100.104.40.97
queue: opera-job_worker-sciflo-l3_dswx_hls
time queued: 2022-07-12T21:53:30.059511Z | start: 2022-07-12T22:01:07.629814Z | end: 2022-07-12T23:01:12.463573Z
duration: 3604.833759s

Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds-1.1.5/hysds/job_worker.py", line 1193, in run_job
    monitoredRunner.join()
  File "/home/ops/verdi/lib/python3.9/site-packages/billiard/process.py", line 148, in join
    res = self._popen.wait(timeout)
  File "/home/ops/verdi/lib/python3.9/site-packages/billiard/popen_fork.py", line 57, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/ops/verdi/lib/python3.9/site-packages/billiard/popen_fork.py", line 33, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/ops/verdi/lib/python3.9/site-packages/billiard/pool.py", line 229, in soft_timeout_sighandler
    raise SoftTimeLimitExceeded()
billiard.exceptions.SoftTimeLimitExceeded: SoftTimeLimitExceeded()

http://opera-dev-triage-fwd-pyoon.s3-website-us-west-2.amazonaws.com/triaged_job-SCIFLO_L3_DSWx_HLS__1.0.0-rc.1.0-HLS.S30.T37SFU.2022022T081241.v2.0_state_config-20220712T215330.59494Z_task-6b51fc23-98d5-4bab-aecd-bd50e2a8a558
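The timestamps in the job metadata above can be checked against the reported duration. The sketch below recomputes the queue wait and run time from the three timestamps. The 3600-second soft time limit is an assumption inferred from the run ending just past one hour; the issue does not state the configured value.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp that ends in 'Z' (UTC)."""
    # datetime.fromisoformat() on Python 3.9 does not accept a trailing 'Z',
    # so swap it for an explicit UTC offset.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


# Values copied from the failed job's metadata.
queued = parse_ts("2022-07-12T21:53:30.059511Z")
started = parse_ts("2022-07-12T22:01:07.629814Z")
ended = parse_ts("2022-07-12T23:01:12.463573Z")

# ASSUMPTION: the soft time limit is one hour; not confirmed by the issue.
ASSUMED_SOFT_LIMIT_S = 3600

queue_wait_s = (started - queued).total_seconds()
run_s = (ended - started).total_seconds()
overrun_s = run_s - ASSUMED_SOFT_LIMIT_S

print(f"queue wait: {queue_wait_s:.3f}s")
print(f"run time:   {run_s:.3f}s")
print(f"overrun past assumed limit: {overrun_s:.3f}s")
```

If the assumption holds, the job was not slow to start; it simply ran up to the limit and was interrupted a few seconds later.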

MaxRetryError

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/opt/conda/lib/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/opt/conda/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
    conn.connect()
  File "/opt/conda/lib/python3.9/site-packages/urllib3/connection.py", line 358, in connect
    conn = self._new_conn()
  File "/opt/conda/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f3a899a85b0>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/opt/conda/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='data.lpdaac.earthdatacloud.nasa.gov', port=443): Max retries exceeded with url: /s3credentials (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f3a899a85b0>: Failed to establish a new connection: [Errno 110] Connection timed out'))
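The chain above ends in a connection timeout to the /s3credentials endpoint, which is the kind of transient failure that a bounded retry with backoff can absorb. The following is a minimal, standard-library sketch of that pattern, not the PCM or HySDS implementation. The helper name `call_with_backoff` and its parameters are hypothetical, and the default exception types are Python built-ins; with `requests` one would pass `requests.exceptions.ConnectionError` in `retry_on` instead.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 5,
    base_delay_s: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on a retryable error, wait and try again.

    Hypothetical helper: makes at most max_retries + 1 attempts and
    doubles the delay after each failure (base, 2*base, 4*base, ...).
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("max_retries must be >= 0")


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_fetch() -> str:
        # Stand-in for a credentials request; fails twice, then succeeds.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise TimeoutError(110, "Connection timed out")
        return "credentials"

    print(call_with_backoff(flaky_fetch, base_delay_s=0.01))
```

Passing `sleep` as a parameter keeps the helper testable without real delays.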

What did you expect?

SoftTimeLimitExceeded and MaxRetryError are common job failures, and jobs that fail with them should be retried automatically.

Number of retries for SoftTimeLimitExceeded: 2

Number of retries for MaxRetryError: 5

The following document describes the generic trigger rules for handling common failed jobs:

https://hysds-core.atlassian.net/wiki/spaces/HYS/pages/199885482/Generic+Trigger+Rules+for+Mozart+failed+jobs
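The expected behavior amounts to a per-error retry budget. The sketch below illustrates that decision in plain Python. It is not the HySDS trigger-rule format described in the linked document; the names `RETRY_LIMITS`, `classify_error`, and `should_retry` are hypothetical, and only the two limits (2 and 5) come from this issue.

```python
from typing import Dict, Optional

# Retry budgets requested in this issue, keyed by exception class name.
RETRY_LIMITS: Dict[str, int] = {
    "SoftTimeLimitExceeded": 2,
    "MaxRetryError": 5,
}


def classify_error(traceback_text: str) -> Optional[str]:
    """Return the retryable exception name from a traceback, if any.

    Hypothetical helper: reads the final non-empty line, e.g.
    'urllib3.exceptions.MaxRetryError: ...', and keeps the class name.
    """
    lines = [ln for ln in traceback_text.splitlines() if ln.strip()]
    if not lines:
        return None
    qualified_name = lines[-1].split(":", 1)[0].strip()
    class_name = qualified_name.rsplit(".", 1)[-1]
    return class_name if class_name in RETRY_LIMITS else None


def should_retry(traceback_text: str, retry_count: int) -> bool:
    """True if the job failed with a known error and has budget left."""
    name = classify_error(traceback_text)
    return name is not None and retry_count < RETRY_LIMITS[name]


if __name__ == "__main__":
    tb = (
        "Traceback (most recent call last):\n"
        "  ...\n"
        "billiard.exceptions.SoftTimeLimitExceeded: SoftTimeLimitExceeded()"
    )
    print(should_retry(tb, retry_count=0))
    print(should_retry(tb, retry_count=2))
```

Keying on the final exception line means a MaxRetryError caused by an underlying TimeoutError is still classified as MaxRetryError, which matches how the failures are named in this report.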

Reproducible steps

No response

Environment

pcm_branch    = "1.0.0-rc.1.0"
hysds_release = "v4.0.1-beta.8-oraclelinux"
INT cluster

hhlee445 commented 1 year ago

We consolidated download/extract/ingest steps into one job.