sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

S3 bucket fail - gracefully handle #167

Closed johnml1135 closed 1 year ago

johnml1135 commented 1 year ago

There should be at least 3, if not 10 auto-retries when this happens:

2023-06-08 16:31:00,036 - silnlp.common.environment - INFO - Uploading MT/experiments/FT-Ingush/NLLB_13_CHE_ING_3/val.trg.txt
2023-06-08 16:31:00,153 - silnlp.common.environment - INFO - Uploading MT/experiments/FT-Ingush/NLLB_13_CHE_ING_3/tokenizer.json
2023-06-08 12:31:13
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 94, in _send_request
    rval = super()._send_request(
  File "/usr/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 218, in send
    return super().send(str)
  File "/usr/lib/python3.8/http/client.py", line 969, in send
    self.sock.sendall(datablock)
  File "/usr/lib/python3.8/ssl.py", line 1204, in sendall
    v = self.send(byte_view[count:])
  File "/usr/lib/python3.8/ssl.py", line 1173, in send
    return self._sslobj.write(data)
ConnectionResetError: [Errno 104] Connection reset by peer
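
For illustration, the auto-retry being asked for could look roughly like the sketch below. It is not the code that was merged for this issue. `retry_with_backoff` and its parameters are hypothetical names, and the delays are arbitrary. It relies on the fact that `ConnectionResetError` is a subclass of Python's built-in `ConnectionError`.

```python
import logging
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
LOGGER = logging.getLogger(__name__)


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying when it raises one of the retry_on exceptions.

    Hypothetical helper: the name, defaults, and backoff schedule are assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on as err:
            if attempt == max_attempts:
                # Out of attempts: let the original exception propagate.
                raise
            # Exponential backoff (1x, 2x, 4x, ... base_delay) plus random jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            LOGGER.warning(
                "Attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, max_attempts, err, delay,
            )
            sleep(delay)
    # Only reached when max_attempts < 1, so the loop body never ran.
    raise ValueError("max_attempts must be at least 1")
```

An upload would then be wrapped as `retry_with_backoff(lambda: upload(path), max_attempts=10)`, where `upload` stands in for whatever function performs the S3 transfer. Logging each failed attempt also leaves a record of how often the reset occurs.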
davidbaines commented 1 year ago

Hi John,

That will be great. I'm more or less doing the same thing manually every time, but 3 tries is about the limit. It would be good for LTOps or someone to know how often this issue is occurring and whether there is a timeout limit that can be modified if necessary.

All the best,
David
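
On the timeout question: the traceback shows the upload passing through botocore, which exposes connect and read timeouts and a retry count through `botocore.config.Config`. The sketch below shows how those settings might be adjusted. The values are illustrative assumptions, not what silnlp uses, and nothing in this thread confirms that botocore's built-in retries would have caught this particular connection reset.

```python
from botocore.config import Config

# Illustrative values only; nothing in this thread says what silnlp configures.
s3_config = Config(
    connect_timeout=60,   # seconds to wait when opening a connection
    read_timeout=120,     # seconds to wait for a response on an open connection
    retries={
        "max_attempts": 10,  # retries allowed after the initial request
        "mode": "standard",
    },
)

# The config would then be passed when the client is created, for example:
#   import boto3
#   s3 = boto3.client("s3", config=s3_config)
```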

mmartin9684-sil commented 1 year ago

Another occurrence of this issue happened last night; link to the failed experiment here.

@johnml1135 - are we ready to merge this into master?

ddaspit commented 1 year ago

The fix is on the master branch.