Closed: johnml1135 closed this issue 1 year ago
Hi John, That will be great. I'm more or less doing the same thing manually every time, but 3 tries is about the limit. It would be good for LTOps or someone to know how often this issue is occurring and whether there is a timeout limit that can be modified if necessary. All the best, David
On Thu, Jun 8, 2023 at 6:04 PM John Lambert @.***> wrote:
There should be at least 3, if not 10 auto-retries when this happens:
2023-06-08 16:31:00,036 - silnlp.common.environment - INFO - Uploading MT/experiments/FT-Ingush/NLLB_13_CHE_ING_3/val.trg.txt
2023-06-08 16:31:00,153 - silnlp.common.environment - INFO - Uploading MT/experiments/FT-Ingush/NLLB_13_CHE_ING_3/tokenizer.json
2023-06-08 12:31:13 Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 94, in _send_request
    rval = super()._send_request(
  File "/usr/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 218, in send
    return super().send(str)
  File "/usr/lib/python3.8/http/client.py", line 969, in send
    self.sock.sendall(datablock)
  File "/usr/lib/python3.8/ssl.py", line 1204, in sendall
    v = self.send(byte_view[count:])
  File "/usr/lib/python3.8/ssl.py", line 1173, in send
    return self._sslobj.write(data)
ConnectionResetError: [Errno 104] Connection reset by peer
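For illustration only, the kind of auto-retry requested above could look roughly like the sketch below. It is not the fix that was merged into silnlp. The helper name `retry_on_connection_error` and its parameters are hypothetical, and catching the built-in `ConnectionError` is an assumption: the client library in the traceback may wrap the low-level reset in its own exception type, in which case that type would need to be caught instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_connection_error(
    action: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying when the connection is reset or dropped.

    Hypothetical helper: `action` stands in for whatever performs one upload.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except ConnectionError:
            # ConnectionResetError (Errno 104) is a subclass of ConnectionError.
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A call such as `retry_on_connection_error(lambda: upload(path), max_attempts=10)` would cover the "at least 3, if not 10" range mentioned above, where `upload` is a placeholder for the real upload call. Logging each failed attempt inside the `except` branch would also give a count of how often the reset occurs.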
Another occurrence of this issue happened last night; link to the failed experiment here.
@johnml1135 - are we ready to merge this into master?
The fix is on the master branch.