sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Increase robustness of S3 connection for uploading large files #565

Closed: mshannon-sil closed this issue 3 weeks ago

mshannon-sil commented 1 month ago

When many jobs are running on the AQuA server, uploads of model checkpoints to the S3 bucket start to fail. The following error has occurred quite frequently in the last few days: botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL:. After 10 failed attempts to connect, the experiment fails. The default read_timeout seems to be too short, so it should be increased, and retries should use an exponential backoff strategy rather than waiting a fixed number of seconds between attempts.
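As a rough illustration of the two changes proposed above, here is a minimal Python sketch. It is not the code in silnlp: the helper names (`backoff_delays`, `retry_with_backoff`, `make_s3_client`) and all numeric values (timeouts, delays, attempt counts) are assumptions chosen for illustration. The botocore part relies on `botocore.config.Config`, which accepts `read_timeout`, `connect_timeout`, and a `retries` dict; boto3 is imported lazily so the retry helper can be run without it.

```python
import random
import time


def backoff_delays(max_attempts, base_delay=1.0, max_delay=60.0):
    """Yield the capped exponential wait that follows each failed attempt.

    No delay is produced after the final attempt, so max_attempts attempts
    yield max_attempts - 1 delays: base, 2*base, 4*base, ... up to max_delay.
    """
    for attempt in range(max_attempts - 1):
        yield min(max_delay, base_delay * 2 ** attempt)


def retry_with_backoff(operation, max_attempts=10, base_delay=1.0, max_delay=60.0,
                       retry_on=(ConnectionError,), sleep=time.sleep, jitter=True):
    """Call operation() until it succeeds or max_attempts is reached.

    retry_on lists the exception types that trigger a retry; anything else
    propagates immediately. With jitter=True the actual wait is drawn
    uniformly from [0, delay] so that many jobs do not retry in lockstep.
    """
    delays = list(backoff_delays(max_attempts, base_delay, max_delay))
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = delays[attempt]
            if jitter:
                delay = random.uniform(0, delay)
            sleep(delay)


def make_s3_client():
    """Hypothetical helper: build an S3 client with a longer read timeout.

    The values below are illustrative assumptions, not tuned settings.
    """
    import boto3  # imported lazily; only needed when this helper is called
    from botocore.config import Config

    config = Config(
        read_timeout=300,     # seconds to wait for a response
        connect_timeout=60,   # seconds to wait for the connection
        retries={"max_attempts": 10, "mode": "standard"},
    )
    return boto3.client("s3", config=config)
```

A caller could wrap an upload as `retry_with_backoff(lambda: client.upload_file(path, bucket, key), retry_on=(ConnectionClosedError,))`, where `ConnectionClosedError` is imported from `botocore.exceptions` and `path`, `bucket`, and `key` are placeholders. Passing the exception type explicitly matters because `ConnectionClosedError` does not derive from Python's built-in `ConnectionError`.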