When many jobs are running on the AQuA server, uploads of model checkpoints to the S3 bucket start to fail. The following error has occurred frequently over the last few days: `botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL:`. After 10 failed connection attempts, the experiment fails. The default `read_timeout` appears to be too short and should be increased. We should also retry with exponential backoff rather than waiting a fixed number of seconds between attempts.
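To make the backoff proposal concrete, here is a minimal sketch of an exponential-backoff retry wrapper with jitter. It is an illustration under assumptions, not the project's actual upload code: `upload_with_backoff`, `backoff_delays`, and their parameters are hypothetical names, and the default `retry_on` uses Python's built-in `ConnectionError` only as a stand-in so the snippet runs without boto3. In real use one would pass `botocore.exceptions.ConnectionClosedError` instead. Separately, the timeout itself can be raised via `botocore.config.Config`, which accepts `read_timeout`, `connect_timeout`, and `retries` options.

```python
import random
import time


def backoff_delays(max_attempts=10, base=1.0, cap=60.0):
    """Yield the upper bound (seconds) of the wait before each retry.

    The bound doubles each time (base, 2*base, 4*base, ...) and is capped at `cap`.
    There is one delay fewer than attempts, since no wait follows the last attempt.
    """
    for attempt in range(max_attempts - 1):
        yield min(cap, base * 2 ** attempt)


def upload_with_backoff(upload, max_attempts=10, base=1.0, cap=60.0,
                        retry_on=(ConnectionError,), sleep=time.sleep):
    """Call `upload()` until it succeeds or `max_attempts` is exhausted.

    `upload` is a hypothetical zero-argument callable wrapping the S3 upload.
    `retry_on` is a placeholder; for boto3 pass
    (botocore.exceptions.ConnectionClosedError,) instead.
    """
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Full jitter: pick a random wait in [0, bound] so that many
            # concurrent jobs do not all retry at the same moment.
            sleep(random.uniform(0, next(delays)))
```

The jitter is a design choice aimed at the situation described above: when many jobs hit the same endpoint, randomized waits spread the retries out instead of producing synchronized bursts.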