sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

exponential backoff and increase read_timeout for S3 bucket #568

Closed mshannon-sil closed 3 weeks ago

mshannon-sil commented 1 month ago

Because connections to the S3 bucket were timing out, I've increased read_timeout from its default of 60 seconds to 600 seconds, since uploading a 1.3B NLLB checkpoint takes about 7 minutes. I've also changed the retry policy: instead of retrying 10 times with a fixed 5-second interval between attempts, it now uses exponential backoff with 10 attempts, where the interval ranges from 2^0 seconds to 2^9 seconds.

I tested this by queueing up 3 experiments at the same time, and they were all able to upload checkpoints without timing out. We'll still have to monitor experiments in the near future to verify that this has eliminated the issue.
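
For readers who want to see the shape of the retry policy described above, here is a minimal, standard-library-only sketch. It is not the code from this PR: the function name `retry_with_exponential_backoff`, its parameters, and the injectable `sleep` argument are all hypothetical, chosen for illustration. One assumption to note: the sketch only waits between attempts, so with 10 attempts it uses waits of 2^0 through 2^8 seconds; the actual change may index or apply its intervals differently.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_exponential_backoff(
    operation: Callable[[], T],
    max_attempts: int = 10,
    base_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is exhausted.

    After the n-th failed attempt (n counted from 0), wait
    base_seconds * 2**n seconds before trying again. The last failure
    is re-raised without waiting. All names here are hypothetical.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the original error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(base_seconds * 2 ** attempt)
    raise AssertionError("unreachable")  # the loop always returns or raises


if __name__ == "__main__":
    # Hypothetical stand-in for an upload that times out three times.
    calls: List[int] = []

    def flaky_upload() -> str:
        calls.append(1)
        if len(calls) <= 3:
            raise TimeoutError("simulated read timeout")
        return "uploaded"

    waits: List[float] = []
    # Record the waits instead of actually sleeping.
    result = retry_with_exponential_backoff(flaky_upload, sleep=waits.append)
    print(result, waits)
```

Passing a recording function as `sleep` keeps the example fast to run and easy to check. Compared with the previous fixed 5-second interval, doubling the wait gives the bucket progressively more time to recover when several experiments upload at once.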

