sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Set up GCP autoscaling in clearml #391

Open mshannon-sil opened 1 month ago

mshannon-sil commented 1 month ago

We need the ability to continue running jobs when the server is unavailable or at full capacity. ClearML offers integrated GCP and AWS autoscaling. Since we did not find any available GPUs on AWS, we plan to set up GCP autoscaling.

mshannon-sil commented 1 month ago

GCP autoscaling is up and running. Jobs submitted to the autoscaler queue run on an A100 80GB GPU instance that the autoscaler spins up on GCP, and the instance is spun down when the job finishes. Currently, only one instance can run at a time in order to limit costs, but that cap can be raised in the future.
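For reference, the settings described above amount to a configuration along these lines. This is a purely illustrative sketch: the field names are hypothetical and do not follow the exact ClearML autoscaler schema, though `a2-ultragpu-1g` is the real GCP machine type for a single A100 80GB.

```yaml
# Illustrative only -- hypothetical field names, not the exact ClearML schema.
gcp:
  machine_type: a2-ultragpu-1g   # 1x A100 80GB on GCP
  preemptible: true              # current limitation: preemptible quota only
  max_instances: 1               # cap at one concurrent instance to bound cost
  queue: autoscaler              # jobs enqueued here trigger a spin-up
```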

The main limitation at the moment is that we're only able to use preemptible instances. This means a task can be interrupted at any time. There are two approaches we should pursue to address this:

  1. We should add the ability to resume a task if it's interrupted. The main obstacle is that model checkpoints aren't currently being saved, so we should look into saving them every 1000 or 2000 steps, potentially to GCP storage, so that checkpoints persist if a task is interrupted before training completes.
  2. We should reach out to GCP to request a higher quota for on-demand GPUs, which aren't preemptible. We would likely still prefer preemptible instances much of the time, since they come at a significant discount and preemption so far does not appear to occur very frequently. This means that even if the quota request succeeds, approach 1 would still be useful.
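The save-and-resume logic in approach 1 can be sketched roughly as follows. This is a minimal stand-alone illustration, not silnlp's actual training code: the directory name, step counts, and the pickled `state` dict are all placeholders (a real run would checkpoint model weights and optimizer state, e.g. to a mounted GCS bucket).

```python
import os
import pickle

CHECKPOINT_DIR = "checkpoints"  # placeholder; could be a mounted GCS bucket
SAVE_EVERY = 1000               # save a checkpoint every N steps
TOTAL_STEPS = 5000

def save_checkpoint(step, state):
    """Write a checkpoint with a zero-padded step so names sort by step."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, f"step_{step:07d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def latest_checkpoint():
    """Return the newest checkpoint dict, or None if none exist."""
    if not os.path.isdir(CHECKPOINT_DIR):
        return None
    names = sorted(n for n in os.listdir(CHECKPOINT_DIR) if n.endswith(".pkl"))
    if not names:
        return None
    with open(os.path.join(CHECKPOINT_DIR, names[-1]), "rb") as f:
        return pickle.load(f)

def train():
    """Run (or resume) training; returns the step it started from."""
    ckpt = latest_checkpoint()  # resume here after a preemption
    start = ckpt["step"] + 1 if ckpt else 0
    state = ckpt["state"] if ckpt else {"loss": None}
    for step in range(start, TOTAL_STEPS):
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        if (step + 1) % SAVE_EVERY == 0:
            save_checkpoint(step, state)
    return start
```

Calling `train()` a second time after a simulated interruption picks up from the last saved step rather than step 0, which is the behavior we want on a preempted instance.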