sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Set up GCP autoscaling in clearml #391

Open mshannon-sil opened 1 month ago

mshannon-sil commented 1 month ago

We need the ability to continue running jobs when the server is unavailable or at full capacity. ClearML offers integrated GCP and AWS autoscaling. Since we did not find any available GPUs on AWS, we plan to set up GCP autoscaling.

mshannon-sil commented 1 month ago

GCP autoscaling is up and running. Jobs submitted to the autoscaler queue run on an A100 80GB GPU instance that the autoscaler spins up on GCP, and the instance is spun down when the job finishes. Currently, only one instance can run at a time in order to limit costs, but that cap can be raised in the future.
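For reference, the settings described above amount to a configuration along these lines. This is a purely illustrative sketch: the field names are hypothetical and do not follow the exact ClearML autoscaler schema, though `a2-ultragpu-1g` is the real GCP machine type for a single A100 80GB.

```yaml
# Illustrative only -- hypothetical field names, not the exact ClearML schema.
gcp:
  machine_type: a2-ultragpu-1g   # 1x A100 80GB on GCP
  preemptible: true              # current limitation: preemptible quota only
  max_instances: 1               # cap at one concurrent instance to bound cost
  queue: autoscaler              # jobs enqueued here trigger a spin-up
```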

The main limitation at the moment is that we're only able to use preemptible instances. This means a task can be interrupted at any time. There are two approaches we should pursue to address this:

  1. We should add the ability to resume a task if it's interrupted. The main obstacle is that model checkpoints aren't currently being saved, so we should look into saving them every 1000 or 2000 steps, potentially to GCP storage, so that checkpoints persist if a task is interrupted before training completes.
  2. We should reach out to GCP to request a higher quota for on-demand GPUs, which aren't preemptible. We would likely still prefer preemptible instances much of the time, since they come at a significant discount and preemption so far does not appear to occur very frequently. This means that even if the quota request succeeds, approach 1 would still be useful.
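The save-and-resume logic in approach 1 can be sketched roughly as follows. This is a minimal stand-alone illustration, not silnlp's actual training code: the directory name, step counts, and the pickled `state` dict are all placeholders (a real run would checkpoint model weights and optimizer state, e.g. to a mounted GCS bucket).

```python
import os
import pickle

CHECKPOINT_DIR = "checkpoints"  # placeholder; could be a mounted GCS bucket
SAVE_EVERY = 1000               # save a checkpoint every N steps
TOTAL_STEPS = 5000

def save_checkpoint(step, state):
    """Write a checkpoint with a zero-padded step so names sort by step."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, f"step_{step:07d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def latest_checkpoint():
    """Return the newest checkpoint dict, or None if none exist."""
    if not os.path.isdir(CHECKPOINT_DIR):
        return None
    names = sorted(n for n in os.listdir(CHECKPOINT_DIR) if n.endswith(".pkl"))
    if not names:
        return None
    with open(os.path.join(CHECKPOINT_DIR, names[-1]), "rb") as f:
        return pickle.load(f)

def train():
    """Run (or resume) training; returns the step it started from."""
    ckpt = latest_checkpoint()  # resume here after a preemption
    start = ckpt["step"] + 1 if ckpt else 0
    state = ckpt["state"] if ckpt else {"loss": None}
    for step in range(start, TOTAL_STEPS):
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        if (step + 1) % SAVE_EVERY == 0:
            save_checkpoint(step, state)
    return start
```

Calling `train()` a second time after a simulated interruption picks up from the last saved step rather than step 0, which is the behavior we want on a preempted instance.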