Open mshannon-sil opened 1 month ago
We need the ability to continue running jobs when the server is unavailable or at full capacity. ClearML offers integrated GCP and AWS autoscaling. Since we didn't find AWS to have any GPUs available, we're planning to set up GCP autoscaling.

Update: GCP autoscaling is up and running. Jobs submitted to the autoscaler queue run on an A100 80GB GPU that the autoscaler spins up on GCP, and the instance is spun down when the job finishes. For now the autoscaler is capped at one instance at a time to limit costs, but that cap can be raised in the future.

The main limitation at the moment is that we're only able to use preemptible instances, which means a task can be interrupted at any time. There are two approaches we should pursue to address this:
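For reference, enqueuing work onto a ClearML queue that an autoscaler watches can be done with the `clearml-task` CLI. This is a usage sketch, not our exact invocation: the project name, script, and queue name ("gcp_autoscaler") are placeholders, and the queue must be whichever one the GCP autoscaler is configured to monitor.

```shell
# Enqueue an existing training script onto the autoscaler-managed queue.
# All names below are placeholders for illustration.
clearml-task \
  --project my_project \
  --name train-run \
  --script train.py \
  --queue gcp_autoscaler
```

Once the task lands in the queue, the autoscaler spins up an instance, the ClearML agent on that instance pulls and runs the task, and the instance is torn down when the queue is empty.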
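Whatever concrete approaches we settle on, tolerating preemption generally comes down to jobs persisting resumable state so a restarted run picks up where the interrupted one left off. A minimal sketch in plain Python (the function and checkpoint file names are hypothetical, not part of our setup):

```python
import json
import os
import tempfile

def run_with_checkpoints(total_steps, ckpt_path, work_fn):
    """Run work_fn for each step, resuming from the last checkpoint if one exists."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["step"]
    for step in range(start, total_steps):
        work_fn(step)
        # Write the checkpoint atomically so a preemption mid-write
        # cannot leave a corrupt file behind.
        tmp = ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"step": step + 1}, f)
        os.replace(tmp, ckpt_path)
    return start  # the step we resumed from, for inspection

# Simulate a job preempted after step 3, then restarted on a fresh instance.
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
done = []
try:
    def flaky(step):
        if step == 3:
            raise KeyboardInterrupt("preempted")  # simulated preemption
        done.append(step)
    run_with_checkpoints(10, path, flaky)
except KeyboardInterrupt:
    pass

resumed_from = run_with_checkpoints(10, path, done.append)
print(resumed_from)  # → 3
print(done)          # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The same pattern applies to real training jobs by swapping the JSON step counter for model and optimizer state; the key property is that the checkpoint is written after each completed unit of work and replaced atomically.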