iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0

How to set instance recreation times count (exceeded maximum number of attempts error on start up)? #1440

Closed OLSecret closed 3 months ago

OLSecret commented 6 months ago

Summary / Background

I want to run an 8xA100 instance on EC2. Because this instance type is in high demand, I don't expect it to become available within 3 retries, or even 100. I want the request to be retried as often as possible (every second) for X hours until the instance is ready (like a bot).

error example:

{"level":"info","message":"iterative_cml_runner.runner: Creating..."}
{"level":"info","message":"iterative_cml_runner.runner: Creation errored after 10s"}
{"level":"error","message":"terraform error: Error: Failed creating the machine: Not able to decode: operation error EC2: RunInstances, exceeded maximum number of attempts, 3, https response error StatusCode: 500, RequestID: 78dbfe11, api error InsufficientInstanceCapacity: We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c."}
{"level":"info","message":"::error::Terraform exited with code 1."}

Scope

I would like an option to set the number of retry attempts to X, or to retry indefinitely, when starting an instance. Is there a hidden option for this, or at least a way to set the retry count to 99999999?

0x2b3bfa0 commented 3 months ago

We don't provide any inbuilt mechanism to do that, but you can always retry at the shell level.

for attempt in {1..100}; do
  # Stop retrying as soon as the command succeeds.
  cml runner ... && break
  sleep 1
done
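
Since the question asks for retries over a span of hours rather than a fixed count, the same shell-level idea can be bounded by time instead. The following is only a sketch: `retry_until` is a hypothetical helper, not part of CML, and it assumes that `cml runner` exits with a non-zero status when instance creation fails (as the "Terraform exited with code 1" log line suggests).

```shell
#!/usr/bin/env bash
# retry_until: hypothetical helper, not provided by CML.
# Usage: retry_until <max_seconds> <interval_seconds> <command> [args...]
# Re-runs the command until it succeeds or max_seconds have elapsed.
retry_until() {
  # SECONDS is a bash built-in counting seconds since the shell started.
  local deadline=$(( SECONDS + $1 )) interval=$2
  shift 2
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "retry_until: giving up, time limit reached" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example (arguments elided as in the thread): retry every second for 6 hours.
# retry_until $(( 6 * 3600 )) 1 cml runner ...
```

The time limit is only checked between attempts, so a single long-running attempt can overrun it. A longer interval than one second may also be kinder to the EC2 API when capacity is unlikely to appear soon.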