`cml runner`: Request spot instances from requirements

courentin commented 2 years ago

What?

Would it be possible to add the ability to request spot instances from a list of requirements rather than an instance type or a GPU type?

For example, I would like to tell cml runner that I want the lowest-priced instance that:

has 2 NVIDIA GPUs
has at least 8 GB of RAM
is the latest instance generation
is in any availability zone
etc.

(more context: discord#cml/1000042237830373406)

Why?

Spot instances are not available 100% of the time and, as explained in the AWS best practices guide, the fewer constraints we impose, the better our chances of getting a spot instance request fulfilled.

Possible solutions

I think we have multiple ways of implementing it.

The first, low-cost solution would be to allow multiple values for the --cloud-type option:

cml runner \
  --cloud-spot \
  --cloud-type=g3.4xlarge,g4dn.xlarge,g5.8xlarge

The conversion from requirements to instance types would need to be done beforehand by the user. But after all, instance types don't change often.
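That beforehand conversion could be scripted. A minimal sketch, assuming AWS and boto3: the EC2 `GetInstanceTypesFromInstanceRequirements` call returns the instance types that match a requirements object, and the result can be joined into the comma-separated value proposed above. The helper names `resolve_instance_types` and `cloud_type_argument` are made up for illustration, and a multi-value --cloud-type is only the proposal from this issue, not an existing option.

```python
def resolve_instance_types(client, requirements,
                           architectures=("x86_64",),
                           virtualization=("hvm",)):
    """Return the names of all instance types matching `requirements`.

    `client` is expected to behave like boto3.client("ec2"); it is passed
    in so the function can be exercised without AWS credentials.
    """
    kwargs = {
        "ArchitectureTypes": list(architectures),
        "VirtualizationTypes": list(virtualization),
        "InstanceRequirements": requirements,
    }
    names = []
    while True:
        page = client.get_instance_types_from_instance_requirements(**kwargs)
        names.extend(item["InstanceType"] for item in page.get("InstanceTypes", []))
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page of results.
        kwargs["NextToken"] = token


def cloud_type_argument(instance_types):
    """Format the names as the (proposed) multi-value --cloud-type flag."""
    return "--cloud-type=" + ",".join(instance_types)
```

With real credentials this might be called as `resolve_instance_types(boto3.client("ec2"), {"VCpuCount": {"Min": 1}, "MemoryMiB": {"Min": 8192}, "AcceleratorCount": {"Min": 2}, "AcceleratorManufacturers": ["nvidia"]})`; the EC2 API requires `VCpuCount` and `MemoryMiB` to be present in the requirements object.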

The second solution would be to implement all the requirement logic in cml runner itself. I'm not sure what the API should look like, but something like this could be useful:

cml runner \
  --cloud-spot \
  --cloud-spot-requirement="AcceleratorCount>=1" \
  --cloud-spot-requirement="AcceleratorManufacturers=NVIDIA" \
  ...
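To make the idea concrete, here is a rough sketch of how such requirement expressions could be parsed into a structured form. The grammar is an assumption: only `>=`, `<=` and `=` are handled, `=` values are treated as comma-separated lists and lowercased, and the function name `parse_requirements` is hypothetical. None of this exists in cml.

```python
import re

# key, operator, value; e.g. "AcceleratorCount>=1"
_EXPRESSION = re.compile(r"^\s*(\w+)\s*(>=|<=|=)\s*(.+?)\s*$")


def parse_requirements(expressions):
    """Turn ["Key>=1", "Other=a,b"] into {"Key": {"Min": 1}, "Other": ["a", "b"]}."""
    result = {}
    for expression in expressions:
        match = _EXPRESSION.match(expression)
        if match is None:
            raise ValueError(f"cannot parse requirement: {expression!r}")
        key, operator, value = match.groups()
        if operator == "=":
            # Equality: collect the allowed values as a list.
            bucket = result.setdefault(key, [])
            if not isinstance(bucket, list):
                raise ValueError(f"{key}: cannot mix '=' with a range")
            bucket.extend(item.strip().lower() for item in value.split(","))
        else:
            # Inequality: record an integer lower or upper bound.
            bucket = result.setdefault(key, {})
            if not isinstance(bucket, dict):
                raise ValueError(f"{key}: cannot mix a range with '='")
            bucket["Min" if operator == ">=" else "Max"] = int(value)
    return result
```

Applied to the two flags above, this produces the same shape as a JSON requirements object with `Min`/`Max` ranges and value lists, so both input styles could share one internal representation.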

The third solution (basically the second one, but probably easier to implement) would be to read the requirements from a JSON file:

{
  "AcceleratorCount": {
    "Min": 1
  },
  "AcceleratorManufacturers": [
    "nvidia"
  ]
}

cml runner \
  --cloud-spot \
  --cloud-spot-json-requirements=path_to_requirements.json \
  ...
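Reading such a file is straightforward. The sketch below loads it and applies a light sanity check before anything is sent to a cloud API. It assumes every object-valued entry is a `Min`/`Max` range, as in the example file; the function name `load_requirements` is hypothetical and not part of cml.

```python
import json


def load_requirements(path):
    """Load a requirements JSON file and check its range entries."""
    with open(path, encoding="utf-8") as handle:
        requirements = json.load(handle)
    if not isinstance(requirements, dict):
        raise ValueError("requirements file must contain a JSON object")
    for key, value in requirements.items():
        if not isinstance(value, dict):
            continue  # lists and scalars are passed through unchanged
        unknown = set(value) - {"Min", "Max"}
        if unknown:
            raise ValueError(f"{key}: unexpected keys {sorted(unknown)}")
        if "Min" in value and "Max" in value and value["Min"] > value["Max"]:
            raise ValueError(f"{key}: Min is greater than Max")
    return requirements
```

Failing early on a malformed file would give a clearer error than whatever the provider returns for an invalid request.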

0x2b3bfa0 commented 2 years ago

@courentin what are your thoughts on providing a list to --cloud-type so that, when --cloud-spot is active, cml runner tries the instance types sequentially and uses the first one that is immediately available? (I haven't researched whether all the providers have some form of requirements spec API like the one @0x2b3bfa0 linked for AWS.)
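A rough sketch of that sequential fallback, independent of any provider: `try_launch` stands in for whatever actually requests the instance, and `CapacityError` is a hypothetical exception it would raise when the spot request cannot be fulfilled right away. Neither name comes from cml or the Terraform provider.

```python
class CapacityError(Exception):
    """Hypothetical error: the requested instance type is not available now."""


def first_available(instance_types, try_launch):
    """Try each instance type in order; return (type, launch result) for the first success."""
    failures = {}
    for instance_type in instance_types:
        try:
            return instance_type, try_launch(instance_type)
        except CapacityError as error:
            # Remember why this type failed and move on to the next one.
            failures[instance_type] = str(error)
    raise CapacityError(f"no capacity for any requested type: {failures}")
```

The order of the list doubles as a preference order, so users could put the cheapest or most suitable type first.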

courentin commented 2 years ago

@dacbd it would be very useful

omesser commented 2 years ago

Thanks for raising this, @courentin. I think this is very important for viable spot and even on-demand GPU instance allocation in the "wild". My thoughts about the implementation/UX options:

Option 1 looks like a nice stopgap solution, but it puts the burden of researching the instance types on the user.
Option 2 is the primary way to go imo.
Option 3 would be a nice additional input, but not a replacement for straightforward options covering the useful dimensions: CPU/memory/GPU/GPU-memory ranges (min/max).
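As an illustration of what min/max ranges over those dimensions could look like internally, here is a small sketch that filters a caller-supplied list of offers and picks the cheapest match. The `Range` and `Offer` types, the field names and the `cheapest_match` function are all assumptions for the example; real specs and prices would have to come from the provider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def contains(self, value):
        """True when value respects whichever bounds are set."""
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


@dataclass(frozen=True)
class Offer:
    name: str
    cpus: int
    memory_gb: float
    gpus: int
    gpu_memory_gb: float
    hourly_price: float


def cheapest_match(offers, cpus=Range(), memory_gb=Range(),
                   gpus=Range(), gpu_memory_gb=Range()):
    """Return the lowest-priced offer inside every range, or None."""
    matching = [
        offer for offer in offers
        if cpus.contains(offer.cpus)
        and memory_gb.contains(offer.memory_gb)
        and gpus.contains(offer.gpus)
        and gpu_memory_gb.contains(offer.gpu_memory_gb)
    ]
    return min(matching, key=lambda offer: offer.hourly_price, default=None)
```

For the request in the original post this would be called with something like `gpus=Range(minimum=2)` and `memory_gb=Range(minimum=8)`; unset ranges match everything.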

0x2b3bfa0 commented 2 years ago

Option 1 is rather simple to implement but, indeed, makes users responsible for figuring out instance types, which is not ideal.
Option 2 is related to https://github.com/iterative/terraform-provider-iterative/issues/158#issuecomment-965625347 and would be handy on every cloud, albeit not easily portable
Option 3 sounds like a nested field in a hypothetical cml.yaml (or toml or xlsx for that matter), in addition to option 2 as @omesser said
