iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

Choose AWS Availability Zone that has the specified instance available #668

Closed tadejsv closed 2 years ago

tadejsv commented 2 years ago

I was trying to create a CML runner, but it failed because

Error: Failed creating the machine: Not able to decode: operation error EC2: RunInstances,..., api error 
InsufficientInstanceCapacity: We currently do not have sufficient g4dn.2xlarge capacity in the Availability Zone you requested (eu-central-1b). 
Our system will be working on provisioning additional capacity. 
You can currently get g4dn.2xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1a, eu-central-1c."}

Would be great if TPI could:

If request is done using spot fleet, you can probably just create launch templates for all AZs in the region.
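
The error message above already names the zones that still have capacity. As a rough sketch of how that hint could be read programmatically: the helper name `suggestedZones` and the regular expression are illustrative assumptions, not TPI code, and the wording of the AWS message is not a stable interface.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// zonePattern matches names shaped like eu-central-1a (a region plus one letter).
// Assumption: only ordinary availability zones are covered, not Local or Wavelength Zones.
var zonePattern = regexp.MustCompile(`\b[a-z]{2}(?:-[a-z]+)+-\d+[a-z]\b`)

// suggestedZones is a hypothetical helper. It returns the zones listed after
// "or choosing" in an InsufficientInstanceCapacity message, or nil if none are found.
func suggestedZones(message string) []string {
	_, tail, found := strings.Cut(message, "or choosing")
	if !found {
		return nil
	}
	return zonePattern.FindAllString(tail, -1)
}

func main() {
	msg := "We currently do not have sufficient g4dn.2xlarge capacity in the " +
		"Availability Zone you requested (eu-central-1b). You can currently get " +
		"g4dn.2xlarge capacity by not specifying an Availability Zone in your " +
		"request or choosing eu-central-1a, eu-central-1c."
	fmt.Println(suggestedZones(msg))
}
```

Parsing a human-readable message is fragile; it is shown here only to make the request concrete.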

0x2b3bfa0 commented 2 years ago

You can use e.g. cml runner ... --cloud-region=eu-central-1a to manually select a different availability zone

0x2b3bfa0 commented 2 years ago

If request is done using spot fleet

It's not 😅

https://github.com/iterative/terraform-provider-iterative/blob/a6d5d526460a57bd5ba3e7934fdf81f99cefc1c3/iterative/aws/provider.go#L292

tadejsv commented 2 years ago

You can use e.g. cml runner ... --cloud-region=eu-central-1a to manually select a different availability zone

The problem with this is:

  1. Currently it's not documented - and would it work? There is a difference between region and availability zone.
  2. This requires me to do manually what can be done automatically (check which AZs have my instance), possibly through trial and error (bc nobody goes first to check available instances) and waste precious development time :)

tadejsv commented 2 years ago

Plus, then every time before I run CI I would need to check AZs, and change the parameter in ci.yaml or wherever - not a very pleasant workflow I think

0x2b3bfa0 commented 2 years ago

Currently it's not documented [...]

It's not 🙈

[...] and would it work? There is a difference between region and availability zone.

Yes, it does: implemented with https://github.com/iterative/terraform-provider-iterative/pull/323, closed https://github.com/iterative/cml/issues/459.
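
An AWS availability zone name is the region name plus a one-letter suffix, which is why a single value can carry both. A minimal sketch of that split, assuming the usual naming scheme; `splitZone` is a hypothetical name and this is not the code from the pull request linked above.

```go
package main

import (
	"fmt"
	"regexp"
)

// regionWithZone captures the region part of a value that ends in a zone letter.
var regionWithZone = regexp.MustCompile(`^([a-z]{2}(?:-[a-z]+)+-\d+)[a-z]$`)

// splitZone is a hypothetical helper. Given "eu-central-1a" it returns the region
// "eu-central-1" and the zone "eu-central-1a"; given a plain region it returns
// the value unchanged and an empty zone.
func splitZone(value string) (region, zone string) {
	if m := regionWithZone.FindStringSubmatch(value); m != nil {
		return m[1], value
	}
	return value, ""
}

func main() {
	for _, value := range []string{"eu-central-1a", "eu-central-1"} {
		region, zone := splitZone(value)
		fmt.Printf("input=%s region=%s zone=%q\n", value, region, zone)
	}
}
```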

0x2b3bfa0 commented 2 years ago

This requires me to do manually what can be done automatically (check which AZs have my instance), possibly through trial and error (bc nobody goes first to check available instances) and waste precious development time [...] Plus, then every time before I run CI I would need to check AZs, and change the parameter in ci.yaml or wherever - not a very pleasant workflow I think

That's a good point, indeed. If you have to switch availability zones often, it would make sense to automate the process. For example, the iterative_task resource already does this, but it's a different beast based on Auto Scaling groups.

0x2b3bfa0 commented 2 years ago

@tadejsv, happy with #670? 😄

tadejsv commented 2 years ago

@0x2b3bfa0 Wow, that was fast! Looks awesome!

tadejsv commented 2 years ago

Thanks for addressing this issue so fast @0x2b3bfa0 ! I have one question here: When can I expect this change to make it to CML :) ? I am using it through iterative/setup-cml@v1 on GitHub

dacbd commented 2 years ago

@tadejsv as soon as the terraform registry has the latest tpi published you should be able to run the same workflow with the new logic.

cml does not pin the tpi version for runner; currently, if you wanted to pin cml to a specific tpi version, you would need to use the hidden flag --tpi-version='=0.11.3' (I wouldn't recommend doing this, as we rarely* make any backward-incompatible changes)