iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

gp2 volume type is hard coded #761

Status: Closed (kaaloo closed this issue 8 months ago)

kaaloo commented 8 months ago

Hi,

I recently ran into an issue where an instance created by the CML runner kept failing. AWS support determined the cause to be IOPS limits on gp2 volumes:

This shows that the instance became unreachable due to the attached EBS volume throttling and being unable to field any more requests. This happened since the EBS volume is a gp2 type with 35GB of storage. EBS gp2 volumes have a maximum IOPS rate equal to 3 IOPS/GB [3], so this volume had a max rate of 105 IOPS.

To fix this, you can use a gp3 volume with your instance instead of a gp2 volume. GP3 volumes provide a baseline performance of 3,000 IOPS and 125 MiB/s throughput, regardless of volume size.
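
For reference, the arithmetic in the support reply can be sketched in Go. This is only an illustration, not code from the provider: the function name `gp2BaselineIOPS` and the constants are hypothetical, and the 100 IOPS floor and 16,000 IOPS cap are assumptions taken from AWS's published gp2 limits rather than from the quote above, which only states the 3 IOPS/GB ratio.

```go
package main

import "fmt"

// Assumed figures: 3 IOPS per GiB comes from the support reply;
// the floor, the cap, and the gp3 baseline follow AWS's published limits.
const (
	gp2IOPSPerGiB   = 3
	gp2MinIOPS      = 100
	gp2MaxIOPS      = 16000
	gp3BaselineIOPS = 3000
)

// gp2BaselineIOPS returns the baseline IOPS of a gp2 volume of the given
// size, clamped to the assumed floor and cap. (Hypothetical helper.)
func gp2BaselineIOPS(sizeGiB int) int {
	iops := sizeGiB * gp2IOPSPerGiB
	if iops < gp2MinIOPS {
		return gp2MinIOPS
	}
	if iops > gp2MaxIOPS {
		return gp2MaxIOPS
	}
	return iops
}

func main() {
	size := 35 // volume size from the support reply
	fmt.Printf("gp2, %d GiB: %d baseline IOPS\n", size, gp2BaselineIOPS(size))
	fmt.Printf("gp3, %d GiB: %d baseline IOPS\n", size, gp3BaselineIOPS)
}
```

Under these assumptions, the 35 GB volume works out to 3 × 35 = 105 baseline IOPS on gp2, against a flat 3,000 on gp3 regardless of size.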

Unfortunately the "gp2" VolumeType value is hard coded in the code base.

Would it make sense to expose this value as an option for CML?

dacbd commented 8 months ago

I think upgrading the volume type to gp3 for the cml runner would be an acceptable change. Would you like to make the contribution?

Does it make sense to expose as an option? IMO, no, not with our current capacity to develop new features for cml; I think it should just be a hardcoded upgrade.

kaaloo commented 8 months ago

Oh cool, yes for sure. I'll get a PR ready! Thanks for your super quick response!

kaaloo commented 8 months ago

Hi sorry, I forgot to post that I had a PR for this:

https://github.com/iterative/terraform-provider-iterative/pull/763

dacbd commented 8 months ago

Everything looks fine. We will try to address the credentials CI issue and do a merge + release soon.