nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0
636 stars 116 forks source link

Refactor how we launch spot instances #366

Closed nchammas closed 7 months ago

nchammas commented 7 months ago

This PR refactors how we launch spot instances to migrate away from RequestSpotInstances, which is deprecated.

Instead, we now use the same method -- RunInstances -- whether we are launching on-demand or spot instances. This simplifies the launch code.

This PR also deprecates the --ec2-spot-request-duration option, which does not apply to one-time spot instances (as opposed to persistent spot instances). The distinction between one-time vs. persistent did not exist in the RequestSpotInstances API. However, one-time spot instances map most closely to Flintrock's existing handling of spot instances.

In the future, as previously mentioned here, it would be good to somehow delegate Flintrock's handling of all the various EC2 options down to boto3, instead of exposing individual options and having to track changes to the EC2 API. That wouldn't have spared us this migration from RequestSpotInstances to RunInstances, but it would let users get more of what EC2 can do out of the box without requiring Flintrock to explicitly enable it. But that's a future PR.

I tested this PR by running the full test suite, including acceptance tests, as well as manual tests of launching spot instance clusters and add spot nodes to existing clusters.

Related to #315. Fixes #364.

nchammas commented 7 months ago

cc @mblackgeo - If you're still using Flintrock, I'm curious to know if you agree with my summary of why we don't need --ec2-spot-request-duration anymore. (Asking since you added this feature in #315.)

nchammas commented 7 months ago

cc @maxpoulain

mblackgeo commented 7 months ago

cc @mblackgeo - If you're still using Flintrock, I'm curious to know if you agree with my summary of why we don't need --ec2-spot-request-duration anymore. (Asking since you added this feature in #315.)

Hey thanks for looping me in! Yes we indeed still use Flintrock and with this change to one-time requests it makes sense that the spot request duration is no longer needed. I assume that a Ctrl + C when launching instances with Flintrock will cancel the boto/ec2 request?

nchammas commented 7 months ago

That is correct, Flintrock will cancel the request and terminate any instances that may have launched as part of the same operation. If you find a case where that doesn't happen, that's a bug.

maxpoulain commented 7 months ago

Hi !

Thanks for the cc !

I just have a question and I haven't found any information in boto documentation..

As ec2.create_instances is going to be used for spot instances creation and ec2.create_instances returns a list of instances. What's happen if there is not enough available instances for example ? Is ec2.create_instances going to return provided instances even if it's not the number of instances wanted ? or ec2.create_instances is going to wait until all instances are available ? or ec2.create_instances is just going to fail ?

nchammas commented 7 months ago

Good question. It will fail relatively quickly with a message like this:

$ flintrock launch test
Launching 1 master and 1 slave...
An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max 
  retries: 4): There is no Spot capacity available that matches your request.
No cluster test in region us-east-1.

I believe this is due to how the underlying RunInstances API works, which seems cleaner than the deprecated RequestSpotInstances and also eliminates the need for a spot request duration.

Does this match your expectations, or were you hoping for different behavior?

maxpoulain commented 7 months ago

Yes, it meets our expectations perfectly, it's better/cleaner than using RequestSpotInstances !