nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0
637 stars 116 forks source link

Allow configuration of EC2 spot request duration #315

Closed mblackgeo closed 3 years ago

mblackgeo commented 4 years ago

When spinning up ec2 spot instances with flintrock for ad-hoc analysis, there can be cases where capacity is not available for a given instance type. Of course, one can wait until AWS releases instances, however often I find that I just cancel the request and spin up a different type of instance. However, as the default timeout of spot requests is 7 days, it may be that after several minutes/hours/days AWS finally release capacity for a cancelled / unneeded request. This then leads the situation where there are "inconsistent" clusters spun up in AWS, and flintrock has not fully configured them. In this case the instances must be manually terminated from the EC2 console.

Additionally, there can be the scenario where a request is cancelled, a new request for a different instance is then created (with the same name), and both requests end up being fulfilled, which the causes issues with describe/destroy. In this PR, I have added support for using a human readable timedelta (e.g. '7d', '5m') in configuration/cli such that the spot request timeout can be controlled by the user (and it retains the default of 7 days). This would alleviate the problems described above, where I could simply reduce the spot request duration to be much shorter than 7 days.

This PR makes the following changes:

I tested this PR by running pytest locally and adding new tests for the util function. However, I have not run any of the acceptance tests.

nchammas commented 3 years ago

Hello @mblack20 and my apologies about taking so long to chime in here! Thank you for contributing to Flintrock. I will take a look at your work and give some feedback this week.

mblackgeo commented 3 years ago

Thanks for the review!

I can understand the reticence to add yet another parameter, however I would argue this is a useful one, as it can help out with issues caused by late fulfilment of clusters, duplicate instances with the same name etc., and in fact in 2+ years of production use of flintrock this has been the only parameter I've been missing.

As for lowering the defaults, that's definitely the easier option, but I'm not sure what a sensible default would be. For example, I'd set it as low as 5 minutes, as if the instance type isn't available immediately I'd rather relaunch with another type. However, I'm not sure this would be suitable for all users and is a big departure from the default of 7 days, hence I opted for moving it to configuration instead.

Happy to hear your thoughts. Thanks!

nchammas commented 3 years ago

Looks like we missed adding this new option to the add-slaves command.

@mblack20 - No worries if you've moved on, but if you have the time, do you mind adding this new option to the add-slaves command, and testing it manually?

mblackgeo commented 3 years ago

Ah sorry I missed that, yes I'll take a look and follow up with a PR