Open klahnakoski opened 9 years ago
After running SpotManager for a week and a half with large fleets (> 600 instances), this became a rather big issue. I had to babysit SpotManager the whole time, cancelling doomed bids and then blacklisting those type/AZ combos in settings to make SpotManager move on to other types.
I also frequently ran into a situation where I'm pretty sure SpotManager was outbidding itself. It would bid on a chunk of, say, c4.large capacity in us-east-1e, and then lots of running c4.large/us-east-1e instances would be terminated just before the new bids were fulfilled. Amazon seems perfectly happy to let you outbid yourself rather than tell you there's no more available capacity of that type. Technically this saves money in comped partial instance-hours, but in my case, the lost overall efficiency and the increased per-hour price made this a losing strategy.
Just avoiding bidding in a type/AZ if there are "az-group-constraint" bids out isn't enough, because cancelling them will allow the script to waste bids on the next iteration. SpotManager will need to store intelligence about low capacity in a type/AZ (we need a term for that pairing... how about "commodity"?) and avoid bidding in it for a while.
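To make the "avoid bidding in it for a while" idea concrete, here is a minimal sketch of a per-commodity cooldown. The class name, method names, and the one-hour default are all hypothetical and not part of SpotManager; the sketch keeps state in memory only, so a real version would also need to persist it between runs.

```python
import time


class CommodityCooldown:
    """Hypothetical helper: remembers type/AZ pairs ("commodities") that
    recently looked short on capacity and blocks bids on them for a while."""

    def __init__(self, cooldown_seconds=3600, clock=time.time):
        # cooldown_seconds is an assumed default, not a tuned value
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self._blocked_until = {}  # (instance_type, zone) -> expiry timestamp

    def mark_low_capacity(self, instance_type, zone):
        """Start (or restart) the cooldown for one commodity."""
        key = (instance_type, zone)
        self._blocked_until[key] = self.clock() + self.cooldown_seconds

    def is_blocked(self, instance_type, zone):
        """True while the commodity's cooldown has not yet expired."""
        key = (instance_type, zone)
        expiry = self._blocked_until.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._blocked_until[key]  # cooldown over, forget the entry
            return False
        return True
```

The bidding loop would call mark_low_capacity() when it cancels a doomed bid and skip any commodity for which is_blocked() returns True, so the cancelled bid is not simply reissued on the next iteration.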
The termination case is especially tricky, because Amazon does not make it easy to sense exactly when your instances have been terminated. Instances themselves can tell when they're about to be terminated using the 2-minute warning API, but there's no guarantee that the warning will be sent or honored, and trying to get this information back to SpotManager is a pretty serious pain.
The only other option I've found so far is to look for terminated instances with a "state transition reason" of "spot instance termination". However, this will only tell you the time that instance was launched -- there is no "state transition time". SpotManager would have to keep tabs on running instances and store the last time it saw them running in a cache file in order to detect recent terminations. I've currently got a support ticket in with Amazon to find out if there's some missing API to get notifications when instances are terminated.
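The "last time it saw them running" cache could look roughly like the sketch below. The function name, the JSON file layout, and the idea of passing in the current list of running instance IDs are assumptions for illustration; fetching that list from EC2 and checking the state transition reason are left to the caller.

```python
import json
import os
import time


def detect_recent_terminations(running_ids, cache_path, now=None):
    """Hypothetical helper: compare the instances running now against the
    cache from the previous poll.

    Returns {instance_id: last_seen_running_timestamp} for every instance
    that was running at the previous poll but is not running now, then
    rewrites the cache with the current set.
    """
    now = time.time() if now is None else now

    last_seen = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            last_seen = json.load(f)

    running = set(running_ids)
    gone = {i: t for i, t in last_seen.items() if i not in running}

    # Only instances seen running right now are kept for the next poll.
    with open(cache_path, "w") as f:
        json.dump({i: now for i in running}, f)

    return gone
```

An instance reported here stopped running somewhere between its last-seen timestamp and the current poll, which bounds the termination time to one polling interval. It says nothing about why the instance went away, so the result would still need to be matched against the "state transition reason".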
Overall, SpotManager saved me a ton of time, and I doubt bidding on such a large fleet manually would have been feasible. With improvements in the area of avoiding bad bids, SpotManager could run the show entirely by itself without my intervention.
Good analysis! I believe you make a good argument that SpotManager must track state to be effective at avoiding rebidding; state will also enable smarter analysis and bidding. I am wondering if it is easier to blindly store all known attributes (with timestamps) and let the (future) intelligence reside in the queries off that raw data, or instead, focus on one strategy and the data it requires.
Great question! Especially with the pluggable bidding strategies you mentioned in another ticket, we really have to think this one over. Pluggable bidding strategies would imply that we should store the raw data blindly to facilitate switching algorithms at will. However, that's likely to be a heck of a lot of data -- think 600+ hosts * running every couple of minutes (in my case) * however long we're keeping history for.
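For a rough sense of scale, here is the same multiplication with numbers plugged in. The 2-minute interval and the 30-day retention are assumptions chosen for illustration; only the 600-host figure comes from the fleet described above.

```python
HOSTS = 600                 # fleet size mentioned above
POLL_INTERVAL_MINUTES = 2   # assumption: "every couple of minutes"
RETENTION_DAYS = 30         # assumption: history length is undecided

polls_per_day = 24 * 60 // POLL_INTERVAL_MINUTES   # 720 polls per day
records_per_day = HOSTS * polls_per_day            # one record per host per poll
records_retained = records_per_day * RETENTION_DAYS

print(records_per_day, records_retained)  # 432000 12960000
```

That is one record per host per poll; storing every known attribute would multiply the volume further.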
I think instead that we'll have to decide what pieces of information a bidding strategy might want and write code to make it available. We could get really crazy and make a set of data plugins that each gathers an individual time series of information (number of instances running, bids we've made over time, etc) and then have bidding strategies declare dependencies on data plugins, but this is starting to get pretty complex :)
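As a sketch of what "bidding strategies declare dependencies on data plugins" might mean in practice, consider the following. Every class, attribute, and function name here is hypothetical, and the snapshot dict stands in for whatever SpotManager learns on each poll.

```python
class RunningCount:
    """Hypothetical data plugin: number of running instances per poll."""
    name = "running_count"

    def collect(self, snapshot):
        return len(snapshot["running"])


class OpenBidCount:
    """Hypothetical data plugin: number of open bids per poll."""
    name = "open_bid_count"

    def collect(self, snapshot):
        return len(snapshot["bids"])


class HoldSteadyStrategy:
    """Hypothetical strategy that only needs the running-instance series."""
    requires = ("running_count",)

    def instances_lost(self, series):
        counts = series["running_count"]
        return max(counts) - counts[-1]


def record(strategy, plugins, snapshot, history):
    """Append one sample for each series the strategy declares,
    and return just those series."""
    for name in strategy.requires:
        history.setdefault(name, []).append(plugins[name].collect(snapshot))
    return {name: history[name] for name in strategy.requires}
```

Only the series named in requires are ever collected, which is the point of the design: the amount of stored history follows from the strategy in use rather than from everything that could be recorded.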
Going the non-pluggable route and just implementing what I've mentioned here might be a good start. We can always write with a pluggable architecture in mind but not go whole-hog yet.
Currently, bids postponed by "az-group-constraint" are only cancelled in save_money() and the lifecycle watcher, neither of which applies to the general case. That leaves only the valid_until to kill those bids off. Furthermore, SpotManager will continue to bid on instance_type/AZ combinations that are doomed to "az-group-constraint". These live bids consume budget and slow down instance creation.
SpotManager should avoid bidding on any type/zone combo in which there's an "az-group-constraint" bid out. It should cancel "az-group-constraint" bids immediately, freeing up budget, and try other instance_type/AZ combinations.
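The proposed behaviour can be sketched as a single pass over the open bids. The function name and the bid dict layout (id, type, zone, status) are assumptions for illustration; SpotManager's real data structures and the actual cancel call are not shown.

```python
AZ_GROUP_CONSTRAINT = "az-group-constraint"


def plan_around_constrained_bids(open_bids, candidates):
    """Hypothetical helper.

    open_bids:  list of dicts with "id", "type", "zone" and "status" keys
    candidates: list of (instance_type, zone) pairs under consideration

    Returns (ids_to_cancel, remaining_candidates).
    """
    constrained = [b for b in open_bids if b["status"] == AZ_GROUP_CONSTRAINT]

    # Cancel every constrained bid right away to free up its budget.
    ids_to_cancel = [b["id"] for b in constrained]

    # Do not rebid on any type/zone combo that just hit the constraint.
    blocked = {(b["type"], b["zone"]) for b in constrained}
    remaining = [c for c in candidates if c not in blocked]

    return ids_to_cancel, remaining
```

Note that this only blocks a combo while the constrained bid is still visible; once the bid is cancelled, something has to remember the combo across iterations or the same bid will be placed again.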