rackerlabs / spot-roadmap

Allow use of non pre-emptible instances to mitigate risk of loss of capacity? #10

Open sirishkr opened 8 months ago

sirishkr commented 8 months ago

It's been discussed internally in the team for some time now, but a good discussion is in the thread here: https://www.reddit.com/r/kubernetes/comments/1bdqys7/comment/kurlk62/?utm_source=share&utm_medium=web2x&context=3

I've read a bit more of your FAQ and so on, and there is pretty much a 0% chance I ever use this in its current state. The idea that anybody can outbid me and kill my entire production cluster is terrifying. There needs to be some mechanism to ensure people can keep a minimum of resources. And that mechanism can't be to make a super high bid and basically give you unlimited access to my wallet.

I don't even understand why I'm explaining this fear to a hosting company. Would you be OK running the spot.rackspace.com console and UI on such a system? Would your business be comfortable with a 0% SLA? The person pushing this business model clearly never ran anything in production, or got chewed out by upper management because "the website is slow".

Bids could be capped at a certain maximum. I would maybe bid for 2-3 workers at that maximum, where I'm guaranteed to never be outbid, and then bid lower for the other spot instances.

You see, that's at the center of my fears right there. You might not set the price, but I don't either. Others set the price by bidding. By saying "you", you're bundling all your clients together. But we are not responsible for my services, I am.

Some multi-billion dollar business, somewhere in the solar system, can suddenly have a super duper urgent need for ALL the CPU they can get for 1 hour, bid 10x whatever my bid is and drain all my nodes in 5 minutes flat. That scenario is extremely unlikely, but its probability is still non-zero. It's unacceptable for the same reason you wouldn't run a datacenter with no backup generators, even if you're connected to 2 different power grids.

I get that spot instances are interesting for batch jobs. But running any app that has SLAs needs some non-preemptible resources. I've spent my whole career as a Sysadmin and then SRE learning how to make services available as close to 100% of the time as possible, and this is the exact opposite by design. Even if I need to run something on the cheap, running 100% spot instances is just asking to not sleep well forever.