gitpod-io / gitpod

The developer platform for on-demand cloud development environments to create software faster and more securely.
https://www.gitpod.io
GNU Affero General Public License v3.0

Epic: Spot VMs for headless workspaces #7929

Open sagor999 opened 2 years ago

sagor999 commented 2 years ago

Summary

We currently use regular VMs for our headless node pool (prebuilds). Spot VMs, which are 60-90% cheaper, would be more efficient. With spot VMs we can get both cheaper and faster VMs, since we can use bigger VMs (faster prebuilds) and still save on cost. Spot VMs can be shut down at any moment, though (in GKE, Google gives several minutes to shut down a VM safely; we need to check whether the same applies to GCP VMs), so we need to make sure that a prebuild running at that moment can be restarted on a different VM. In general, those shutdowns shouldn't happen frequently.
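To make the "bigger and still cheaper" claim concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the on-demand price scales linearly with VM size; `relativeCost` and the size multiplier are hypothetical and not real GCP pricing data.

```go
package main

import "fmt"

// relativeCost returns the cost of a spot VM relative to the regular VM
// used today. discount is the spot discount (0.60 means 60% cheaper) and
// sizeMultiplier is how much bigger the spot VM is (2.0 means twice the
// resources). ASSUMPTION: on-demand price grows linearly with size.
func relativeCost(discount, sizeMultiplier float64) float64 {
	return sizeMultiplier * (1 - discount)
}

func main() {
	// The 60% and 90% bounds come from the summary above.
	for _, d := range []float64{0.60, 0.90} {
		fmt.Printf("discount %.0f%%: same size %.2fx, double size %.2fx\n",
			d*100, relativeCost(d, 1), relativeCost(d, 2))
	}
}
```

Under that linear-pricing assumption, a VM twice as large would still cost less than the current one anywhere in the 60-90% discount range.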

Another option is to have a bare metal cluster (rented outside of GCP). But I think we can pursue both options and use spot VMs in regions where we might not be able to get a bare metal cluster.

Context

Slack thread: https://gitpod.slack.com/archives/C02F19UUW6S/p1643645109149259

Value

Lower cost for Gitpod, faster prebuilds for users. Win-win.

Potential savings on the headless cost:

(internal) Additional calculations

Acceptance criteria

Prebuilds should be restartable in the event that they did not finish. The headless pool should work without issues in a node pool consisting of spot VMs only.

Measurement

Hypothesis

In scope

See if we can include image builds. Additionally, it would help if the retry mechanism could live on the workspace side and not have to be triggered by webapp.

Out of scope

Complexities

kylos101 commented 2 years ago

@atduarte for your consideration. :smile:

atduarte commented 2 years ago

Actually, had the opportunity to talk about this epic with @sagor999 in our coffee chat 😅 I agree this is something we want to do, and it's important we have it represented in the project (Thank you!) but currently for me it isn't in the top list, given the number of fires we are trying to put out. Agree with the order here.

sagor999 commented 2 years ago

An alternative solution is to spin up our own bare metal cluster (in Hetzner, e.g.) that would only do prebuilds. It might be much cheaper than running this in GCP, even on spot VMs. (Idea from Alejandro)

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

kylos101 commented 1 year ago

  1. We could use an informer in ws-manager to listen for when a node is gone (event: NodeShutdown, maybe?).
  2. When the informer tells us a node is gone, ws-manager checks whether the pod finished and, if not, deletes and recreates the pod.
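A minimal sketch of the decision in step 2, in plain Go with no Kubernetes dependencies. Everything here is hypothetical: `Pod`, `Phase`, `podsToRecreate`, and the `maxRetries` budget are illustrative stand-ins, not ws-manager or client-go types. A real version would presumably be driven by an informer callback rather than a slice of pods.

```go
package main

import "fmt"

// Phase is a simplified, hypothetical stand-in for a pod phase.
type Phase string

const (
	Running   Phase = "Running"
	Succeeded Phase = "Succeeded"
	Failed    Phase = "Failed"
)

// Pod is a hypothetical, trimmed-down view of a prebuild pod.
type Pod struct {
	Name    string
	Node    string
	Phase   Phase
	Retries int // how many times this prebuild was already recreated
}

// finished reports whether the pod reached a terminal phase on its own.
func finished(p Pod) bool {
	return p.Phase == Succeeded || p.Phase == Failed
}

// podsToRecreate returns the names of pods that were on the node that
// went away, had not finished, and still have retry budget left.
// ASSUMPTION: a retry limit exists; the thread does not specify one.
func podsToRecreate(pods []Pod, goneNode string, maxRetries int) []string {
	var names []string
	for _, p := range pods {
		if p.Node != goneNode || finished(p) || p.Retries >= maxRetries {
			continue
		}
		names = append(names, p.Name)
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "prebuild-a", Node: "spot-1", Phase: Running},
		{Name: "prebuild-b", Node: "spot-1", Phase: Succeeded},
		{Name: "prebuild-c", Node: "spot-2", Phase: Running},
		{Name: "prebuild-d", Node: "spot-1", Phase: Running, Retries: 3},
	}
	// Simulate the "node is gone" notification for spot-1.
	fmt.Println(podsToRecreate(pods, "spot-1", 3))
}
```

Keeping the decision in a pure function like this would make the retry rule easy to unit test, independent of how the node-gone notification is actually delivered.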

Question:

  1. What would the phase life cycle look like for retries?