Open sagor999 opened 2 years ago
@atduarte for your consideration. :smile:
Actually, I had the opportunity to talk about this epic with @sagor999 in our coffee chat 😅 I agree this is something we want to do, and it's important that we have it represented in the project (thank you!), but for me it currently isn't near the top of the list, given the number of fires we are trying to put out. I agree with the order here.
An alternative solution is to spin up our own bare metal cluster (in Hetzner, for example) that would only do prebuilds. It might be much cheaper than running this in GCP, even on spot VMs. (Idea from Alejandro)
ws-manager, to listen for when a node is gone, event: NodeShutdown, maybe?
ws-manager, see if the pod finished, and if not delete and recreate the pods
Question:
Summary
Currently we are using regular VMs for our headless node pool (prebuilds). It would be more efficient to use spot VMs, which are 60-90% cheaper. With spot VMs we can have VMs that are both cheaper and faster, since we can use bigger VMs (faster prebuilds) and still save on cost. Spot VMs can be shut down at any moment, though (in GKE, Google gives several minutes to shut down the VM safely; we need to check whether the same applies to GCP VMs), so we need to make sure that if a prebuild was running, we can restart it on a different VM. In general, those shutdowns shouldn't happen frequently.
Another option is to have a bare metal cluster (rented outside of GCP). But I think we can pursue both options and have spot VMs for regions where we might not be able to get a bare metal cluster.
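To make the "bigger VMs and still save on cost" point concrete, here is a rough back-of-the-envelope sketch. It assumes VM price scales linearly with VM size, which is a simplification, and the function names and the example price are hypothetical, not taken from any GCP price list.

```python
def max_size_multiple(discount_pct: int) -> float:
    """How many times bigger (in price terms) a spot VM can be before it
    costs as much as the current regular VM.

    discount_pct is the spot discount in percent, e.g. 60 for "60% cheaper".
    """
    if not 0 <= discount_pct < 100:
        raise ValueError("discount_pct must be in [0, 100)")
    return 100 / (100 - discount_pct)


def spot_cost(regular_price: float, discount_pct: int, size_multiple: float = 1.0) -> float:
    """Cost of a spot VM that is size_multiple times the current VM.

    Assumption: price grows linearly with size (hypothetical simplification).
    """
    if not 0 <= discount_pct < 100:
        raise ValueError("discount_pct must be in [0, 100)")
    return regular_price * size_multiple * (100 - discount_pct) / 100


if __name__ == "__main__":
    # Hypothetical regular price of 1.0 (any currency, any time unit).
    for pct in (60, 90):
        print(pct, max_size_multiple(pct), spot_cost(1.0, pct, size_multiple=2.0))
```

Under that linear-pricing assumption, a VM twice the size is still cheaper than the current regular VM at either end of the 60-90% discount range.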
Context
Slack thread: https://gitpod.slack.com/archives/C02F19UUW6S/p1643645109149259
Value
Cheaper cost for Gitpod, faster prebuilds for users. Win-win.
Potential savings on the headless cost:
(internal) Additional calculations
Acceptance criteria
Prebuilds should be restartable in the event that they did not finish. The headless pool should work without issues in a node pool consisting of spot VMs only.
Measurement
Hypothesis
In scope
See if we can include image builds. Additionally, it would help if the retry mechanism can live on the workspace side and does not have to be triggered by webapp.
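One possible shape for a workspace-side retry, following the idea from the comments (react when a node is gone, check whether the pod finished, and if not delete and recreate it), is sketched below. This is a minimal illustration only: `Pod`, `pods_to_restart`, `handle_node_gone`, and the `delete`/`recreate` callbacks are hypothetical names, not ws-manager or Kubernetes APIs, and how the "node gone" signal is actually obtained is left open.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pod:
    """Hypothetical stand-in for a prebuild pod; not a real Kubernetes type."""
    name: str
    node: str
    finished: bool  # True once the prebuild ran to completion


def pods_to_restart(gone_node: str, pods: Iterable[Pod]) -> List[Pod]:
    """Select pods that were scheduled on the lost node and did not finish."""
    return [p for p in pods if p.node == gone_node and not p.finished]


def handle_node_gone(
    gone_node: str,
    pods: Iterable[Pod],
    delete: Callable[[Pod], None],
    recreate: Callable[[Pod], None],
) -> List[str]:
    """Delete and recreate every unfinished pod from the lost node.

    delete and recreate are injected callbacks (assumptions for this sketch);
    a real implementation would talk to the cluster here.
    Returns the names of the pods that were restarted.
    """
    restarted = []
    for pod in pods_to_restart(gone_node, pods):
        delete(pod)
        recreate(pod)
        restarted.append(pod.name)
    return restarted
```

Keeping the selection logic (`pods_to_restart`) separate from the side effects makes the decision easy to test without a cluster, and webapp does not need to be involved in triggering the retry.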
Out of scope
Complexities