@galargh: not disagreeing that faster is better. How does this boot time compare to what happens with non-self-hosted runners? I'm trying to get a sense of why to prioritize this now.
With hosted runners, a runner is available within seconds. With self-hosted, it's up to 2 minutes. This is fine for long workflows, where the speedup far outweighs the boot penalty, but it becomes an issue when we want to migrate shorter workflows (which is the case in libp2p/quic-go).
I looked at a couple of instances at random and it looks like machine provisioning (from instance up to job started) is under 30s, which is pretty decent. I upped the instance count limits because I noticed in the metrics that we quite often operate above them (note to self: we need multi-select on org in the monitoring dashboard). I'm going to have a look at the job queued → lambda triggered and lambda triggered → instance up intervals. It'd be nice to have continuous insight into these, but that would make this a much bigger task.
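As a starting point for the queue-time side of this, here's a minimal sketch (not part of the current setup) that pulls job timings for a repository via the GitHub REST API and reports how long each job waited before a runner picked it up. The repository name, the token environment variable, and the reliance on the job object exposing `created_at` and `started_at` are all assumptions for illustration.

```python
import os
from datetime import datetime

import requests  # assumes the requests library is installed

# Illustrative values; adjust to the repo/org actually being measured.
REPO = "libp2p/quic-go"
TOKEN = os.environ["GITHUB_TOKEN"]
API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}


def parse(ts: str) -> datetime:
    # GitHub timestamps look like 2023-01-01T12:00:00Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def queue_times(per_page: int = 10) -> None:
    # Fetch the most recent workflow runs for the repo.
    runs = requests.get(
        f"{API}/repos/{REPO}/actions/runs",
        headers=HEADERS,
        params={"per_page": per_page},
    ).json()["workflow_runs"]

    for run in runs:
        jobs = requests.get(
            f"{API}/repos/{REPO}/actions/runs/{run['id']}/jobs",
            headers=HEADERS,
        ).json()["jobs"]
        for job in jobs:
            # created_at -> started_at approximates "job queued -> runner picked it up".
            if job.get("created_at") and job.get("started_at"):
                wait = parse(job["started_at"]) - parse(job["created_at"])
                print(f"{run['id']} {job['name']}: queued for {wait.total_seconds():.0f}s")


if __name__ == "__main__":
    queue_times()
```

This only covers the GitHub-visible part of the interval; the lambda and EC2 sides would still need the separate measurements mentioned above.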
In the normal case, job created → lambda scaling up takes < 5s, and instance requested → init starting takes < 10s. Some ideas on why we might be hitting degraded performance (compared to these numbers):
For now, I'll leave it at this: there is no obvious single place that requires optimisation; we upped the overall runner type limits; and we need to set up continuous monitoring for the self-hosted runner lifecycle (from job requested to runner deprovisioned).
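For the continuous-monitoring piece, one possible approach (a sketch, not how the current lambdas are instrumented) is to publish custom CloudWatch metrics for each lifecycle interval, so that job queued → lambda triggered and lambda triggered → instance up show up on a dashboard without ad-hoc digging. The namespace, metric name, and dimension below are made up for illustration.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def record_lifecycle_interval(stage: str, seconds: float, runner_type: str) -> None:
    """Publish one lifecycle interval (e.g. 'job-queued-to-lambda-triggered') as a custom metric.

    Namespace, metric name, and dimension are illustrative, not existing conventions.
    """
    cloudwatch.put_metric_data(
        Namespace="GitHubRunners/Lifecycle",
        MetricData=[
            {
                "MetricName": stage,
                "Dimensions": [{"Name": "RunnerType", "Value": runner_type}],
                "Unit": "Seconds",
                "Value": seconds,
            }
        ],
    )


# Example: a scale-up lambda could call this with the measured interval.
# record_lifecycle_interval("job-queued-to-lambda-triggered", 4.2, "linux-x64-large")
```

Emitting these from the existing lambdas would also make the multi-select-on-org dashboard request easier to satisfy, since the data would live in one metric namespace.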
We should investigate whether we can decrease the boot-up time.