
Worker manager launch configs #191

Closed lotas closed 2 months ago

lotas commented 3 months ago

Below is the proposal to introduce Launch Configurations as standalone entities in Worker Manager.

It also outlines some "self-healing" diagnostic metrics that could potentially reduce the number of errors we see during provisioning (such as missing resources or exceeded quotas).

It also proposes exposing more events from worker manager to make it more transparent and to allow external systems to react and build advanced monitoring solutions around it.

rendered rfcs/0191-worker-manager-launch-configs.md
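For illustration only, here is a rough TypeScript sketch of what a payload for one of the proposed additional events could carry; the event name and every field are assumptions made for this sketch, not the schema defined in the rendered RFC.

```typescript
// Hypothetical payload for a "launch config error" event published by
// worker-manager. All names and fields are illustrative assumptions;
// the actual event schema is defined in the RFC itself.
interface LaunchConfigErrorEvent {
  workerPoolId: string;      // e.g. "proj-example/my-pool" (made-up example)
  launchConfigId: string;    // identifier of the standalone launch configuration
  provider: string;          // e.g. "aws", "gcp", "azure"
  errorKind: string;         // e.g. "quota-exceeded", "resource-missing"
  message: string;           // human-readable error description
  reportedAt: string;        // ISO 8601 timestamp
}

// An external monitoring system could consume such events to track
// error rates per launch configuration and react (alerts, dashboards, etc.).
function isQuotaError(event: LaunchConfigErrorEvent): boolean {
  return event.errorKind === "quota-exceeded";
}
```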

gabrielBusta commented 3 months ago

This is a well-thought-out proposal! I really like the inclusion of additional events and the new API endpoints. A thought: if all launch configurations in a worker pool have a low weight due to a high number of failures in every region, would that cause a halt in provisioning? Maybe that's a scenario worth considering.

lotas commented 3 months ago

> This is a well-thought-out proposal! I really like the inclusion of additional events and the new API endpoints. A thought: if all launch configurations in a worker pool have a low weight due to a high number of failures in every region, would that cause a halt in provisioning? Maybe that's a scenario worth considering.

Thanks @gabrielBusta, this is indeed a likely scenario, and to be honest I'm not really sure how to tackle it. I think if every region has the same number of errors, we might give them all a weight of 0 and pause provisioning temporarily. But because we would only check the last 30-60 minutes for errors (the exact window still needs to be figured out), it would "get back to normal" afterwards. This logic is still a bit blurry. Do you have something in mind?
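As a hedged sketch of the weighting idea above, assuming a sliding error window of roughly 30-60 minutes; the window length, the decay formula, and the names are invented for illustration and are not part of the proposal.

```typescript
// Sketch: derive a launch-config weight from the number of provisioning
// errors observed in a recent time window. Formula and names are
// illustrative assumptions, not the RFC's design.
const ERROR_WINDOW_MS = 45 * 60 * 1000; // assumed 45-minute window

function weightFromErrors(errorTimestamps: number[], now: number = Date.now()): number {
  const recentErrors = errorTimestamps.filter(t => now - t <= ERROR_WINDOW_MS).length;
  // Each recent error halves the weight; with no recent errors the weight is 1.
  return 1 / Math.pow(2, recentErrors);
}
```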

ahal commented 3 months ago

My impression was that the weight would only affect the probability relative to the other launch configs, but that the overall rate of provisioning wouldn't change. E.g., if all launch configs have a weight of 0, that's the same as all launch configs having a weight of 1.

Is there a big downside to trying to provision too much when there are lots of errors? If so, then I think slowing down the overall rate would be a nifty feature... though I don't think we should ever turn it off completely (even for an hour). We probably want to attempt to provision with at least one of the launch configs every 1-5 minutes.
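To make that concrete, a minimal sketch of weighted selection where an all-zero weight set falls back to uniform selection instead of halting provisioning; the data shapes and the fallback behaviour are assumptions, not the provisioner's actual algorithm.

```typescript
// Sketch: pick a launch config by weight, falling back to uniform
// selection when every weight is zero, so provisioning never stops.
interface WeightedLaunchConfig {
  launchConfigId: string;
  weight: number; // >= 0
}

function pickLaunchConfig(configs: WeightedLaunchConfig[]): WeightedLaunchConfig {
  const total = configs.reduce((sum, c) => sum + c.weight, 0);
  if (total === 0) {
    // All weights are zero: treat them as equal instead of halting.
    return configs[Math.floor(Math.random() * configs.length)];
  }
  // Standard weighted random pick.
  let r = Math.random() * total;
  for (const c of configs) {
    r -= c.weight;
    if (r <= 0) return c;
  }
  return configs[configs.length - 1];
}
```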

lotas commented 3 months ago

> It's worth also talking about how this would interact with standalone and static workers. I assume that's fairly trivial, but might be relevant when e.g., a config is updated. Maybe static workers can detect that they are running with an out-of-date launchConfig and restart to get the new config.

Hmm... I thought static and standalone workers don't have anything configured on the worker pool side, so they shouldn't be affected at all. Unless I'm missing something, @petemoore @djmitche?

djmitche commented 3 months ago

Standalone workers don't have anything configured in worker-manager. Static do -- they get their config from worker-manager, but they are not dynamically provisioned.

lotas commented 3 months ago

> Standalone workers don't have anything configured in worker-manager. Static do -- they get their config from worker-manager, but they are not dynamically provisioned.

Indeed... To my surprise I discovered that the static worker pool uses a slightly different schema, which kind of breaks my assumption that all worker pool definitions are similar :) We could probably update the static worker pool schema to have a single launchConfig so that the proposed solution works uniformly for all pools, or we would keep treating non-aws/gcp/azure providers as before.
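Purely as a sketch of the schema unification mentioned above, a static worker pool definition carrying a single launchConfig might look roughly like this; the field names are assumptions for illustration, not the real worker-manager schema.

```typescript
// Illustrative shape only: a static worker pool definition with a single
// launchConfig, mirroring the per-config shape used by cloud providers.
// Field names are assumptions, not the actual worker-manager schema.
const staticPoolConfig = {
  providerId: "static-provider",
  description: "Statically provisioned workers",
  config: {
    launchConfig: {
      workerConfig: {
        // worker settings delivered to the worker at registration time
      },
    },
  },
};

console.log(JSON.stringify(staticPoolConfig, null, 2));
```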

djmitche commented 3 months ago

I think that makes sense. Then when that launchConfig changes, workers with the old launchConfig can detect and react to that (if implemented).
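A hedged sketch of how such detection could work, assuming a hypothetical fetchCurrentLaunchConfig helper (not an existing worker-manager API) and a simple hash comparison of the config the worker started with.

```typescript
import { createHash } from "crypto";

// Sketch only: a static worker periodically compares a hash of the
// launchConfig it started with against the one currently stored in
// worker-manager, and restarts itself when they differ.
// Both helpers below are hypothetical, not real APIs.
declare function fetchCurrentLaunchConfig(workerPoolId: string): Promise<object>;
declare function gracefulRestart(): Promise<void>;

function configHash(config: object): string {
  return createHash("sha256").update(JSON.stringify(config)).digest("hex");
}

async function checkLaunchConfig(workerPoolId: string, startupConfig: object): Promise<void> {
  const current = await fetchCurrentLaunchConfig(workerPoolId);
  if (configHash(current) !== configHash(startupConfig)) {
    // The pool's launchConfig changed since this worker started.
    await gracefulRestart();
  }
}
```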

lotas commented 3 months ago

Thanks Pete, this is also important, and I think I finally understood the root problem there: worker statuses!

So in that particular case, the steps that led to the launch of many workers without any of them doing anything were the following:

pending tasks: 1, desired capacity: 1, requested: 1

This looks good, but as soon as that worker was fully launched, it turned into existingCapacity, so on the next run we have:

existing capacity: 1, pending: 1, desired capacity: 2 (+1), requested: 1

and so it goes on: the workers start but never call claimWork, and that pending count keeps spawning new workers.

I think this is simply a bug in the provisioner, or rather a consequence of it not knowing the actual number of tasks being claimed: it treats existing capacity as something that is already working on tasks, when in fact it isn't. Since we refactored the queue internals, we could also tell how many tasks are being claimed for a particular pool, and then do some simple arithmetic to notice that claimed + pending <> existing + requested.
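A minimal sketch of that arithmetic, assuming the queue can report how many tasks a pool's workers have actually claimed; the names and the exact formula are illustrative, not the provisioner's real code.

```typescript
// Sketch: estimate how much new capacity to request, counting only the
// existing capacity that is actually claiming tasks. Names and formula
// are illustrative assumptions, not the provisioner's real logic.
interface PoolCounts {
  pendingTasks: number;      // tasks waiting in the queue
  claimedTasks: number;      // tasks currently claimed by this pool's workers
  existingCapacity: number;  // workers already running
  requestedCapacity: number; // workers requested but not yet running
}

function capacityToRequest(c: PoolCounts): number {
  // Capacity that exists but is not working on anything, e.g. workers
  // that started but never called claimWork.
  const idleExisting = Math.max(0, c.existingCapacity - c.claimedTasks);
  // Only request more if pending work exceeds idle plus already-requested capacity.
  return Math.max(0, c.pendingTasks - idleExisting - c.requestedCapacity);
}

// With numbers like the example above (pending 1, claimed 0, existing 1,
// nothing still outstanding), the idle existing worker already covers the
// pending task, so no additional capacity would be requested.
```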

UPD: created https://github.com/taskcluster/taskcluster/issues/7052