Multi-Queue Fairness - Githubissues

Piszmog commented 1 year ago

With #50613 complete, the Jobs will be provided in a first-in-first-out order. This is not sustainable as some Jobs may take longer than others, or there may be too many Jobs of a single type that prevent other Jobs from being ran in a reasonable amount of time.

We should implement "fairness" that could take into account the following,

Execution time
The number of Job types
- The number of complete
- The number in progress
- The number of errored
The number of available executors of a specific Job type

Done

Implement some form of fairness
Update documentation
Unit Tests
Demo

Technical Direction

Some rough diagrams on possible solutions

Requires

50613

sanderginn commented 1 year ago

The fairness algorithm is an interesting challenge to solve. Some things that come to mind:

Should batch and codeintel jobs get a 50/50 share of execution time?
Do we want to allow users to push batch jobs to the front of the queue?
Is it feasible to estimate job execution time accurately enough (perhaps based on historical data)?

It's probably easy to over-engineer a solution. Do we have access to job metrics from large scale customers? I can imagine it'd be helpful in establishing some edge/corner cases that we need to take into consideration.

Piszmog commented 1 year ago

Should batch and codeintel jobs get a 50/50 share of execution time?

Maybe to start with? When an instance starts fresh, I think it will have to default to this behavior. I also wonder if this execution time percentage should be configurable.

We are also talking about adding another queue (packages). So this share will decrease for all queues.

Do we want to allow users to push batch jobs to the front of the queue?

I think this is a good idea. But maybe just Site Admins would have this ability.

Is it feasible to estimate job execution time accurately enough (perhaps based on historical data)?

This is one of the squishy things, to me. Historical data will be key, but also seems like such a cloudy thing. A random batch spec script could shoot execution time up substantially. Part of me wonders if an AI would be helpful here (lol).

It's probably easy to over-engineer a solution. Do we have access to job metrics from large scale customers?

Oh yes, super easy to over-engineer. I do not think we are collecting any metrics (like execution time) today. So this work may set us up to collect this information.

sanderginn commented 1 year ago

Proposed solution (some arrows refuse to render properly)

sourcegraph / sourcegraph-public-snapshot

Multi-Queue Fairness #50642

Done

Technical Direction

Requires

50613