mozilla / cia-tasks

Various tasks for CI Automation team
Mozilla Public License 2.0

What does ahal want? #10

Open klahnakoski opened 4 years ago

klahnakoski commented 4 years ago

@ahal is looking for another metric. What could it be?

ahal commented 4 years ago

When thinking about which scheduler is the "best", there are two inputs to consider:

  1. The regression detection rate
  2. The number of resources scheduled

Currently our "Regressions detected per 1000 tasks" metric takes both of these things into account, spitting out a single number that can be used to determine which scheduler is best.

My request is to tune the metric such that it can weight the "regression detection rate" differently than the "number of resources scheduled". This would allow us to tweak how important each input is relative to the other when computing the final number.

One possible use case is that on try, we might decide having a higher regression detection rate is more important. So we can create a new metric just for try that weights the regression detection a little higher.

marco-c commented 4 years ago

The current formula is 1000 * percentage of caught regressions / average number of scheduled tasks. We could simply do 1000 * A * percentage of caught regressions * 1/(B * average number of scheduled tasks), and choose the A and B constants to weight regressions or scheduled tasks differently.

With A=1, B=1:

  - 80% scheduling 100 tasks on average => 8
  - 90% scheduling 120 tasks on average => 7,5

With A=2,5, B=1:

  - 80% scheduling 100 tasks on average => 16
  - 90% scheduling 120 tasks on average => 18,75
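The weighted formula is easy to check with a short sketch; the scheduler figures are the hypothetical ones from the examples above:

```python
# Weighted version of the "regressions per 1000 tasks" metric:
#   score = 1000 * A * caught_fraction / (B * avg_tasks)
def weighted_score(caught_fraction, avg_tasks, A=1.0, B=1.0):
    return 1000 * A * caught_fraction / (B * avg_tasks)

# With A=1, B=1 (the current, unweighted metric):
print(weighted_score(0.80, 100))  # 8.0
print(weighted_score(0.90, 120))  # 7.5

# Raising A scales the score without changing the formula's shape:
print(weighted_score(0.90, 120, A=2.5))  # 18.75
```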

Another option would be to just consider those metrics (regression detection rate and number of tasks scheduled) separately, without trying to come up with a "magic formula" to consider them both at the same time. It would take more time to choose the best scheduler as you need to take into account two numbers, but since we have a reasonably small number of shadow schedulers at any given time it's not too much work.

klahnakoski commented 4 years ago

Let r = percentage of caught regressions and t = average number of scheduled tasks.

The weighted plan is really only one weight; a relative weight:

    score = 1000 * A * r * 1/(B * t)
          = 1000 * A/B * r/t
          = K * r/t
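A quick sketch of this point: since A and B collapse into a single constant K, any choice of weights preserves the ranking of schedulers (the scheduler names and figures here are hypothetical):

```python
def score(r, t, A=1.0, B=1.0):
    # score = 1000 * A * r * 1/(B * t) = 1000 * (A/B) * r/t
    return 1000 * A * r / (B * t)

# hypothetical shadow schedulers: (caught-regression fraction, avg tasks)
schedulers = {"sched_x": (0.80, 100), "sched_y": (0.90, 120)}

def rank(A, B):
    return sorted(schedulers, key=lambda s: score(*schedulers[s], A, B), reverse=True)

# Any (A, B) gives the same ordering; only the scale of the score changes:
print(rank(1, 1))    # ['sched_x', 'sched_y']
print(rank(2.5, 1))  # ['sched_x', 'sched_y']
print(rank(1, 7))    # ['sched_x', 'sched_y']
```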

Please note that with A=2,5, 80% scheduling 100 tasks on average => 20 (not 16). More to the point, because A and B collapse into the single constant K, the weighting strategy cannot change the relative ranking of schedulers under this particular formula.

We could consider "weighted" geomean, but the same math applies to weights:

score = sqrt(Ar * B/t)
      = K sqrt(r/t)

Instead of r = percentage of caught regressions, we can consider the number-of-nines nn(r) = -log10(1-r), which maintains order (nn(r1) > nn(r2) exactly when r1 > r2), provides more discrimination as we approach r = 100%, and is probably a better measure of the human experience with regard to misses:

r = 80% => nn(r) = 0.69
r = 90% => nn(r) = 1.00
r = 99% => nn(r) = 2.00
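The number-of-nines transform is easy to sanity-check (values below are rounded, so the printout may differ in the last digit from the table above):

```python
from math import log10

def nn(r):
    # "number of nines": -log10(1 - r); grows without bound as r -> 100%,
    # giving more discrimination between e.g. 99% and 99.9% than r itself does
    return -log10(1 - r)

for r in (0.80, 0.90, 0.99):
    print(f"r = {r:.0%} => nn(r) = {nn(r):.2f}")

# order is preserved: nn(r1) > nn(r2) exactly when r1 > r2
assert nn(0.99) > nn(0.90) > nn(0.80)
```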

We could consider geomean of the nn(r) and t

score = sqrt(nn(r) / t)
      = sqrt(-log10(1-r)/t)

Since the sqrt can be approximated with a constant slope over any range of values we find interesting, and we are interested in the difference between values, not the absolute values...

    simple_score = -log10(1-r)/t
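A sketch of the resulting simple_score, applied to the hypothetical figures used earlier in the thread:

```python
from math import log10

def simple_score(r, t):
    # number-of-nines per scheduled task: -log10(1 - r) / t
    return -log10(1 - r) / t

# 80% detection with 100 tasks vs 90% detection with 120 tasks
print(simple_score(0.80, 100))  # ~0.0070
print(simple_score(0.90, 120))  # ~0.0083
```

Note that unlike the linear r/t metric (which scores these two 8 vs 7.5), the extra discrimination near 100% now favors the scheduler that catches more regressions.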

marco-c commented 4 years ago

> The weighted plan is really only one weight; a relative weight:

Yep, in the end it's just A/B, but controlling them separately makes it easier to reason about.

Yet another option, if we don't want to choose based on two numbers, would be to define detection-rate thresholds that we can't go under (e.g. 95% on try and 85% on autoland), and then choose the scheduler that (approximately) matches those detection rates while scheduling the smallest number of tasks.
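That threshold-based selection could look something like this sketch (the threshold and scheduler data are hypothetical):

```python
def pick_scheduler(schedulers, min_detection_rate):
    """Among schedulers meeting the detection-rate threshold,
    pick the one scheduling the fewest tasks."""
    eligible = {name: (r, t) for name, (r, t) in schedulers.items()
                if r >= min_detection_rate}
    if not eligible:
        return None  # no scheduler meets the bar
    return min(eligible, key=lambda name: eligible[name][1])

# hypothetical shadow schedulers: (detection rate, avg tasks scheduled)
shadow = {
    "sched_a": (0.96, 300),
    "sched_b": (0.97, 250),
    "sched_c": (0.90, 150),
}
print(pick_scheduler(shadow, 0.95))  # sched_b: cheapest scheduler above 95%
```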