racetimeGG / racetime-app

Application code for racetime.gg
https://racetime.gg
GNU General Public License v3.0

DNF Score Adjustment Inconsistency? #110

Open Tonychen0227 opened 3 years ago

Tonychen0227 commented 3 years ago

Hi team,

In this Pokemon Platinum race, we noticed a strange inconsistency. All three DQ'd participants (3gotystical, Cyrun2, and majorkatsuragi) did not have a race score going into the race, but had different deductions: -71, -72, and -74. We also noticed that in a different race for Pokemon Heartgold/Soulsilver, participant topheeee did not complete the race, but ended with a positive score change.

I notice the following on the Leaderboards page:

Note: leaderboards are currently in beta. The ranking algorithm is, well, a bit of a potato. A more well thought-out calculation system will be implemented later, but you can get a feel for general race performance with the scoring below.

In general, we've noticed some inconsistencies in how a DNF interacts with the resulting ranking score change, and we were wondering whether these are intentional and/or a known bug.

deains commented 3 years ago

All three DQ'd participants (3gotystical, Cyrun2, and majorkatsuragi) did not have a race score going into the race, but had different deductions

I'm not sure why this happens, but it's something to do with the TrueSkill algorithm seemingly not handling ties correctly. Take this simple example:

from trueskill import Rating, global_env
e = global_env()
e.rate([(Rating(),), (Rating(),), (Rating(),)], ranks=[1, 2, 2])

This simulates 3 brand-new entrants, with 1 finisher and 2 forfeits (joint forfeits are considered ties). The result is:

[(trueskill.Rating(mu=30.109, sigma=6.735),),
 (trueskill.Rating(mu=22.443, sigma=5.972),),
 (trueskill.Rating(mu=22.448, sigma=5.974),)]

As you can see, the scores for the two entrants who tied are slightly different. This difference is quite small and typically vanishes over time, so eventually it doesn't matter, though it's puzzling to me why it exists in the first place.
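One artificial way to poke at the "vanishes over time" claim is below. It's only a sketch, and real races obviously aren't repeated head-to-head draws; it just keeps rating the two tied entrants against each other so you can watch how far apart their mu values stay:

from trueskill import Rating, global_env

e = global_env()
# Start from the two nearly-identical post-race ratings quoted above
a = Rating(mu=22.443, sigma=5.972)
b = Rating(mu=22.448, sigma=5.974)

# Repeated head-to-head draws; print the gap between the two mu values
for i in range(5):
    (a,), (b,) = e.rate([(a,), (b,)], ranks=[1, 1])
    print(i + 1, abs(a.mu - b.mu))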

We also noticed that in a different race for Pokemon Heartgold/Soulsilver, participant topheeee did not complete the race, but ended with a positive score change.

DNF can sometimes lead to a positive rating change due to a change in a user's confidence value. Behind the scenes, a user's rating depends on two values: their raw score (mu) and a confidence value (sigma) describing how uncertain that score is; the user's real skill should fall within mu ± 2 × sigma with roughly 95% probability. A user with few races under their belt will have a higher confidence value (meaning the raw score is less accurate) than a user who's done a lot of races.

The front-facing rating is calculated as [ 100 x ( score - ( 2 x confidence ) ) ] (square brackets indicate rounding).

Thus if a user DNFs, their score will go down, but their confidence value also decreases (more races means more certainty about the raw score), so their final rating may go up slightly. This isn't a bug, just a consequence of the probability formulae being used behind the scenes. 🙂
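To make that concrete, here's a minimal sketch that applies the display formula above to the 3-entrant example from my earlier comment (library defaults for every rating, so the numbers won't match any real race): both forfeiting newcomers start at a displayed 833 and still end up higher, because twice the drop in sigma outweighs the drop in mu.

from trueskill import Rating, global_env

def display_rating(r):
    # Front-facing rating: round(100 * (mu - 2 * sigma))
    return round(100 * (r.mu - 2 * r.sigma))

e = global_env()
entrants = [(Rating(),), (Rating(),), (Rating(),)]
print([display_rating(r) for (r,) in entrants])   # all three start at 833

# One finisher, two forfeits tied for last (same call as before)
rated = e.rate(entrants, ranks=[1, 2, 2])
print([display_rating(r) for (r,) in rated])      # the two forfeiters end up above 833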

zoeyprobably commented 3 years ago

Is there a reason why you guys don't utilize trueskill's weight parameter for ratings on forfeits, instead of considering them all a tie for last place?

zoeyprobably commented 3 years ago

Also, how did you come up with your [ 100 x ( score - ( 2 x confidence ) ) ] formula?

deains commented 3 years ago

Is there a reason why you guys don't utilize trueskill's weight parameter for ratings on forfeits, instead of considering them all a tie for last place?

Mainly 'cos I threw this together in a single afternoon. Also, I think lowering the weight of a forfeit would in fact reduce the score impact, which seems like a bad fit.
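For what it's worth, the weights parameter ("partial play") does exist on rate() in the Python trueskill library, so it's easy to experiment with. A rough sketch is below; the 0.5 weight for the forfeiters is an arbitrary number for illustration, not anything racetime.gg uses. Comparing the two outputs shows how the forfeiters' updates change:

from trueskill import Rating, global_env

e = global_env()
groups = [(Rating(),), (Rating(),), (Rating(),)]
ranks = [1, 2, 2]  # one finisher, two forfeits tied for last

full = e.rate(groups, ranks=ranks)
# Partial play: weight 1.0 for the finisher, 0.5 for each forfeiter
partial = e.rate(groups, ranks=ranks, weights=[(1.0,), (0.5,), (0.5,)])

for (f,), (p,) in zip(full, partial):
    print(f"full: mu={f.mu:.3f} sigma={f.sigma:.3f}   "
          f"partial: mu={p.mu:.3f} sigma={p.sigma:.3f}")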

Also, how did you come up with your [ 100 x ( score - ( 2 x confidence ) ) ] formula?

Because:

A real skill of player is between μ±2σ with 95% confidence.

Therefore (score - (2 x confidence)) is the lowest score that still fits within the 95% confidence interval. It is (hopefully) a sensible lower bound for the user's actual skill. Previously we only took the raw score and ignored the confidence value, but that led to outliers shooting to the top of the leaderboard based on a single win.
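As a quick illustration of why the lower bound matters (the mu/sigma numbers here are made up, not taken from any real racetime.gg user): sorting by raw mu alone would put the uncertain one-race player on top, while the conservative score does not.

from trueskill import Rating

def leaderboard_score(r):
    # round(100 * (mu - 2 * sigma)): lower end of the ~95% interval
    return round(100 * (r.mu - 2 * r.sigma))

# Hypothetical ratings, for illustration only
one_win_wonder = Rating(mu=33.0, sigma=7.0)  # one strong result, still very uncertain
grinder = Rating(mu=28.0, sigma=1.5)         # lots of races, low uncertainty

print(leaderboard_score(one_win_wonder))  # 1900
print(leaderboard_score(grinder))         # 2500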