Discussion: Proposal to use Glicko2 algorithm with rating periods rather than TrueSkill

This has come up a few times, but I thought I'd put down a strawman proposal for a new ranking algorithm using Glicko2. First I think it's useful to cover my understanding of the existing algorithms and the state they store:

Elo: This uses a single variable to store each player's skill, and updates it based on the outcome of a single game.
TrueSkill: This is similar to Elo, but uses an extra variable to store the confidence (or uncertainty) of each player's skill. Skill levels with a high confidence change less than those with a low confidence. In addition TrueSkill copes with multiplayer games well (although this isn't too relevant as the ladder is just for 1v1s at the moment). For the ladder we rank players by skill - 3 * uncertainty which is useful to ensure players need to play a good number of games before they can reach the top.
Glicko2: This stores the skill, uncertainty and adds a third variable, the volatility of the uncertainty. Glicko2 allows the uncertainty to increase as well as decrease depending on how recently the player has been active. The uncertainty of players with a low volatility changes less than those with a high volatility. Glicko2 also adds the concept of rating periods, and all games within a period are considered independent of ordering (the assumption is that a player is roughly the same quality at the start and end of a week, although their ability may increase/decrease from e.g. month to month).

I think that the current implementation of Glicko2 which we have in the ladder suffers from a couple of problems:

Players are ranked by skill rather than skill - 3 * uncertainty. This causes the middle of the table to be filled by players with a single game. This is easily fixed though.
Rating periods are not correctly implemented. The ratings are updated after every game, but the uncertainty is never allowed to decrease (making the implementation very similar to the TrueSkill implementation). This means that inactive players are allowed to stay at the top indefinitely. In addition we are not benefiting from the order-independence within a rating period.

I initially thought that by introducing rating periods then it would be simple to game the system by only playing bad players. For example a player at the top of the table might be able to keep their uncertainty down by playing players at the bottom of the table. Having simulated this I'm happy that this is not the case. Games where the skill difference is high (and with the expected result) have a much smaller effect on uncertainty than when the skill levels are about equal.

This all said, my strawman proposal is to use Glicko2 with 1 week rating periods, initial skill of 1500, initial RD (uncertainty) of 350 and initial volatility of 0.3. There's also an internal constant tau and I've been using 0.5 for this. I propose ranking players based on skill - 3 * uncertainty as we do at the moment.

+Notes+

Note that the rating period should not be confused with a reset period. Personally I think the reset period would become redundant with my proposal as inactive players fall from the top ten within a few weeks. If we want to display two ladders still, then maybe an equivalent of the "all time" ladder would be to rank players based purely on their estimated skill (rather than skill - 3*uncertainty).

The updates to rankings would happen at the end of each week, although there's nothing to prevent us calculating this after every game (using the values from the start of the week) to show what the ladder "currently" looks like.

Using ranking periods makes it hard to give a "per game" delta of rating points. We could calculate the ladder rankings using the results excluding and including each game to find a delta though. If we displayed this then we might also want to show the end of week changes (particularly for inactive players), since their increase in uncertainty is unrelated to any particular game.

I don't think it would be unreasonable to keep the results of games for "all time." We wouldn't be able to keep all replays, but these become less useful when new releases of OpenRA are made. I think it wouldn't be prohibitive to keep the end-of-period rankings too, in which case we'd be able to show each players highest position, and how long they kept it for. Perhaps for a fun alternative to the two-monthly ladder, players could see who could hold the number 1 spot for the longest.

Using the all time data from the ladder (up to today) then my proposal gives the following top ten players:

Player          Rating  Played  Won     Skill   Uncertainty*3
morkel          1650    208     161     1960    310
i like men      1641    55      53      2088    447
sup             1561    95      74      2101    540
Mr Cloudy       1509    183     122     1731    223
Death-Sentence  1468    10      9       1863    395
donb            1384    81      51      1558    174
Dassault        1376    14      11      1691    314
goat            1358    21      20      1895    537
dang_shot       1356    60      32      1560    204
Goremented      1310    474     223     1473    163

This compares with the current TrueSkill ladder which looks like:

Player          Rating  Played  Won     Skill   Uncertainty*3
sup             3545    95      74      3908    363
Kav             3491    208     183     3829    337
i like men      3295    55      53      3856    561
Clockwork       3272    82      61      3707    436
Punsho          3071    79      47      3435    363
morkel          2930    208     161     3217    287
goat            2690    21      20      3414    724
Mr Cloudy       2608    183     122     2875    267
Pinkthoth       2445    168     119     2709    264
WhoCares        2426    187     106     2686    259

For comparison Glicko2 puts Kav, Clockwork and Punsho at positions 12, 28 and 29 respectively:

Player          Rating  Played  Won     Skill   Uncertainty*3
Kav             1292    208     183     1903    611
Clockwork       1099    82      61      1787    688
Punsho          1085    79      47      1742    657

Their skill is high enough that if they start playing (and winning) against top tier players then they'd soon be back in the top ten.

So far with these settings then four players have topped the ladder (at the end of a week). By week these are:

Date           Player          Rating  Played  Won     Skill   Uncertainty*3
2020-12-20     Kav             1717    59      54      1916    199
2020-12-27     Kav             1728    99      88      1931    203
2021-01-03     Kav             1770    155     136     1947    177
2021-01-10     Kav             1692    166     147     1982    290
2021-01-17     Kav             1621    168     149     1999    378
2021-01-24     Kav             1535    168     149     1999    463
2021-01-31     morkel          1593    157     113     1803    210
2021-02-07     morkel          1590    162     118     1833    243
2021-02-14     morkel          1572    164     120     1850    278
2021-02-21     morkel          1531    164     120     1850    319
2021-02-28     i like men      1567    40      38      1964    397
2021-03-07     i like men      1537    40      38      1964    427
2021-03-14     morkel          1575    173     128     1893    318
2021-03-21     sup             1647    90      71      2051    403
2021-03-28     sup             1750    95      74      2101    351
2021-04-04     sup             1705    95      74      2101    396
2021-04-11     sup             1665    95      74      2101    437
2021-04-18     sup             1628    95      74      2101    474
2021-04-25     sup             1593    95      74      2101    508
2021-05-02     morkel          1650    208     161     1960    310

Hi,

Very nice writeup and I think it's cool that you experimented with the Glicko2 implementation. You are right about the drawbacks and notes, and I also thought about some of these issues during implementation, though I was not very good at writing them down.

Probably the most noticeable difference between glicko2 and TrueSkill would be the rating period. When I worked on Glicko2, I had the impression people really liked the instantaneous change in rating after a game ("player X beat Y and got Z points from it!") and if we used "regular" Glicko2 rating, we wouldn't get this instant feedback after every game. Consequently, I decided to update the ratings after every game. I agree with you it's a drawback though and would like to see rating periods instead, but this also has some impacts on the ladder apart from the rating system used:

Update the ratings at the end of the week. Currently, the ratings are updated on a per-game basis, so we'd have to change the code somewhat to support this.
Get rid of the "per game" delta of rating points; or at least update it with an estimated delta, e.g. the way you proposed it

Considering it was my first PR to the ladder, and people generally thought it was nice to get instant feedback on the games, I tried to make the Glicko2 implementation be somewhat similar to TrueSkill in some aspects (instant feedback), and otherwise different (periodic calculation of RD and volatility). Implementing 1 and 2 is certainly possible, but we'd probably need the blessing of @ubitux I guess, since these are changes that affect not only the rating code, but also other parts.

While we're at it, I think the following things would also be nice:

Highlight the RD on the graph, either as vertical bars or as a "shadow" (something like this I found on google images):
Hide inactive players. Somewhat same effect as you wrote (players rating fall when they are inactive due to a rise in RD) but here we simply hide the players until they start playing again because the rating of an inactive player is highly uncertain in the first place, there's little point in displaying it (See: "People are vanishing from your lists": http://aligulac.com/about/faq/)
Set an upper limit on the rating difference to avoid farming. (See: "Rating gap cap": http://aligulac.com/about/faq/)

The rankings you have with the parameters you suggested looks pretty nice. I think your suggestion of Glicko2 would be an improvement compared to the TrueSkill (and the current Glicko2) that we have today. Input from @ubitux and other competitive players would also be nice though.

Three other points:

The original paper suggests a rating period so there's an average of 10 - 15 games per player. I think 1 week is sufficient for some active players but insufficient for most other players, if we let 10-15 games per week be the only criteria. I am not sure what the impact is if a player plays 1 game on average per week?
I don't think "original" Glicko 2 displays skill - 3 * uncertainty; they just display skill (and optionally use the RD as a "band" around the estimated skill). Though I might be wrong since it's been a while I read the paper and I didn't read through it properly before commenting..
Per-game rating delta maybe is not so important now? I guess the initial ladder excitement emphasized it alot more, but perhaps players would prefer a more accurate system now, even at the cost of not getting instant updates on a per-game basis.

Thanks for the detailed reply!

Update the ratings at the end of the week. Currently, the ratings are updated on a per-game basis, so we'd have to change the code somewhat to support this.

Yes - I think for this the main change is to store the rating variables for the end of the last period and the provisional rating variables for now. In the simple approach then the variables are only updated after each game, which means the "last period" variables are updated after the first game of the next period (i.e. slightly late). An advantage of this is that it would be easy to display the previous period's ranking.

Highlight the RD on the graph...

I like this suggestion, but I'm proposing that the main value shown on the graph is skill - 3*RD. We could also show an error bar upwards with distance 3*RD. (I wouldn't be against showing it up 6*RD instead - to show the upper bound on the skill level confidence interval)

Hide inactive players...

I made a very minor change (fix?) to the Glicko2 algorithm to cap the RD at the initial RD. This was the case in the Glicko paper, but not in the Glicko2 paper. I think it's an oversight (or I'm doing something wrong), because otherwise the RD of an inactive player can exceed the RD of an unknown player. Assuming we continue to cap the RD then we could hide players who are at the maximum RD which is currently only five players.

Set an upper limit on the rating difference to avoid farming. (See: "Rating gap cap": http://aligulac.com/about/faq/)

I did try to simulate this, by having Clockwork beat me ten times a week. His RD did not improve significantly. The link you reference seems to suggest that a rating gap cap is a bad idea (maybe I'm reading it wrong?)

The original paper suggests a rating period so there's an average of 10 - 15 games per player. I think 1 week is sufficient for some active players but insufficient for most other players...

I agree that it's probably not sufficient for 10-15 games per player, but my simulations seemed to show it worked ok with the current number. Even if we picked a longer rating period then some players will be playing far more than 10-15 games and others far fewer. I'm not wedded to the 1 week period, but it seemed a good compromise that would allow most players to be active while still showing regular rating changes.

I don't think "original" Glicko 2 displays skill - 3 * uncertainty; they just display skill (and optionally use the RD as a "band" around the estimated skill).

You're remembering correctly. The paper talks about ±2*RD being used for the 95% confidence interval, so skill - 3*RD would be the lower bound on some confidence interval (I think 99.7%). I really like using the lower bound on a confidence interval for ranking because it rewards players who prove themselves, and helps avoids "lucky" players being above consistent players who have similar skill levels.

The link you reference seems to suggest that a rating gap cap is a bad idea (maybe I'm reading it wrong?)

Oh, it seems you are right on this! Yeah, perhaps we should not have the rating limit, then. Also nice to hear you already experimented with it and saw that Clockwork didn't gain much from farming.

I made a very minor change (fix?) to the Glicko2 algorithm to cap the RD at the initial RD.

Yes this makes sense! Thought I also checked for this in the code, but apparently not :stuck_out_tongue:

I really like using the lower bound on a confidence interval for ranking because it rewards players who prove themselves, and helps avoids "lucky" players being above consistent players who have similar skill levels.

Also a nice way of thinking about it. In any case, if we decide to include the error bars it should be easy to test out both displayed_rating = rating, error = [-3RD, +3RD] and displayed_rating = rating - 3 RD, error = [0, +6RD] and see which makes sense. If we choose to display rating - 3 RD then it would probably be helpful to add some explanation in the ladder FAQ on the rationale for using it.

Ok, so I guess the change could be something like this (just trying to make a quick summary of what the changes would be):

Assuming...

Rating periods are updated every Sunday

Then...

Leaderboard only displays the most recent rating period (e.g. on a Saturday the rating of a player is the same as they had on Monday)
Latest game display a change in rating for each player, but they are now estimated. That is, it is not possible to get the new rating simply by summing up the individual deltas for every game a player played.
On the leaderboard, the uncertainty is also displayed next to the rating
[Optional] a checkbox to include all players, even those that have been inactive for (say) 4 months, or their RD reached maximum limit (say 350).
Remove "This period" and only keep an "All time" rating (as mentioned above, periods are redundant with Glicko-2)
The displayed rating is rating - 3 * RD
The displayed uncertainty is [0, 6 * RD]
Players with maximum rating (inactive players) are by default not displayed

Does this sound OK? And would it be acceptable for @ubitux ? I guess we'd need help on the display part from Ubitux (at least I have no idea how to do that)

I promised graphs, so here are a few.

@kvalv I think your summary sounds good. One potential change:

Leaderboard only displays the most recent rating period (e.g. on a Saturday the rating of a player is the same as they had on Monday)

I think for this we could display the current rating period with ratings as if the current period has ended (the whole period would need recalculating after every match, but it's not particularly intensive). It would be good to be able to display the previous rating period too (and maybe historical rating periods if we can design a suitable UI).

Yes, I guess that's possible :+1:

Then we would also have to display this "next rating" on the player profile?

Example (sorry for the mess):

|--- week 1 ----|----- week 2 ---- | 
|<games...>     | <games...>       |
                ^      ^
    rating=1000 /       \---TODAY, rating=1200

After week 1 I get a rating of 1000. We are in middle of week 2, and my current "next rating" is 1200. Because the leaderboard says 1200, it also makes sense that it says 1200 in my profile. So...

Should it say 1200 or 1000? The current "correct estimate" rating is 1000, and 1200 is only an estimate of the next estimate. So I would say 1000 is correct, but 1200 fits well with what the leaderboard says :stuck_out_tongue:

ubitux / oraladder

Discussion: Proposal to use Glicko2 algorithm with rating periods rather than TrueSkill #4