tttppp opened 3 years ago
Using the all-time data from the ladder (up to today), my proposal gives the following top ten players:
| Player | Rating | Played | Won | Skill | Uncertainty*3 |
|---|---|---|---|---|---|
| morkel | 1650 | 208 | 161 | 1960 | 310 |
| i like men | 1641 | 55 | 53 | 2088 | 447 |
| sup | 1561 | 95 | 74 | 2101 | 540 |
| Mr Cloudy | 1509 | 183 | 122 | 1731 | 223 |
| Death-Sentence | 1468 | 10 | 9 | 1863 | 395 |
| donb | 1384 | 81 | 51 | 1558 | 174 |
| Dassault | 1376 | 14 | 11 | 1691 | 314 |
| goat | 1358 | 21 | 20 | 1895 | 537 |
| dang_shot | 1356 | 60 | 32 | 1560 | 204 |
| Goremented | 1310 | 474 | 223 | 1473 | 163 |
This compares with the current TrueSkill ladder, which looks like this:
| Player | Rating | Played | Won | Skill | Uncertainty*3 |
|---|---|---|---|---|---|
| sup | 3545 | 95 | 74 | 3908 | 363 |
| Kav | 3491 | 208 | 183 | 3829 | 337 |
| i like men | 3295 | 55 | 53 | 3856 | 561 |
| Clockwork | 3272 | 82 | 61 | 3707 | 436 |
| Punsho | 3071 | 79 | 47 | 3435 | 363 |
| morkel | 2930 | 208 | 161 | 3217 | 287 |
| goat | 2690 | 21 | 20 | 3414 | 724 |
| Mr Cloudy | 2608 | 183 | 122 | 2875 | 267 |
| Pinkthoth | 2445 | 168 | 119 | 2709 | 264 |
| WhoCares | 2426 | 187 | 106 | 2686 | 259 |
For comparison, the proposed Glicko2 ladder puts Kav, Clockwork and Punsho at positions 12, 28 and 29 respectively:
| Player | Rating | Played | Won | Skill | Uncertainty*3 |
|---|---|---|---|---|---|
| Kav | 1292 | 208 | 183 | 1903 | 611 |
| Clockwork | 1099 | 82 | 61 | 1787 | 688 |
| Punsho | 1085 | 79 | 47 | 1742 | 657 |
Their skill is high enough that if they start playing (and winning) against top tier players then they'd soon be back in the top ten.
So far with these settings, four players have topped the ladder (at the end of a week). By week these are:
| Date | Player | Rating | Played | Won | Skill | Uncertainty*3 |
|---|---|---|---|---|---|---|
| 2020-12-20 | Kav | 1717 | 59 | 54 | 1916 | 199 |
| 2020-12-27 | Kav | 1728 | 99 | 88 | 1931 | 203 |
| 2021-01-03 | Kav | 1770 | 155 | 136 | 1947 | 177 |
| 2021-01-10 | Kav | 1692 | 166 | 147 | 1982 | 290 |
| 2021-01-17 | Kav | 1621 | 168 | 149 | 1999 | 378 |
| 2021-01-24 | Kav | 1535 | 168 | 149 | 1999 | 463 |
| 2021-01-31 | morkel | 1593 | 157 | 113 | 1803 | 210 |
| 2021-02-07 | morkel | 1590 | 162 | 118 | 1833 | 243 |
| 2021-02-14 | morkel | 1572 | 164 | 120 | 1850 | 278 |
| 2021-02-21 | morkel | 1531 | 164 | 120 | 1850 | 319 |
| 2021-02-28 | i like men | 1567 | 40 | 38 | 1964 | 397 |
| 2021-03-07 | i like men | 1537 | 40 | 38 | 1964 | 427 |
| 2021-03-14 | morkel | 1575 | 173 | 128 | 1893 | 318 |
| 2021-03-21 | sup | 1647 | 90 | 71 | 2051 | 403 |
| 2021-03-28 | sup | 1750 | 95 | 74 | 2101 | 351 |
| 2021-04-04 | sup | 1705 | 95 | 74 | 2101 | 396 |
| 2021-04-11 | sup | 1665 | 95 | 74 | 2101 | 437 |
| 2021-04-18 | sup | 1628 | 95 | 74 | 2101 | 474 |
| 2021-04-25 | sup | 1593 | 95 | 74 | 2101 | 508 |
| 2021-05-02 | morkel | 1650 | 208 | 161 | 1960 | 310 |
Hi,
Very nice writeup, and I think it's cool that you experimented with the Glicko2 implementation. You are right about the drawbacks and notes; I also thought about some of these issues during implementation, though I was not very good at writing them down.
Probably the most noticeable difference between Glicko2 and TrueSkill would be the rating period. When I worked on Glicko2, I had the impression people really liked the instantaneous change in rating after a game ("player X beat Y and got Z points from it!"), and if we used "regular" Glicko2 rating periods we wouldn't get this instant feedback after every game. Consequently, I decided to update the ratings after every game. I agree with you that it's a drawback, though, and would like to see rating periods instead, but this also has some impacts on the ladder apart from the rating system used:

1. Update the ratings at the end of the week. Currently, the ratings are updated on a per-game basis, so we'd have to change the code somewhat to support this.
Considering it was my first PR to the ladder, and people generally thought it was nice to get instant feedback on the games, I tried to make the Glicko2 implementation somewhat similar to TrueSkill in some aspects (instant feedback), and otherwise different (periodic calculation of RD and volatility). Implementing 1 and 2 is certainly possible, but we'd probably need the blessing of @ubitux I guess, since these are changes that affect not only the rating code but also other parts.
While we're at it, I think the following things would also be nice:

- Highlight the RD on the graph...
- Hide inactive players...
- Set an upper limit on the rating difference to avoid farming. (See: "Rating gap cap": http://aligulac.com/about/faq/)
The rankings you get with the parameters you suggested look pretty nice. I think your suggestion of Glicko2 would be an improvement compared to the TrueSkill (and the current Glicko2) that we have today. Input from @ubitux and other competitive players would also be nice, though.
Three other points:

- The original paper suggests a rating period so there's an average of 10 - 15 games per player. I think 1 week is sufficient for some active players but insufficient for most other players...
- I don't think "original" Glicko 2 displays `skill - 3 * uncertainty`; they just display `skill` (and optionally use the RD as a "band" around the estimated skill). Though I might be wrong since it's been a while since I read the paper and I didn't read through it properly before commenting..

Thanks for the detailed reply!
> Update the ratings at the end of the week. Currently, the ratings are updated on a per-game basis, so we'd have to change the code somewhat to support this.
Yes - I think the main change for this is to store the rating variables for the end of the last period plus the provisional rating variables for now. In the simple approach, the variables are only updated after each game, which means the "last period" variables are updated after the first game of the next period (i.e. slightly late). An advantage of this is that it would be easy to display the previous period's ranking.
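A minimal sketch of that storage, assuming a hypothetical `PlayerRating` record (the field names are mine, not the ladder's actual schema):

```python
from dataclasses import dataclass

@dataclass
class PlayerRating:
    # Committed values: the state at the end of the last completed period.
    rating: float = 1500.0
    rd: float = 350.0
    volatility: float = 0.3
    # Provisional values: recomputed from the committed state plus the
    # games played so far in the current, unfinished period.
    prov_rating: float = 1500.0
    prov_rd: float = 350.0
    prov_volatility: float = 0.3

    def commit_period(self) -> None:
        """At the end of a rating period the provisional values become the
        committed ones; the old committed values could be archived first
        if we want to display the previous period's ranking."""
        self.rating = self.prov_rating
        self.rd = self.prov_rd
        self.volatility = self.prov_volatility
```

Since the provisional set is derived data, it can always be thrown away and recomputed from the committed state plus the current period's games.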
> Highlight the RD on the graph...

I like this suggestion, but I'm proposing that the main value shown on the graph is `skill - 3*RD`. We could also show an error bar upwards with distance `3*RD`. (I wouldn't be against showing it up `6*RD` instead, to show the upper bound on the skill level confidence interval.)
> Hide inactive players...

I made a very minor change (fix?) to the Glicko2 algorithm to cap the RD at the initial RD. This was the case in the Glicko paper, but not in the Glicko2 paper. I think it's an oversight (or I'm doing something wrong), because otherwise the RD of an inactive player can exceed the RD of an unknown player. Assuming we continue to cap the RD, we could hide players who are at the maximum RD (currently only five players).
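For illustration, the cap amounts to one extra `min()` in the between-period RD update from step 6 of the Glicko2 paper (a sketch; the scale constant converts between the Glicko and Glicko2 scales):

```python
import math

INITIAL_RD = 350.0
GLICKO2_SCALE = 173.7178  # conversion factor between Glicko and Glicko2 scales

def rd_after_idle_period(rd: float, volatility: float) -> float:
    """RD growth for a player who played no games in the rating period.

    Glicko2 step 6: phi' = sqrt(phi^2 + sigma^2), i.e. uncertainty grows
    with volatility each period. Capping at INITIAL_RD keeps an inactive
    player from ever becoming *more* uncertain than a brand-new player."""
    phi = rd / GLICKO2_SCALE
    phi_prime = math.sqrt(phi * phi + volatility * volatility)
    return min(phi_prime * GLICKO2_SCALE, INITIAL_RD)
```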
> Set an upper limit on the rating difference to avoid farming. (See: "Rating gap cap": http://aligulac.com/about/faq/)

I did try to simulate this by having Clockwork beat me ten times a week; his RD did not improve significantly. The link you reference seems to suggest that a rating gap cap is a bad idea (maybe I'm reading it wrong?)
> The original paper suggests a rating period so there's an average of 10 - 15 games per player. I think 1 week is sufficient for some active players but insufficient for most other players...

I agree that it's probably not sufficient for 10-15 games per player, but my simulations seemed to show it worked OK with the current number of games. Even if we picked a longer rating period, some players would be playing far more than 10-15 games and others far fewer. I'm not wedded to the 1-week period, but it seemed a good compromise that would allow most players to be active while still showing regular rating changes.
> I don't think "original" Glicko 2 displays `skill - 3 * uncertainty`; they just display `skill` (and optionally use the RD as a "band" around the estimated skill).

You're remembering correctly. The paper talks about `±2*RD` being used for the 95% confidence interval, so `skill - 3*RD` would be the lower bound on some confidence interval (I think 99.7%). I really like using the lower bound on a confidence interval for ranking because it rewards players who prove themselves, and helps avoid "lucky" players being above consistent players who have similar skill levels.
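Concretely, the ranking key would be something like this sketch (the tables above already show this value: their Rating column is Skill minus the Uncertainty*3 column):

```python
def conservative_rating(skill: float, rd: float) -> float:
    # Lower bound of a ~99.7% confidence interval on skill: a player only
    # climbs the ladder once play has shrunk their RD, so one lucky run
    # over a handful of games cannot outrank a proven, consistent player.
    return skill - 3 * rd

# Sorting a list of (name, skill, rd) tuples, best first:
def rank(players):
    return sorted(players, key=lambda p: conservative_rating(p[1], p[2]),
                  reverse=True)
```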
> The link you reference seems to suggest that a rating gap cap is a bad idea (maybe I'm reading it wrong?)

Oh, it seems you are right on this! Yeah, perhaps we should not have the rating limit, then. Also nice to hear you already experimented with it and saw that Clockwork didn't gain much from farming.
> I made a very minor change (fix?) to the Glicko2 algorithm to cap the RD at the initial RD.

Yes, this makes sense! I thought I also checked for this in the code, but apparently not :stuck_out_tongue:
> I really like using the lower bound on a confidence interval for ranking because it rewards players who prove themselves, and helps avoid "lucky" players being above consistent players who have similar skill levels.

Also a nice way of thinking about it. In any case, if we decide to include the error bars it should be easy to test out both `displayed_rating = rating, error = [-3RD, +3RD]` and `displayed_rating = rating - 3RD, error = [0, +6RD]` and see which makes sense. If we choose to display `rating - 3RD` then it would probably be helpful to add some explanation in the ladder FAQ on the rationale for using it.
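A quick matplotlib sketch of the two display options (the numbers are made up for illustration):

```python
import matplotlib.pyplot as plt

names = ["A", "B", "C"]
skills = [1960, 2101, 1731]   # estimated skill per player (illustrative)
rds = [103, 180, 74]          # rating deviation per player (illustrative)
xs = range(len(names))

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)

# Option 1: displayed_rating = rating, error = [-3RD, +3RD]
ax1.errorbar(xs, skills, yerr=[3 * rd for rd in rds], fmt="o")
ax1.set_title("rating, +/- 3RD")

# Option 2: displayed_rating = rating - 3RD, error = [0, +6RD]
# (the same interval, drawn from its lower bound upwards)
ax2.errorbar(xs, [s - 3 * rd for s, rd in zip(skills, rds)],
             yerr=[[0] * len(rds), [6 * rd for rd in rds]], fmt="o")
ax2.set_title("rating - 3RD, [0, +6RD]")

plt.show()
```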
Ok, so I guess the change could be something like this (just trying to make a quick summary of what the changes would be):

Assuming...

Then...

- Leaderboard only displays the most recent rating period (e.g. on a Saturday the rating of a player is the same as they had on Monday)
- The displayed rating is `rating - 3 * RD`, with error bars `[0, 6 * RD]`

Does this sound OK? And would it be acceptable for @ubitux? I guess we'd need help on the display part from ubitux (at least I have no idea how to do that)
I promised graphs, so here are a few.
@kvalv I think your summary sounds good. One potential change:
> Leaderboard only displays the most recent rating period (e.g. on a Saturday the rating of a player is the same as they had on Monday)
I think for this we could display the current rating period with ratings as if the current period had already ended (the whole period would need recalculating after every match, but it's not particularly intensive). It would be good to be able to display the previous rating period too (and maybe historical rating periods, if we can design a suitable UI).
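A sketch of that recalculation, assuming the current period's results are stored and a hypothetical `apply_rating_period(state, results)` helper that runs one whole Glicko2 rating period (both names are illustrative):

```python
def recompute_current_period(committed, games_so_far, apply_rating_period):
    """Recompute everyone's provisional rating after each new match.

    `committed` maps player name -> rating state at the end of the last
    completed period; `games_so_far` is the list of results in the current
    period. Re-running the whole period is cheap, and it matches Glicko2's
    view that all games within one period happen "simultaneously"."""
    provisional = {}
    for name, state in committed.items():
        results = [g for g in games_so_far
                   if name in (g["winner"], g["loser"])]
        provisional[name] = apply_rating_period(state, results)
    return provisional
```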
Yes, I guess that's possible :+1:
Then we would also have to display this "next rating" on the player profile?
Example (sorry for the mess):
```
|--- week 1 ----|----- week 2 ----|
|  <games...>   |   <games...>    |
                ^          ^
      rating=1000          TODAY, rating=1200
```
After week 1 I get a rating of 1000. We are in the middle of week 2, and my current "next rating" is 1200. Because the leaderboard says 1200, it also makes sense that it says 1200 in my profile. So...
Should it say 1200 or 1000? The current "correct estimate" rating is 1000, and 1200 is only an estimate of the next estimate. So I would say 1000 is correct, but 1200 fits well with what the leaderboard says :stuck_out_tongue:
This has come up a few times, but I thought I'd put down a strawman proposal for a new ranking algorithm using Glicko2. First I think it's useful to cover my understanding of the existing algorithms and the state they store. The current TrueSkill ladder ranks players by `skill - 3 * uncertainty`, which is useful to ensure players need to play a good number of games before they can reach the top.

I think that the current implementation of Glicko2 which we have in the ladder suffers from a couple of problems. It ranks players by `skill` rather than `skill - 3 * uncertainty`; this causes the middle of the table to be filled by players with a single game. This is easily fixed though.

I initially thought that by introducing rating periods it would be simple to game the system by only playing bad players. For example, a player at the top of the table might be able to keep their uncertainty down by playing players at the bottom of the table. Having simulated this, I'm happy that this is not the case. Games where the skill difference is high (and with the expected result) have a much smaller effect on uncertainty than when the skill levels are about equal.
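This falls out of the Glicko2 formulas: the "information" a game carries about a player scales with E(1 - E), where E is the expected score, and that factor collapses towards zero for lopsided matchups. A small numeric check using the g() and E() definitions from the Glicko2 paper (the opponent's RD of 100 is an arbitrary choice for illustration):

```python
import math

SCALE = 173.7178  # Glicko -> Glicko2 scale conversion

def g(phi: float) -> float:
    return 1 / math.sqrt(1 + 3 * phi * phi / math.pi ** 2)

def expected_score(rating: float, opp_rating: float, opp_rd: float) -> float:
    mu = (rating - 1500) / SCALE
    mu_j = (opp_rating - 1500) / SCALE
    phi_j = opp_rd / SCALE
    return 1 / (1 + math.exp(-g(phi_j) * (mu - mu_j)))

for gap in (0, 200, 400, 800):
    e = expected_score(1500 + gap, 1500, 100)
    print(f"gap={gap:4d}  E={e:.3f}  weight E(1-E)={e * (1 - e):.3f}")
```

The RD update divides by the sum of these weights (via the estimated variance v), so a string of expected wins over much weaker players shrinks the winner's RD far less than a few games against roughly equal opponents.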
This all said, my strawman proposal is to use Glicko2 with 1-week rating periods, an initial skill of 1500, an initial RD (uncertainty) of 350 and an initial volatility of 0.3. There's also an internal constant tau, and I've been using 0.5 for this. I propose ranking players based on `skill - 3 * uncertainty`, as we do at the moment.
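For concreteness, the proposed constants as they might look in the ladder code (the constant names are mine, not existing code):

```python
from datetime import timedelta

RATING_PERIOD = timedelta(weeks=1)  # games are batched per calendar week
INITIAL_SKILL = 1500.0              # prior mean skill for a new player
INITIAL_RD = 350.0                  # prior uncertainty (rating deviation)
INITIAL_VOLATILITY = 0.3            # expected erraticness of a player's skill
TAU = 0.5                           # internal constant limiting volatility swings
```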
**Notes**

Note that the rating period should not be confused with a reset period. Personally I think the reset period would become redundant with my proposal, as inactive players fall from the top ten within a few weeks. If we still want to display two ladders, then maybe an equivalent of the "all time" ladder would be to rank players purely on their estimated skill (rather than `skill - 3*uncertainty`).

The updates to rankings would happen at the end of each week, although there's nothing to prevent us calculating this after every game (using the values from the start of the week) to show what the ladder "currently" looks like.
Using rating periods makes it hard to give a "per game" delta of rating points. We could calculate the ladder rankings using the results excluding and including each game to find a delta, though. If we displayed this, we might also want to show the end-of-week changes (particularly for inactive players), since their increase in uncertainty is unrelated to any particular game.
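One way to read that, sketched with a hypothetical `provisional_rating(player, games)` helper that applies a list of results on top of the committed period-start state:

```python
def per_game_deltas(player, games_this_period, provisional_rating):
    """For each game in the current period, report the rating change it
    causes: the provisional rating including the game minus the
    provisional rating excluding it (and all later games)."""
    deltas = []
    for i, game in enumerate(games_this_period):
        before = provisional_rating(player, games_this_period[:i])
        after = provisional_rating(player, games_this_period[:i + 1])
        deltas.append((game, after - before))
    return deltas
```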
I don't think it would be unreasonable to keep the results of games for "all time". We wouldn't be able to keep all replays, but these become less useful when new releases of OpenRA are made. I think it wouldn't be prohibitive to keep the end-of-period rankings too, in which case we'd be able to show each player's highest position and how long they kept it. Perhaps as a fun alternative to the two-monthly ladder, players could see who can hold the number 1 spot for the longest.