My main concern is how easily one can rank up by throwing matches. I might suggest looking into a ranking system that takes the rating deviation into account, such as Glicko. It tracks how accurate your current rating is and adjusts ratings accordingly. For example, if you lose to a new account while playing on an established account, your rating will barely move, since we have no idea what the real strength of the new account is.
Also, it would be interesting to have two different game types: rated and unrated. I guess this could replace the current Beginner/Casual/Serious types which aren't really being used.
> Also, it would be interesting to have two different game types: rated and unrated.
I agree that it should be possible to create tables without rating.
> I guess this could replace the current Beginner/Casual/Serious types which aren't really being used.
Then you couldn't signal that you want serious testing without rating. Hmm, I'm not sure about this. I think it would be better to simply add a new checkbox [ ] Rated and to keep the types.
What rating categories do we want to have? E.g., limited vs. constructed.
What about grouping into limited / modern / standard / anything else?
Should we take into account all game results? We may want to skip casual games, or at least give them a smaller weight, so that we can experiment with new decks.
We should add an option at table creation specifying whether the table's matches are rated or not.
Where to show the rating information? I think we want one column for each rating in the user pane, in order to sort by them separately.
I would say each rating gets a new column in the user table. It could also be reported by a new command, e.g. /Rating [Username].
If you are planning on making each player's rating public, a checkbox in the preferences to make your rating private might be nice.
emerald000, thanks for the info! I might implement Glicko if I can figure out how to do it properly. I checked it, and there's something not obvious for us, e.g., when we should update ratings (see https://www.npmjs.com/package/glicko2). Do you have any experience with it? BTW, Elo rating is also resilient to cheating to some extent, because in order to increase your rating significantly you need to win against a high-rated player.
LevelX2, adding a separate option for rating sounds good. If we implement those four ratings (limited / modern / standard / anything else), do we want to show all of them in the user table? I'm thinking we can start by showing the one or two most important ratings for simplicity. Do you know what's shown in MTGO?
fireshoes, we might want to add that feature if users are vocal about it. I think we can make use of ratings for game matching, e.g., setting a rating restriction on tables, so keeping ratings public makes things easier.
I think MTGO just shows limited/constructed/overall ratings, so I like one-upping them. ;)
> My main concern is how easily one can rank up by throwing matches.
One approach to deal with this is to decrease the rating change when multiple games with the same result are played between the same players in a short period of time.
> emerald000, thanks for the info! I might implement Glicko if I can figure out how to do it properly. I checked it, and there's something not obvious for us, e.g., when we should update ratings (see https://www.npmjs.com/package/glicko2). Do you have any experience with it? BTW, Elo rating is also resilient to cheating to some extent, because in order to increase your rating significantly you need to win against a high-rated player.
I implemented Glicko a while ago for an online multiplayer game and didn't have issues with continuous updates. If it ends up being an issue, we can use periods of a couple of hours to a day. There are a couple of parameters to play with, but they are not critical.
> My main concern is how easily one can rank up by throwing matches.
>
> One approach to deal with this is to decrease the rating change when multiple games with the same result are played between the same players in a short period of time.
It is better if the fix is inherent to the mathematical rating system rather than something you hack into it. That way there's less chance of introducing some kind of systematic bias, or of someone finding a way around it (for example, by playing with a bunch of different accounts).
Something else that would be interesting once we get ratings is matchmaking. You could join a queue and get paired against someone of similar strength. That also reduces the possibility of collusion.
Another interesting bonus of Glicko is that it encourages players to play more to increase their ratings rather than creating a new account. I'll try to explain relatively simply. (Math warning.)
Let's suppose you start at 1500 ± 500, 1500 being the rating and 500 being the standard deviation (SD) of the rating. The lower the SD, the more accurate your rating is. Your SD goes down as you play games (we have more information) and goes up slowly over time (we don't know whether you got better or worse while you weren't playing).
The hard thing is comparing two different Glicko ratings, since each is composed of two numbers. The usual way is to create a matchmaking rating (MMR) with a very conservative approach. We can create an interval [a, b] in which we are 99.7% sure the real strength of the player lies: from (rating - 3 * SD) to (rating + 3 * SD). With the starting parameters, this gives the interval [1500 - 3 * 500, 1500 + 3 * 500] = [0, 3000]. We now take a very conservative approach and use the lower bound of the interval (in this case 0). This is your MMR. It makes it over 99.7% likely that your real strength is at least your MMR.
An interesting side effect is that your MMR tends to go up when you start, even if you lose games. You may go from 1500 ± 500 [0, 3000] to 1480 ± 470 [70, 2890], which increases your MMR by 70 even though you lost the game. The related downside is explaining to users why their rating went up even though they lost.
... And that took way longer than I hoped.
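A minimal Java sketch of the conservative rating described above; the class name and constants are illustrative, not actual XMage code:

```java
// Illustrative sketch of the conservative "MMR" described above.
// Names and constants are hypothetical, not from the XMage codebase.
public final class GlickoRating {

    private static final double CONSERVATIVE_FACTOR = 3.0; // 3 SD below the mean

    private final double mean; // e.g. 1500 for a new player
    private final double sd;   // e.g. 500 in the example above

    public GlickoRating(double mean, double sd) {
        this.mean = mean;
        this.sd = sd;
    }

    // Lower bound of the ~99.7% interval [mean - 3*SD, mean + 3*SD].
    public double mmr() {
        return mean - CONSERVATIVE_FACTOR * sd;
    }

    public double upperBound() {
        return mean + CONSERVATIVE_FACTOR * sd;
    }

    @Override
    public String toString() {
        return String.format("%.0f +/- %.0f [%.0f, %.0f]", mean, sd, mmr(), upperBound());
    }
}
```

With the starting parameters, new GlickoRating(1500, 500) prints 1500 +/- 500 [0, 3000]; after the losing example above, new GlickoRating(1480, 470) prints 1480 +/- 470 [70, 2890], so the MMR still went up by 70.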
emerald000, thank you for the detailed explanation! I'm now sold on Glicko. I read the original paper and figured the implementation itself is not that complicated. Hopefully I can make a change this weekend.
I have a question that hopefully you can answer. I get the idea of MMR; the question is when we should use it.
Implementation is pretty straightforward: just a bunch of formulas you need to apply. Like I said, there are a couple of parameters to adjust, but that's not a huge issue.
When we say "this person has a rating of X", do we want to use MMR? I'm assuming not; we should use the center, because that value is the most likely true rating.
If you want to use a single number, the MMR is preferred. The center alone gives only half the information about your Glicko rating. If you want to show both values, you can show either center ± deviation or [low, high]. You can also use something like 460 (1300 ± 280). The major issue is being concise while keeping it clear to users what the numbers mean.
How can we use MMR for matching? My guess is to compute X% confidence intervals for both users and make a match if those intervals overlap. Is this correct?
For matchmaking, we have two competing goals: match quality and waiting time. If there are only two people in the queue, we want to match them eventually even if they are quite far apart in skill. A good way to do so is to create a range of "acceptable matchups" for each player; if it intersects another player's, pair them up, and over time increase each player's range. This method tries to pair people of similar skill while increasing the chance of being paired over time. You can adjust the size of the range and the speed of the increase to calibrate the balance between the two goals.
Right, I get that the center alone doesn't tell the whole story (and I like the ± notation a lot!). But if you compare two users, taking the center seems to be the fairest way to do it, because mathematically it is the most probable value. Or maybe we shouldn't try to answer that way, but instead say something like "this user is better than that user with X% probability". I think we can calculate that by integrating the normal distribution.
That said, if MMR is what is commonly shown for Glicko ratings, we should follow that. Do you know of any article about that, or maybe a list of rating systems that do it that way?
Regarding matchmaking, that's basically what I was thinking, thank you for the confirmation.
There is the issue of trying to map 2D values down onto a 1D line. Glicko implementations usually use 2σ to 3σ below the mean (I used 3σ in my examples). While you could use the mean, this changes a couple of properties, like the fact that players are encouraged to keep playing to increase their ratings. If you use the mean, players hitting an all-time high due to luck would want to stop playing to preserve their record. With 3σ, their rating would eventually decay and push them to keep playing.
You can calculate P(A > B) with integrals, but integrals over normal distributions are extremely impractical. With a bit of mathematical manipulation you can get a straightforward answer in terms of Φ (the CDF of the standard normal distribution with mean 0 and SD 1), which is slightly easier to compute but still somewhat impractical. It does give a better approximation, but I have never seen it used before.
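To spell that manipulation out: if the two players' strengths are modeled as independent normals, A ~ N(μ_A, σ_A²) and B ~ N(μ_B, σ_B²), then their difference is also normal, A − B ~ N(μ_A − μ_B, σ_A² + σ_B²), which gives:

```latex
P(A > B) = P(A - B > 0) = \Phi\left( \frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}} \right)
```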
A bunch of chess servers, CS:GO, and Guild Wars 2 use Glicko. TrueSkill (the Xbox matchmaking system) uses a variation of Glicko adapted for matches with more than two players.
That being said, I don't think there is a single best approach; it's mostly case by case. I think we should focus on making something that works, then adjust it with the feedback we get.
I think we can display the conservative rating (MMR) to players: r - 3 * RD (r = rating, RD = rating deviation, i.e. standard deviation), but for matchmaking we can use the expected rating r.
I can suggest the following simple matchmaking pairing approach:
Suppose we have some players in the matchmaking queue. Let player i have rating r_i ± RD_i. After waiting time t_i, their "opened" interval is (r_i - t_i * RD_i, r_i + t_i * RD_i). When the intervals of two players intersect, we pair them and remove them from the queue.
I have never implemented such systems, so this approach is probably far from the best. But at least the rule has no integrals. If there are other approaches, I'll be glad to read about them.
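A rough Java sketch of that pairing rule, assuming a periodic tick that widens every waiting player's interval; the class and the widening step are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the expanding-interval pairing rule above.
// The widening schedule (0.1 per tick) is an arbitrary example value.
public final class MatchmakingQueue {

    static final class Entry {
        final String player;
        final double rating;       // r_i
        final double deviation;    // RD_i
        double widthFactor = 1.0;  // grows while the player waits (the t_i above)

        Entry(String player, double rating, double deviation) {
            this.player = player;
            this.rating = rating;
            this.deviation = deviation;
        }

        double low()  { return rating - widthFactor * deviation; }
        double high() { return rating + widthFactor * deviation; }

        boolean overlaps(Entry other) {
            return this.low() <= other.high() && other.low() <= this.high();
        }
    }

    private final List<Entry> waiting = new ArrayList<>();

    public void join(String player, double rating, double deviation) {
        waiting.add(new Entry(player, rating, deviation));
    }

    // Called periodically: widen every interval, then pair and remove any
    // two players whose intervals intersect.
    public void tick() {
        for (Entry e : waiting) {
            e.widthFactor += 0.1; // widen over time so everyone eventually matches
        }
        for (int i = 0; i < waiting.size(); i++) {
            for (int j = i + 1; j < waiting.size(); j++) {
                if (waiting.get(i).overlaps(waiting.get(j))) {
                    Entry a = waiting.remove(j); // remove the higher index first
                    Entry b = waiting.remove(i);
                    startMatch(b.player, a.player);
                    i--; // entry i was removed; re-examine this index
                    break;
                }
            }
        }
    }

    private void startMatch(String p1, String p2) {
        System.out.println("Pairing " + p1 + " vs " + p2);
    }
}
```

Here widthFactor plays the role of t_i above; its growth rate controls the trade-off between match quality and waiting time described earlier.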
I've read about the conservative rating (mean - 3σ) only in the article about the TrueSkill rating system; I didn't find anything about it in the articles with the Glicko formulas.
But I agree that the conservative rating is very good in terms of motivation to play more, and the fact that many rated games use it only confirms this.
By the way, as I understand it, such games usually use the Glicko-2 system. Are we going to use Glicko or Glicko-2? From the article about Glicko-2 I didn't understand how the rating can be updated on the fly after each match, without "rating periods".
> Suppose we have some players in the matchmaking queue. Let player i have rating r_i ± RD_i. After waiting time t_i, their "opened" interval is (r_i - t_i * RD_i, r_i + t_i * RD_i). When the intervals of two players intersect, we pair them and remove them from the queue.
This is pretty much what I had in mind for matchmaking. Simple but efficient.
> From the article about Glicko-2 I didn't understand how the rating can be updated on the fly after each match, without "rating periods".
For "real-time" updates, you can set the rating period to a minute or so. Also note that we don't have to update the ratings of every single player in the DB every minute. We can update it when a player logs in, then after every game they played. This is good enough without straining the CPU.
> By the way, as I understand it, such games usually use the Glicko-2 system. Are we going to use Glicko or Glicko-2?
Glicko-2 is an improvement over Glicko-1, so we should try to use it if possible.
If everyone has an Elo rating, what about an automated queue? Like the StarCraft 2 ladder: press a button, find a match, go play.
I don't think a queue would be best; I just don't think there are enough people playing. However, if we do stick with the current system of creating a match and others joining, we should hide the players' usernames so that people don't just figure out other players' decks and metagame against them specifically. Also, with so much being added to the users online panel, it might be a good idea to move it to its own tab. And if we do implement an Elo system, having it sort by that by default would be pretty cool.
@markedagain have you been playing on the EU server in the evenings? I have no data to back this up, but there are a ton of people playing ;-) I am pretty sure a queue would go over well.
I suggest 3 game categories:
- casual, for casual play
- practice, for people who want to playtest in a competitive manner
- rated, serious games with Elo rating implemented
I also suggest implementing a queue system with automatic pairing of people with similar ratings, for two reasons:
1) convenience: it should be one button to play games (Play), not two (Create and Join)
2) protection from some rating abuses
I'm mildly interested in implementing it; it looks like fun. However, even the article about the Elo system mentions the failed MtG rating experiment. I read a bit about the Glicko system mentioned here; it seems to alleviate most of the issues with Elo, but on the other hand it's much more complex.
If I end up working on it, I'd implement a simple Elo rating first and decouple the rating calculation from the other logic, so the implementation could easily be replaced with Glicko or whatever. For example, here is a Java implementation.
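One way that decoupling could look; the interface and class below are hypothetical, not taken from the linked implementation:

```java
// Hypothetical seam between match handling and rating math: the server
// only reports results, and the rating system can be swapped out later.
public interface RatingSystem {

    /**
     * @param result 1.0 = playerA won, 0.0 = playerB won, 0.5 = draw
     * @return the updated ratings for both players, in the same order
     */
    double[] update(double ratingA, double ratingB, double result);
}

// Plain Elo as the first, simplest implementation.
final class EloRatingSystem implements RatingSystem {

    private final double kFactor;

    EloRatingSystem(double kFactor) {
        this.kFactor = kFactor;
    }

    @Override
    public double[] update(double ratingA, double ratingB, double result) {
        // Standard Elo: expected score from the 400-point logistic curve.
        double expectedA = 1.0 / (1.0 + Math.pow(10.0, (ratingB - ratingA) / 400.0));
        double newA = ratingA + kFactor * (result - expectedA);
        double newB = ratingB + kFactor * ((1.0 - result) - (1.0 - expectedA));
        return new double[]{newA, newB};
    }
}
```

Match-handling code would then depend only on RatingSystem, so swapping in a Glicko implementation later just means providing another class.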
I've started implementing rating. The Glicko rating system (the original one, not Glicko-2) is used. Currently three types of rating are calculated: general, constructed, and limited.
A Rated checkbox has been added to the New Match and New Tournament creation options. Only duels between human players are rated.
A conservative approach is used for the actual displayed rating: (rating_mean - 2 * rating_deviation). The initial rating mean is 1500 and the rating deviation is 350, so the initial displayed rating is 800. As mentioned above, this approach has interesting effects: at the start, the rating can grow even if you lose, and the displayed rating decreases over time when a user doesn't play (because the rating deviation increases over time in Glicko).
Some questions about the rating system:
- Do we need a config flag to turn rating on and off? It may not be so easy to implement, because rating will be a part of the interface, and the interface is more or less hardcoded.
- Do we need a general rating, or only separate constructed/limited ratings for now? Do we need more specific Standard, Modern, etc. ratings? (We should also think about where to display all these ratings.)

And some questions about the interface, i.e. where to show rating information:
- Players Chat Panel (the list of all players on the right)? We have very limited space here, so should we display all ratings or only a general rating?
- Tables Panel?
- Inside the match: avatar tooltip update?
- Should we add a Rated games filter? How should it work: display only rated games, or only unrated games?

Any commentaries/suggestions/questions are welcome.
> Do we need a config flag to turn rating on and off? It may not be so easy to implement, because rating will be a part of the interface, and the interface is more or less hardcoded.
Maybe we should only use games and tournaments with skill level "Serious". But a new flag would also be OK for me. I don't see a problem with the interface.
> Do we need a general rating, or only separate constructed/limited ratings for now? Do we need more specific Standard, Modern, etc. ratings? (We should also think about where to display all these ratings.)
I guess it would be OK not to start with all formats. If it works well and a solution for displaying all the ratings is found, we could add more ratings. I would say we don't need a general rating; maybe a general win/loss ratio is enough.
> Players Chat Panel (the list of all players on the right)? We have very limited space here, so should we display all ratings or only a general rating?
I would say yes, it's the simplest solution. We can simply add new columns, let the user change the column order, and, if we like, add preferences to allow the player to hide columns. The same is true for the Waiting and Tournament players panels.
> Tables Panel?
It should be enough at first to display the values in the players list.
> Inside the match: avatar tooltip update?
Sure.
> Should we add a Rated games filter? How should it work: display only rated games, or only unrated games?
If we add two little filter buttons, "Rated" and "Unrated", both are possible.
> Maybe we should only use games and tournaments with skill level "Serious". But a new flag would also be OK for me. I don't see a problem with the interface.
Sorry, I was not clear here. I meant a server config flag that would turn the rating feature on/off for the entire game. I'm not sure we need it. I added the Rated checkbox to the creation options, and I think it's better to use only this checkbox to determine whether a match is rated; that looks like the most obvious way for users.
> I guess it would be OK not to start with all formats. If it works well and a solution for displaying all the ratings is found, we could add more ratings. I would say we don't need a general rating; maybe a general win/loss ratio is enough.
OK, only Constructed and Limited ratings will be displayed. I'll keep the calculation of the general rating, but it won't be shown anywhere; we can remove it completely later.
> The same is true for the Waiting and Tournament players panels.
In the Waiting and Tournament players panels, should we display both ratings, or only the rating that corresponds to the created game type?
> If we add two little filter buttons, "Rated" and "Unrated", both are possible.
Good idea.
Thanks!
As https://github.com/magefree/mage/pull/1942 is merged, we can now close this, yes?
Let's wait a bit, I haven't looked at the implementation yet.
Is the implementation now using the existing saved games data to calculate the initial ratings?
No, in the current implementation all players start with the base rating (1500 - 2 * 350 = 800); previous games are not taken into account.
I'm not sure how "fair" it would be to calculate ratings from old game results. I think the user should know that the match he is playing is rated; it can affect how he plays the match.
But using existing games data for initial ratings also makes sense, so if you think it's a better way, it should not be hard to implement.
I strongly suggest letting people know they are playing a rated game, or else a lot of our older users will be very unhappy. I have a good friend who has a little over 600 games played, and he drops pretty often when his opponents just go AFK on him.
Ratings should be based only on rated games and should start from a base value. Not everyone wants to worry about their rating every time they play a game. What if they just want to play a troll deck for a few games? If it would tank their rating, they would never do it. Everyone should start at the initial value and go from there. The ratings will sort themselves out; that's what they're for.
And what do you think about the displayed rating formula (rating_mean - 2 * rating_deviation) with 350 as the maximum rating_deviation? At first I was thinking about (rating_mean - 3 * rating_deviation) with a maximum rating_deviation of 500, the formula suggested above. In that case the initial rating would be 1500 - 3 * 500 = 0. But then I thought that with a rating_deviation of 500 the rating changes might be too big, and decided to use the more standard 350. And then I changed the rating_deviation factor from 3 to 2 because 800 looked better than 450 :)
Now we need a leaderboard for this. And I suggest making 'Rated' a fourth skill choice (or maybe a third, with 'Beginner' removed), because anything but Serious with the Rated checkmark looks stupid. And add a queue, both for convenience and for protection from abuses (one player can observe what another is playing, then join his game with a counter-deck).
At this point I'm pretty sure this issue can be closed. A Glicko rating system has been implemented. Additional features or bugs related to it should be separate issues.
Since we started storing game histories, it's now relatively easy to implement an Elo rating system. We need to decide on a few items. Let's think about what we want at minimum; the original data is available, so we can change the definitions anytime.