Ratings system improvements

lightvector commented 8 years ago

Implement a better ratings system. Something WHR-based would be nice, including the ability to retroactively account for games that are unrated after being played. This would probably take some work on how to structure the back-end update process, and some thought on how to deal with unusual user behaviors, how bots should differ, global rating anchors and adjustments, etc.

mattj256 commented 8 years ago

It sounds like we need metric(s) for what constitutes a good rating system. For the record, a user on Arimaa.com recently changed the ratings of a number of bots by more than a hundred points each, single-handedly, by losing repeatedly to a weak bot and then winning repeatedly against stronger bots. Some possible metrics:

- A single user should not be able to have a disproportionate impact on another user's rating.
- The rating system should not be too CPU-intensive to compute.
- Two players' ratings should tell you something about the probability that one of them will win against the other.
- Ratings should be stable over time: the strength of a 1400 player today should be comparable to the strength of a 1400 player last week or last year.

Also, if a player wins repeatedly against a very weak opponent, should the winner's rating rise arbitrarily high or stabilize at some value?
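To make the "probability that one of them will win" metric concrete, here is a minimal sketch of one way to score it. It assumes an Elo-style logistic curve on a 400-point scale, which is not necessarily what the server uses, and the function names are made up for illustration. Lower mean log loss means the pre-game ratings predicted the results better.

```python
import math

def expected_score(rating_a, rating_b, scale=400.0):
    """Elo-style probability that A beats B (assumed logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

def mean_log_loss(games):
    """Average log loss over games.

    games: list of (rating_a, rating_b, a_won), where the ratings are the
    values each player had *before* the game was played.
    """
    total = 0.0
    for rating_a, rating_b, a_won in games:
        p = expected_score(rating_a, rating_b)
        # Penalize by the log of the probability assigned to what actually happened.
        total -= math.log(p if a_won else 1.0 - p)
    return total / len(games)
```

Two candidate rating systems could then be compared by replaying the same game history through each and seeing which one ends up with the lower loss.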

There's also a policy element here: the Free Internet Chess Server has explicit policies prohibiting certain types of cheating and rating manipulation. (For example, rules 12, 13, and 15.) http://www.freechess.org/Help/HelpFiles/abuse.html

For what it's worth, why not just implement plain old regular WHR? If you want to allow for retroactively unrating a game, I don't know if there's a better solution than periodically recomputing every single player's rating.

clyring commented 8 years ago

FWIW, also look at the line notes at https://github.com/lightvector/arimaa-server/commit/ded9d1aa743a2d594e2e60affd44b37227964d66 for some more thoughts on the computational end of this. Of your metrics:

- Disproportionate impact: For server bots, this should automatically become much harder once a better system is in place, because the strength of most current bots is constant or only hardware-dependent. Their ratings will effectively depend on a larger set of results and hence be less sensitive to any particular result. As for automatically mitigating the effects of abusive behavior, there are a number of things that can be tried, but these can be looked into more closely later, once we have a fairly solid foundation for the ratings.
- CPU-intensity: This is more a requirement than a measure. Anything we deem too CPU-expensive simply will not be used, regardless of other redeeming attributes.
- Predictive quality: This is one of the main measures of rating system quality, and one of the easiest to compute.
- Stability over time: This is a hard thing to guarantee, but it should be easier when we can actually guarantee that the ratings of fixed-strength bots are actually... fixed... for the sake of performing rating calculations, and when we have at least some information on the new players entering the system.
- The default behavior in most systems, absent strange-looking ad-hoc adjustments, is for a player who only ever wins against a fixed opponent to slowly diverge to +infinity in rating. (The arimaa.com ratings get around this only because the gameroom ratings are stored only to the nearest integer and never retroactively corrected.) Possibly this behavior will change as part of whatever we eventually decide to do for 'mitigating the effects of abusive behavior.'
- I do think we should have some word on abusive behavior somewhere in the official site policies.
- Plain old regular WHR is what we are implementing as a second step after the currently implemented system, which is closely related to the Glicko system.
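To illustrate the divergence point, here is a rough sketch using a plain Elo update with K = 32. That is an assumption chosen for simplicity, not the arimaa.com formula and not the Glicko-like system currently implemented, and `win_streak` is a made-up name. The player wins every game against an opponent whose rating is held fixed.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Elo-style probability that A beats B (assumed 400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

def win_streak(start, opponent, games, k=32.0, store_as_integer=False):
    """Rating after `games` straight wins against a fixed-rating opponent."""
    rating = float(start)
    for _ in range(games):
        # A win scores 1.0, so the gain shrinks as the expected score approaches 1.
        rating += k * (1.0 - expected_score(rating, opponent))
        if store_as_integer:
            # Once the gain drops below half a point it rounds away entirely.
            rating = float(round(rating))
    return rating

# Unrounded ratings keep creeping upward; integer-stored ratings stall.
print(win_streak(1500, 1500, 1000), win_streak(1500, 1500, 10000))
print(win_streak(1500, 1500, 1000, store_as_integer=True),
      win_streak(1500, 1500, 10000, store_as_integer=True))
```

With these assumed parameters the integer-stored rating stops moving roughly 700 points above the opponent, while the unrounded rating grows without bound, only more and more slowly.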

As I have a lot of personal experience working with ratings-related systems, this is probably something I will spend a lot of time tinkering with down the road.

Another consideration to keep in mind in the design of the rest of the site, to facilitate later rating system improvements: "Have at least some information on the new players entering the system." In particular, we will probably want to handle each of the following cases differently:

Bots:
- Server bots with fixed strength
- Server bots with hardware-dependent, but otherwise fixed strength
- Other bots (Perhaps with a prior strength estimate?)
Humans:
- First-time human beginners.
- Human players with some, but limited, off-server experience, and possibly no prior strength estimate. (i.e. I would support a question on the registration page: "Approximately how many games have you played?")
- Human players with significant off-server experience and a prior strength estimate.

So it would probably be best to track each of these things from the beginning.
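One possible shape for that tracking, as a sketch only: every name here (`PlayerKind`, `PlayerInfo`, the field names) is hypothetical and not taken from the arimaa-server code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class PlayerKind(Enum):
    FIXED_BOT = "server bot, fixed strength"
    HARDWARE_BOT = "server bot, hardware-dependent strength"
    OTHER_BOT = "other bot"
    BEGINNER_HUMAN = "first-time human beginner"
    LIMITED_HUMAN = "human with limited off-server experience"
    EXPERIENCED_HUMAN = "human with significant off-server experience"

@dataclass
class PlayerInfo:
    name: str
    kind: PlayerKind
    # Optional prior strength estimate, e.g. a rating carried over from elsewhere.
    prior_rating: Optional[float] = None
    # Optional answer to "Approximately how many games have you played?"
    games_played_elsewhere: Optional[int] = None

    def rating_is_pinned(self) -> bool:
        """Assumption: only fixed-strength server bots are held fixed in rating calculations."""
        return self.kind is PlayerKind.FIXED_BOT
```

Storing the category and any prior from registration onward would let a later rating system pick different starting uncertainties per case without having to guess after the fact.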

mattj256 commented 8 years ago

I'm out of my league here. I understand Glicko and WHR conceptually, but not well enough to implement them myself. And I definitely don't understand the math.

When I used to play games on Yahoo Games, I remember your rating was marked as "provisional" until you had played a certain number of games. If I remember correctly, they didn't publicly display the ratings for provisional players. (This could avoid the problem clyring mentioned where a new player with high uncertainty splits his first two games against the same opponent and his rating jumps by a large amount.)
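A minimal sketch of that display rule, assuming a cutoff of 20 rated games. The cutoff and the function name are placeholders, not what Yahoo Games actually used.

```python
def displayed_rating(rating, rated_games, provisional_games=20):
    """Text shown publicly for a player's rating.

    provisional_games is an assumed cutoff; below it the number is hidden.
    """
    if rated_games < provisional_games:
        return "provisional"
    return str(round(rating))
```

The underlying rating would still be updated after every game; only what other users see changes.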

If a new player is strong, it could be useful to allow them to start with a high rating, provided they meet some criteria like defeating a few high-rated bots or solving a few tactics puzzles. This is similar to colleges that have a foreign language requirement and allow you to place out of the requirement by demonstrating proficiency.

I could imagine setting things up so that for the first X games a new player's rating is updated normally, but the opponent's rating change is calculated retroactively after the system has a better estimate of the new player's true strength. (I don't know if that's a good idea or not.)
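As a rough sketch of what that deferred update might look like, again assuming a plain Elo update with K = 32 rather than anything the server actually implements (`settle_deferred` is a made-up name): the opponent's results against the newcomer are held back, then applied in one pass using the newcomer's later, better rating estimate.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Elo-style probability that A beats B (assumed 400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

def settle_deferred(opponent_rating, pending_scores, newcomer_rating, k=32.0):
    """Apply held-back results against a newcomer.

    pending_scores: the opponent's scores in those games (1.0 win, 0.0 loss).
    newcomer_rating: the newcomer's estimate after their first X games.
    """
    rating = opponent_rating
    for score in pending_scores:
        rating += k * (score - expected_score(rating, newcomer_rating))
    return rating

# A 1500 player who lost once to a newcomer later estimated at 2000
# is penalized less than if the newcomer had been taken to be 1500.
print(settle_deferred(1500.0, [0.0], 2000.0))
print(settle_deferred(1500.0, [0.0], 1500.0))
```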

lightvector / arimaa-server

Ratings system improvements #114