sublee / trueskill

An implementation of the TrueSkill rating system for Python
https://trueskill.org/

Inconsistent ratings when drawing #63

Open marcbr84 opened 1 day ago

marcbr84 commented 1 day ago

I'm using TrueSkill to try to create a rating system for a tennis tournament among my friends. Games are 1v1, so I'm trying out the following:

from trueskill import Rating, rate_1vs1
alice, bob = Rating(25), Rating(25)
print('#############')
print('No games')
print('#############')
print(alice)
print(bob)

alice, bob = rate_1vs1(alice, bob)
print('#############')
print('First game, winner alice')
print('#############')
print(alice)
print(bob)

alice, bob = rate_1vs1(bob, alice)
print('#############')
print('Second game, winner bob')
print('#############')
print(alice)
print(bob)

This outputs the following:

#############
No games
#############
trueskill.Rating(mu=25.000, sigma=8.333)
trueskill.Rating(mu=25.000, sigma=8.333)
#############
First game, winner alice
#############
trueskill.Rating(mu=29.396, sigma=7.171)
trueskill.Rating(mu=20.604, sigma=7.171)
#############
Second game, winner bob
#############
trueskill.Rating(mu=26.643, sigma=6.040)
trueskill.Rating(mu=23.357, sigma=6.040)

I would have expected either both players to have the same rating after these two games, or bob to have a higher rating, since according to the TrueSkill FAQ:

TrueSkill always takes more recent game outcomes more into account than older game outcomes.

but let's go with that for now; that's another issue.

However, if I remove the second game and replace it with a draw and re-run the thing:

alice, bob = rate_1vs1(bob, alice, drawn=True)
print('#############')
print('Second game, draw')
print('#############')
print(alice)
print(bob)

I get the following:

#############
First game, winner alice
#############
trueskill.Rating(mu=29.396, sigma=7.171)
trueskill.Rating(mu=20.604, sigma=7.171)
#############
Second game, draw
#############
trueskill.Rating(mu=23.886, sigma=5.678)
trueskill.Rating(mu=26.114, sigma=5.678)

bob seems to have a better rating after drawing than after winning. Not only that, but sigma also decreases more than when there was a winner, as if the draw were somehow more informative than a win or a loss.

What's going on here? What am I doing wrong?

bernd-wechner commented 22 hours ago

Sigma goes down with every new result. Every new result is more data, more information on which to base an assessment of skill. Sigma is the measure of uncertainty in the rating, and its job is to go down as more and more data (game results) are submitted. How far sigma moves has nothing to do with the result, because it is not a measure of skill but of confidence in the measure.
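
For instance, a quick sketch with the same rate_1vs1 helper (default environment; the exact values aren't the point) shows sigma shrinking on every result:

from trueskill import Rating, rate_1vs1

a, b = Rating(25), Rating(25)
for game in range(1, 6):
    a, b = rate_1vs1(a, b)  # a wins every game
    print(game, round(a.sigma, 3), round(b.sigma, 3))
# sigma goes down after each new result, no matter who won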

It's also an ordered algorithm, meaning there's a difference between a lose-then-win and a win-then-lose sequence. Each and every submitted result is processed based on the players' existing ratings and the outcome submitted.
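
For example, feeding the same two results in opposite orders gives mirrored final ratings, with the most recent winner ahead. A quick sketch:

from trueskill import Rating, rate_1vs1

# order 1: alice wins first, bob wins second
a, b = Rating(25), Rating(25)
a, b = rate_1vs1(a, b)  # the winner is always passed first
b, a = rate_1vs1(b, a)
print(a, b)  # bob ends up with the higher mu

# order 2: bob wins first, alice wins second
a, b = Rating(25), Rating(25)
b, a = rate_1vs1(b, a)
a, b = rate_1vs1(a, b)
print(a, b)  # alice ends up with the higher mu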

TrueSkill is not a scoring or reward system, if you like; it's a skill assessment system which provides more accurate assessments with more data, and it tries to respond as skill varies over time, if it's fed with evidence of that shift.

Check:

https://trueskill.info

In particular the help:

https://trueskill.info/help.html

marcbr84 commented 14 hours ago

Thank you very much for your answer.

I understand what sigma represents and therefore why it goes down with every new result. What I was pointing out is that sigma goes down more with a draw than with a win. As you can see in what I posted, if the second game is a win, sigma goes from 7.171 to 6.040; if the second game is a draw, sigma goes down to 5.678. There may be a reason for that, but I'd like to understand how, after only one game, TrueSkill considers a draw more informative than a win. To me, as a human, it seems off.
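
The comparison boils down to the two branches below (a minimal sketch of the two cases; the sigma values are the ones from my outputs above):

from trueskill import Rating, rate_1vs1

alice, bob = rate_1vs1(Rating(25), Rating(25))  # game 1: alice wins

bob_win, alice_win = rate_1vs1(bob, alice)                # branch A: bob wins game 2
bob_draw, alice_draw = rate_1vs1(bob, alice, drawn=True)  # branch B: game 2 is a draw

print(bob_win.sigma, bob_draw.sigma)  # 6.040 vs 5.678: the draw shrinks sigma more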

I know TrueSkill is an ordered system; that's why I said that, according to the FAQ I linked, it should assign a higher mu to bob as a result of the sequence. Since bob lost the first game but won the second, and if I understand correctly the most recent result is given more weight, the second game should count for more than the first one, resulting in a higher mu for bob than for alice.

I'm just very confused by the output of such a simple example, and I want to make sure I'm not doing anything wrong and/or that this isn't a hint of a bug.