sublee / trueskill

An implementation of the TrueSkill rating system for Python
https://trueskill.org/

Tied teams with same initial ratings have slightly different ratings after trueskill calc. #13

Open · bernd-wechner opened this issue 7 years ago

bernd-wechner commented 7 years ago

This was a subtle observation while testing a site I'm building. Here is some Python code to demonstrate it:

#!/usr/bin/python3
import trueskill

# Environment parameters (the library defaults, written out explicitly).
mu0 = 25.0
sigma0 = 8.333333333333334
beta = 4.166666666666667
delta = 0.0001
tau = 0.0833333333333333
p = 0.1
TS = trueskill.TrueSkill(mu=mu0, sigma=sigma0, beta=beta, tau=tau, draw_probability=p)

# Three two-player teams, keyed by player number; everyone starts from the same rating.
oldRGs = [{10: trueskill.Rating(mu=25.000, sigma=8.333), 11: trueskill.Rating(mu=25.000, sigma=8.333)}, {8: trueskill.Rating(mu=25.000, sigma=8.333), 3: trueskill.Rating(mu=25.000, sigma=8.333)}, {9: trueskill.Rating(mu=25.000, sigma=8.333), 6: trueskill.Rating(mu=25.000, sigma=8.333)}]

# Partial-play weights, keyed by (team index, player number); all 1.0.
Weights = {(1, 8): 1.0, (1, 3): 1.0, (0, 10): 1.0, (2, 9): 1.0, (0, 11): 1.0, (2, 6): 1.0}

# Teams 1 and 2 tie for first place; team 3 comes second.
Ranking = [1, 1, 2]
newRGs = TS.rate(oldRGs, Ranking, Weights, delta)
print(newRGs)

When I run this it produces:

[{10: trueskill.Rating(mu=26.804, sigma=7.250), 11: trueskill.Rating(mu=26.804, sigma=7.250)}, {8: trueskill.Rating(mu=26.808, sigma=7.249), 3: trueskill.Rating(mu=26.808, sigma=7.249)}, {9: trueskill.Rating(mu=21.387, sigma=7.576), 6: trueskill.Rating(mu=21.387, sigma=7.576)}]

In summary:

3 teams compete; 2 teams tied for first place. Players are identified by a number:

Team 1 has players 10 and 11
Team 2 has players 8 and 3
Team 3 has players 9 and 6

The partial-play weights are all 1 and all players start with trueskill.Rating(mu=25.000, sigma=8.333).

Given teams 1 and 2 tied, I expect the TrueSkill ratings of players 10, 11, 8 and 3 to all be updated identically. And yet, after running trueskill.rate we would have players 10 and 11 complaining that their rating is now:

trueskill.Rating(mu=26.804, sigma=7.250)

while players 8 and 3 with whom they tied now have:

trueskill.Rating(mu=26.808, sigma=7.249)
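
The difference is tiny but nonzero, and easy to print directly by continuing the script above (player keys as in the example; the expected values come from the output shown earlier):

# Continuing the script above: compare a player from each of the tied teams.
r1 = newRGs[0][10]  # player 10, team 1
r2 = newRGs[1][8]   # player 8, team 2 (tied with team 1)
print(r1.mu - r2.mu, r1.sigma - r2.sigma)  # roughly -0.004 and +0.001, not zero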

I suspect a floating-point precision issue is at play. But the integrity issue remains: tied teams expect identical TrueSkill updates if they start from the same skill.

I'm not sure that expectation holds when they have disparate initial skill ratings, so in practice this may never be noticed. But it does raise two questions:

1) What causes it?
2) What should be done about it?

sublee commented 7 years ago

The official TrueSkill calculator by Microsoft Research produces exactly the same result.

(screenshot: output of the official TrueSkill calculator, matching the numbers above)

I'm not sure why, but I guess there's some weakness in the factor graph. In any case, I don't think we can fix this problem without an enhanced TrueSkill algorithm.

bernd-wechner commented 7 years ago

Looks like the MS implementation is true to the fault.

I can think of one provisional solution, if we call it a flaw in the algorithm: merge all tied teams into one team before rating, and split them out again before returning. So imagine:

Team 1 tied with Team 2 for first place.
Team 3 came second.
Team 4 tied with Team 5 for third place.

Compress this to three teams that came 1st, 2nd, and 3rd respectively, rate as usual, and when complete, split the players back into their original teams before returning to the caller.

All duly flagged as a workaround for a known bug in the algorithm that produces slight differences for tied teams. Just a thought. I can implement this as a wrapper myself (a rough sketch follows below), or integrate it and send you a pull request.

It's not a major issue, as it's a very small discrepancy, probably only visible when players all have the same ratings to begin with. But who knows?
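
For concreteness, here's a rough sketch of that wrapper. The name rate_merging_ties is hypothetical; it assumes tuple rating groups and the library's rate() signature, and it ignores partial-play weights:

import itertools
import trueskill

def rate_merging_ties(env, rating_groups, ranks):
    # Sketch of the proposed workaround: merge tied teams into a single
    # group, rate the merged groups, then split the results back into
    # the original team shapes.
    order = sorted(range(len(rating_groups)), key=lambda i: ranks[i])
    merged, layout = [], []
    for _, idxs in itertools.groupby(order, key=lambda i: ranks[i]):
        idxs = list(idxs)
        merged.append(tuple(r for i in idxs for r in rating_groups[i]))
        layout.append([(i, len(rating_groups[i])) for i in idxs])
    rated = env.rate(merged, ranks=list(range(len(merged))))
    # Split each merged group back into the original teams.
    out = [None] * len(rating_groups)
    for group, teams in zip(rated, layout):
        pos = 0
        for i, n in teams:
            out[i] = tuple(group[pos:pos + n])
            pos += n
    return out

# e.g. three two-player teams where the first two tie:
env = trueskill.TrueSkill()
a, b, c, d, e, f = [trueskill.Rating() for _ in range(6)]
print(rate_merging_ties(env, [(a, b), (c, d), (e, f)], [0, 0, 1]))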

sublee commented 7 years ago

In my opinion, merging tied teams would not fix this problem.

Let's take the first case you described. There are 3 teams: A, B, and C. Teams A and B have rank 0, and C has rank 1. In this case, C risks less under (A+B < C), your proposed merge, than under (A = B < C), the default behavior, because C has fewer members than the merged A+B.

See this example:

In [1]: from trueskill import *

In [2]: a, b, c, d, e, f = [Rating() for x in range(6)]

In [3]: rate([(a, b), (c, d), (e, f)], ranks=[0, 0, 1])
Out[3]:
[(trueskill.Rating(mu=26.805, sigma=7.251),
  trueskill.Rating(mu=26.805, sigma=7.251)),
 (trueskill.Rating(mu=26.808, sigma=7.250),
  trueskill.Rating(mu=26.808, sigma=7.250)),
 (trueskill.Rating(mu=21.387, sigma=7.577),
  trueskill.Rating(mu=21.387, sigma=7.577))]

In [4]: rate([(a, b, c, d), (e, f)], ranks=[0, 1])
Out[4]:
[(trueskill.Rating(mu=25.126, sigma=8.283),
  trueskill.Rating(mu=25.126, sigma=8.283),
  trueskill.Rating(mu=25.126, sigma=8.283),
  trueskill.Rating(mu=25.126, sigma=8.283)),
 (trueskill.Rating(mu=24.874, sigma=8.283),
  trueskill.Rating(mu=24.874, sigma=8.283))]      
bernd-wechner commented 7 years ago

Hmmm, that sure is interesting. It does seem to solve the problem of uniform (unbiased) handling of all those who drew, but as you note it seems to reduce the impact of the match. The equivalent of raising Beta and lowering Tau, it seems (reducing the impact of this match on the ratings as a whole). I'm guessing you understand the algorithm better than I do.

Might be an idea to throw this question at MS Research. I am curious whether it's a design issue or a precision issue. Either way, it breaks an expectation around draws.

sublee commented 7 years ago

A and B also got less benefit because of their combined mass. So I don't think it's about the impact of the match; it's about fairness among the teams.

I vote for "a design issue". My guess is that to solve it the right way, we would have to re-design the flow in the factor graph.

But I'm no more of an expert here than Microsoft Research, so I agree it's better to ask them.

bernd-wechner commented 7 years ago

Yep, a design issue. I chatted with the TrueSkill project lead, and one problem with the design is that ties will only give the same TrueSkill impact "within the draw margin" (his words). I think that relates to the probability of a draw, but I'm still discussing it with him. He said it could be fixed by reworking the factor graph. But this is getting too deep for me now, and I'd need to go do some reading and learning. So for now, I will have to accept this property of TrueSkill.
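
For anyone landing here later: as far as I can tell, the "draw margin" is not the draw probability itself but a margin epsilon derived from it, per the TrueSkill paper. If I read the library source right, it exposes this as calc_draw_margin(); a small sketch of the relationship:

import trueskill

# The draw margin eps is derived from the configured draw probability
# and the number of players being compared (TrueSkill paper):
#   eps = Phi^-1((p_draw + 1) / 2) * sqrt(n_players) * beta
env = trueskill.TrueSkill(draw_probability=0.1)
eps = trueskill.calc_draw_margin(env.draw_probability, 4, env=env)  # 4 players = 2v2
print(eps)  # team performances within +/- eps of each other count as a draw

So "within the draw margin" would mean the band of performance differences that the model treats as a tie, which is set by the configured draw probability and the team sizes.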