sublee / trueskill

An implementation of the TrueSkill rating system for Python
https://trueskill.org/

What's the workflow? #45

Open ughstudios opened 3 years ago

ughstudios commented 3 years ago

I'm trying to understand the workflow for this TrueSkill tool.

The way I imagine doing matchmaking is that my master server will attempt to find players who are already in a lobby with ratings similar to mine, within some sort of offset. The first question is: what is a fair range of offsets for skill level? +50/-50? Second, what data should I be adding to my database tables so this can be saved?

Is the workflow like this:

  1. create match
  2. Find a winner (let the game play out)
  3. ?
  4. Submit results to db

Do you have an example workflow of how this could be used?

From reading the official TrueSkill documentation, it says that for a two-team 4v4 game you would need a minimum of 46 matches to determine, with a high degree of belief, a player's skill. Where is the number of matches stored in your code?

Should my database store both the sigma and mu values of each player? The official calculation I found online for player skill is mu - k*sigma where k = 3. I found this on the official Microsoft page.
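(For reference, here is a minimal sketch of what persisting a rating could look like, assuming you store mu, sigma and a games counter per player; the row layout and function names are made up for illustration, and Rating(mu=..., sigma=...) simply rebuilds a rating from stored values:)

    from trueskill import Rating

    def rating_from_row(row: dict) -> Rating:
        # Rebuild a Rating from the stored values.
        return Rating(mu=row['mu'], sigma=row['sigma'])

    def row_from_rating(r: Rating, games: int) -> dict:
        # Values to write back after a match, plus the conservative display skill.
        return {'mu': r.mu, 'sigma': r.sigma, 'games': games,
                'skill': r.mu - 3 * r.sigma}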

ughstudios commented 3 years ago

Don't we need to tell TrueSkill how many wins or losses a specific player has? How do I do this when performing the rating recalculation?

ughstudios commented 3 years ago
Code:

    import trueskill
    from trueskill import Rating, rate

    def start(self):
        trueskill.TrueSkill(backend='scipy').cdf  # builds a scipy-backed environment; the value is unused here
        r1 = Rating()
        r2 = Rating()

        t1 = [r1]
        t2 = [r2]
        print(r1)
        print(r2)
        (new_r1,), (new_r2,) = rate([t1, t2])
        print(new_r1)
        print(new_r2)

Output:

    trueskill.Rating(mu=25.000, sigma=8.333)
    trueskill.Rating(mu=25.000, sigma=8.333)
    trueskill.Rating(mu=29.396, sigma=7.171)
    trueskill.Rating(mu=20.604, sigma=7.171)

Why is the output from my above code modifying the sigma value when I've not done anything? Don't you need to tell it who the winner was??

ughstudios commented 3 years ago

What is the "ranking table"?

ughstudios commented 3 years ago

In order to make a leaderboard, wouldn't we need some way to compare each player's score against the others'? These all seem independent and related to a single match. However, we need to be able to determine the skill of each and every player so we can match them with players of similar score...

bernd-wechner commented 3 years ago

Don't we need to tell TrueSkill how many wins or losses a specific player has?

No. I suggest you read up on TrueSkill. I have no idea why you're trying to implement it with such a poor understanding of what it is and does. No offense intended, just genuinely curious. Here is an implementation of it:

http://trueskill.info/

It's a simple form that calls this library and reports results. The code is here for your reference:

https://github.com/bernd-wechner/TrueSkill-Server

To help you out a little: TrueSkill is a Bayesian model that, given an estimate of each player's skill (modelled as a Gaussian with a mean and standard deviation) and the result of a game (taking into consideration only the ranking, namely who came first, second, third and so on, not scores or any other measures), will return a new estimate of skill for each of those players.

The aim is to use the outcome of one game to update the estimate of skill we have for each of the players.

It's because you clearly seem not to know this that I wonder why you are here playing around with the TrueSkill Python implementation. Heungsub has written rather excellent documentation here:

https://trueskill.org/

precisely to answer all these questions you have.

Why is the output from my above code modifying the sigma value when I've not done anything?

Again, RTFM and you'll be fine. But to help out a little: when you provide the list [t1, t2] you are providing the rankings in a game, namely stating that t1 played a game against t2, t1 came first and t2 came second.

TrueSkill then calculates new ratings (skill estimates) based on the old ratings (skill estimates) given that outcome. The expected outcome is:

  1. The rating mean (mu) of the players in t1 goes up
  2. The rating mean (mu) of the players in t2 goes down
  3. The rating standard deviation (sigma) of all players goes down

Why? Again, read up about TrueSkill; the web is full of documentation. But in short:

  1. Because mu measures the most likely skill, and t1 beat t2, the players in t1 are probably better than the players in t2, so mu for the players in t1 goes up and mu for the players in t2 goes down.
  2. This game result constitutes one piece of evidence that t1 is better than t2. Because of this, our confidence in these expected values (mus) is higher. Standard deviation is the inverse of confidence; it measures uncertainty. We can say unequivocally that our uncertainty about the players' skills is now a little lower, so their standard deviations (sigmas) go down a little.
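To make the result explicit in code rather than relying on list order alone, rate() also accepts a ranks argument (lower rank means a better finish; equal ranks mean a draw). A minimal sketch:

    from trueskill import Rating, rate

    r1, r2 = Rating(), Rating()
    t1, t2 = [r1], [r2]

    # Implicit ranking: t1 came first, t2 came second.
    (new_r1,), (new_r2,) = rate([t1, t2])

    # The same game with the ranking spelled out (rank 0 beats rank 1).
    (new_r1,), (new_r2,) = rate([t1, t2], ranks=[0, 1])

    # A drawn game: both teams placed equal first.
    (draw_r1,), (draw_r2,) = rate([t1, t2], ranks=[0, 0])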

If you want leaderboards, yes, you track player ratings. Every time they play a game you update those ratings. The TrueSkill rating is typically µ − (µ0 ÷ σ0) × σ, and it is explained well enough here:

http://trueskill.info/help.html

Then yes, you match players of comparable ratings with one another.
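One way to quantify "comparable ratings" is the library's quality_1vs1() (or quality() for team games), which returns the draw probability of a proposed match and so doubles as a fairness measure. The 0.50 cutoff below is only a rule of thumb, not something mandated by the algorithm:

    from trueskill import Rating, quality_1vs1

    alice = Rating(mu=29.396, sigma=7.171)
    bob = Rating(mu=20.604, sigma=7.171)

    # Draw probability as a fairness measure for matchmaking.
    if quality_1vs1(alice, bob) > 0.50:
        print('pairing looks fair, schedule the match')
    else:
        print('pairing looks lopsided, keep searching')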

Hope that's some help. But this is not a TrueSkill help forum; it's just a Python package kindly written by Heungsub Lee and gifted to the Python community.

ughstudios commented 3 years ago

Thanks for the response. I appreciate it.

Then yes, you match players of comparable ratings with one another.

Yeah that makes sense. So I would match people who are within a range with some sort of allowed deviation.

Out of curiosity, the stuff I've read so far is the following:

https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fprojects%2Ftrueskill%2Fcalculators.aspx
https://trueskill.org/
http://www.moserware.com/2010/03/computing-your-skill.html

Are the resources I've been using so far good ones?

I guess I'm trying to understand how I should represent a player's score to the player. A lot of games use Elo; for example, StarCraft 2 has an MMR value which is displayed to you. What value would I be displaying to players on a leaderboard so they know where they stand? The calculation of TrueSkill values is a bit difficult because there are two values, µ and sigma. The Microsoft site says the rating should be displayed as:

mu - (k * sigma)

but this would be a small value, some number between 0 and 25, which I don't feel gives a player enough of a sense of where they stand. Also, TrueSkill only becomes accurate after 46 games for a two-team 4v4 game, according to the official website.

So a player who has just started out will appear much better than they are. Also, the "place" in terms of that ranks variable, is that basically "first place" and "second place", where 0 is the winner and 1 is the loser?

I know you mentioned that this is not a help forum, but you have been very helpful. Do you know where I could go to ask additional questions?

I was playing around with the website you made and noticed something:

Why does the rating go down to 0 for the second player?

bernd-wechner commented 3 years ago

mu - (k * sigma) is exactly what I suggested as µ − (µ0 ÷ σ0) × σ, where k = (µ0 ÷ σ0).

And with the defaults proposed by TrueSkill's authors (µ0 = 25 and σ0 = 25/3 ≈ 8.333), k = 3, as can be read in the paper or as described here:

http://trueskill.info/help.html

The paper is not needed unless you're keen on papers and math, and the references you have are all excellent.

This gives all players a default, or initial, rating of 0. Ratings at the top of our leaderboards are currently hovering around the 20 mark.
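Put as code, with the default environment (µ0 = 25, σ0 = 25/3, hence k = 3), the leaderboard rating looks roughly like this; the function name is mine, not the library's:

    from trueskill import Rating

    def leaderboard_rating(r: Rating, k: float = 3.0) -> float:
        # Conservative skill estimate: mean minus k standard deviations.
        return r.mu - k * r.sigma

    print(leaderboard_rating(Rating()))                        # new player: ~0
    print(leaderboard_rating(Rating(mu=29.396, sigma=7.171)))  # roughly 7.9

If memory serves, the library also ships an expose() helper that computes the same quantity from the environment's own parameters, but the explicit formula keeps the sketch self-contained.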

but this would be a small value, some number between 0 and 25, which I don't feel gives enough of an understanding for a player. Also, the trueskill only becomes accurate after 46 games for a 2 team 4v4 game according to the official website.

No, it does not guarantee 0 to 25, and this is trivial to change, but consider that the rating is a real number, not an integer. For example, on one of our leaderboards ratings range from -2.9 to 24.7. We choose to display them to one decimal place. The leader's rating might well be 24.68756342, for example.

If you want to change the range you can tune all of the TrueSkill parameters, which are again well described here:

http://trueskill.info/help.html

(In fact beta should be tuned over time from recorded results, though the authors fail to document how, and I have yet to get a response from them on that subject.)

But it is child's play to scale the existing rating after the fact as you wish. You can multiply it by 4, or 10, or whatever, and present it as an integer rather than a real number.
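For example, a display value on an arbitrary scale (the factor of 40 here is made up, purely for illustration) could be as simple as:

    from trueskill import Rating

    def display_rating(r: Rating, scale: float = 40.0) -> int:
        # Scale the conservative rating and round it for presentation.
        return round(scale * (r.mu - 3 * r.sigma))

    print(display_rating(Rating(mu=29.396, sigma=7.171)))  # about 315 on this made-up scale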

Why does the rating go down to 0 for the second player?

It didn't go down to 0; it was 0 to begin with, it just didn't go up. Old rating 0, new rating 0, change 0.

Note that ratings can and do go negative. That's because all players start with a rating of 0, and when unknown players compete, some players' ratings go up and the other players' ratings go down. I've never seen ours drop below about -4 ... In a 2-player game they don't go negative; in a six-player game the loser will go negative.

You can certainly play around with that site and in fact interact with it using URL only (as an API) to get a feel for how TrueSkill behaves. Good for debugging as well ;-). And the code is all open and you can host it yourself too for fun ... all possible.

Bear in mind TrueSkill is a numeric implementation of a Bayesian learning system for random variables ... nice and wordy what? The point being that it is not a simple algorithm and its results and behaviours can take some interpretation.

It's not without bugs though, notably:

  1. Ties are not modelled well at all. If you experiment with, say, a four-player game between total newbs (all 0 ratings) who all tie (finish equal first), they should receive identical rating updates. But they don't. The algorithm doesn't guarantee this, alas, and it's a known bug of low consequence (see the sketch after this list).

  2. It models skill decay very poorly. It's functional, in a way, for computer games that rack up a lot of plays with fairly short and/or consistent time between them. But it's a shockingly poor model. That is discussed here: https://github.com/bernd-wechner/CoGs/issues/13
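To see the tie behaviour mentioned in point 1 for yourself, rate a four-player free-for-all where everyone finishes equal first and compare the four updated ratings; a quick sketch:

    from trueskill import Rating, rate

    # Four brand-new players, all placed equal first (ranks are all 0).
    groups = [(Rating(),), (Rating(),), (Rating(),), (Rating(),)]
    for (r,) in rate(groups, ranks=[0, 0, 0, 0]):
        print(r)   # per the comment above, these are not guaranteed to be identical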

An inescapable property of the TrueSkill algorithm, of all Bayesian algorithms, and in fact of any sensible skill modelling system, is that the results are not order independent. That is, the order of game results impacts ratings significantly. Again, this is of low concern to computer games, as the computer logs the game and its time and can submit results and get rating updates immediately. For games where the results are entered manually into a rating system, there is a risk of order betrayal ... that is, getting the order wrong.

Why is this relevant? Because this:

Jack beats Jill
Jill beats Jack

gives Jack and Jill different ratings to:

Jill beats Jack
Jack beats Jill

This is a natural property of sensible skill modelling (and of Elo and TrueSkill, which are attempts at that, applying Bayes' Theorem for updates). But on any manual-entry system it has some dire consequences in terms of data integrity.
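A quick way to see this with the library is to feed the same two results to rate_1vs1() in both orders and compare the final ratings:

    from trueskill import Rating, rate_1vs1

    # Order A: Jack beats Jill, then Jill beats Jack.
    jack, jill = Rating(), Rating()
    jack, jill = rate_1vs1(jack, jill)   # Jack wins
    jill, jack = rate_1vs1(jill, jack)   # Jill wins
    print('A:', jack, jill)

    # Order B: the same two results, reversed.
    jack, jill = Rating(), Rating()
    jill, jack = rate_1vs1(jill, jack)   # Jill wins
    jack, jill = rate_1vs1(jack, jill)   # Jack wins
    print('B:', jack, jill)

    # Each player has one win and one loss in both orders,
    # yet the two printouts differ.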

If you want TrueSkill help, though, I'd strongly recommend StackOverflow over GitHub issues. The authors themselves are known to help out there. For example:

https://math.stackexchange.com/questions/4112262/trueskill-the-probability-of-a-predicted-or-observed-draw/4147013#4147013

I have yet to reply to Tom there ;-). But cherish his response: it's a good sample of where too much delving into the math behind TrueSkill will lead you (I'm exploring draw modelling).

bernd-wechner commented 3 years ago

I'm guessing you can close this issue now?

fsmosca commented 2 years ago

In order to make a leaderboard, wouldn't we need some way to compare each player's score against the others'? These all seem independent and related to a single match. However, we need to be able to determine the skill of each and every player so we can match them with players of similar score...

Here is a sample leaderboard after 3 games between 3 players. It features the conservative, or minimum, rating. It also shows the maximum rating and the number of games.

The rating column is the mean or average rating. By looking at minrating and maxrating, one can estimate whether a pairing is reasonable: if the minrating of the stronger player is above the maxrating of the other player, then the pairing is bad (a sketch of this check follows the code below). Players can just look at minrating and maxrating to see their limits.

output

TrueSkill Leaderboard
     name     rating  uncertainty  minrating  maxrating  games
1    John  25.393947     5.624126   8.521569  42.266326      2
2   Peter  25.045753     6.265002   6.250747  43.840760      2
3  Sandro  22.678275     5.525785   6.100921  39.255629      2

uncertainty_factor=3

code

"""
sample.py
"""

from trueskill import Rating, rate_1vs1
import pandas as pd

def main():
    UNCERTAINTY_FACTOR = 3

    p1 = Rating()
    p2 = Rating()
    p3 = Rating()

    p1g, p2g, p3g = 0, 0, 0

    # Report game result from 3 games.
    p1, p2 = rate_1vs1(p1, p2)
    p1g += 1
    p2g += 1

    p2, p3 = rate_1vs1(p2, p3)
    p2g += 1
    p3g += 1

    p1, p3 = rate_1vs1(p1, p3, drawn=True)
    p1g += 1
    p3g += 1

    # Build a rating list

    # Prepare data input.
    input_1 = {
        'name': "John",
        'tskill': p1,        
        'games': p1g,
    }
    input_2 = {
        'name': "Peter",
        'tskill': p2,
        'games': p2g,
    }
    input_3 = {
        'name': "Sandro",
        'tskill': p3,
        'games': p3g,
    }

    inputs = []
    inputs.append(input_1)
    inputs.append(input_2)
    inputs.append(input_3)

    # Create a dictionary of player data.
    dic = {}
    for i in inputs:
        dic.update({i['name']: i})

    pdata = []
    for n, v in dic.items():
        name = n
        mean_trueskill = v['tskill'].mu
        uncertainty = v['tskill'].sigma
        games = v['games']

        min_trueskill = mean_trueskill - UNCERTAINTY_FACTOR * uncertainty  # conservative
        max_trueskill = mean_trueskill + UNCERTAINTY_FACTOR * uncertainty
        pdata.append({'name': name, 'rating': mean_trueskill, 'uncertainty': uncertainty,
                      'minrating': min_trueskill, 'maxrating': max_trueskill, 'games': games})

    # Convert pdata to dataframe to easier handle the table.
    df = pd.DataFrame(pdata)
    df = df.sort_values(by=['rating', 'uncertainty', 'games'], ascending=[False, True, False])  # high rating first
    df = df.reset_index(drop=True)
    df.index += 1

    print('TrueSkill Leaderboard')
    print(df.to_string())
    print()

    print(f'uncertainty_factor={UNCERTAINTY_FACTOR}')

if __name__ == "__main__":
    main()
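The pairing rule described above (a match is suspect when one player's minrating exceeds the other's maxrating) could be sketched as a small helper on top of the same UNCERTAINTY_FACTOR; the function name is mine, not from the script:

    from trueskill import Rating

    UNCERTAINTY_FACTOR = 3

    def is_reasonable_pairing(a: Rating, b: Rating) -> bool:
        # Bad pairing: one player's conservative (min) rating is above
        # the other player's optimistic (max) rating.
        a_min, a_max = a.mu - UNCERTAINTY_FACTOR * a.sigma, a.mu + UNCERTAINTY_FACTOR * a.sigma
        b_min, b_max = b.mu - UNCERTAINTY_FACTOR * b.sigma, b.mu + UNCERTAINTY_FACTOR * b.sigma
        return not (a_min > b_max or b_min > a_max)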