sublee / trueskill

An implementation of the TrueSkill rating system for Python
https://trueskill.org/

How to tell TrueSkill that some 1v1 matchups are worth "more" than others? #20

Closed Zamiell closed 6 years ago

Zamiell commented 6 years ago

Story time:

Imagine a tennis league with two kinds of courts. Clay courts behave normally, but grass courts (hypothetically) make the ball bounce randomly, so upsets are much more common there: the more skilled player only wins around 60-65% of the time.

Now, I want to feed all of the league matches into this TrueSkill Python library for the purposes of calculating a skill leaderboard. But if I just feed in all of the matches, it won't be accurate, because the worst player in the league happened to beat the best player in the league when they played on a grass court. Is there a way to tell TrueSkill that one match carries less confidence than another?

sublee commented 6 years ago

There are 2 perspectives:

  1. You have 2 separate games (or game modes). Each court provides a different physics environment. If you see it that way, keep a separate rating per player for each game. For example, Overwatch has "casual matchmaking" and "competitive play" modes; it probably doesn't manage one rating across all modes.

  2. You have 1 game that just has 2 maps. Each map provides its own challenge or environment. If you see it that way, manage only one rating per player across all maps. Over time, the rating will still converge to an accurate skill estimate.

The solution depends on your perspective. In my case, I made a racing game with 4 modes and 10 tracks (maps). I managed 4 ratings per player, one for each game mode, not per track. You will find your own solution.
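If you follow perspective 1, one simple way to organize per-mode ratings (a sketch with a toy update rule standing in for TrueSkill's actual math; all names are illustrative) is to key the rating store by (player, mode):

```python
from collections import defaultdict

# Toy rating store: one independent rating per (player, mode) pair.
# The Elo-style flat update below is only a stand-in for TrueSkill's
# real Bayesian update, to illustrate keeping the modes separate.
ratings = defaultdict(lambda: 25.0)  # 25.0 mirrors TrueSkill's default mu

def record_win(winner, loser, mode, k=2.0):
    """Update only the ratings for the mode the game was played in."""
    ratings[(winner, mode)] += k
    ratings[(loser, mode)] -= k

record_win('alice', 'bob', 'clay')
record_win('bob', 'alice', 'grass')

# alice's clay rating moved up, while her grass rating moved independently
print(ratings[('alice', 'clay')], ratings[('alice', 'grass')])  # 27.0 23.0
```

With the real library you would store one trueskill Rating object per (player, mode) key in the same way, and update the pair for the relevant mode after each game.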

Zamiell commented 6 years ago

Suggestion 2 is to just not care about the difference between grass and clay courts. That obviously isn't desirable; it's the entire reason I opened this thread in the first place, because such a leaderboard would be fraught with inaccuracies.

Suggestion 1 is to keep two separate ratings. This isn't desirable either; allow me to explain why.

In your racing example, each game mode has a separate rating. That makes sense, because the rules change between modes, and it stands to reason that a player who understands and is very skilled in one mode may not be as good in an entirely different mode. Another good example of this would be normal Chess versus Chess960.

However, in my (hypothetical) example, this concept does not hold. The skill of players on clay courts is perfectly correlated with their skill on grass courts: same rules, same physics, same everything. The only difference is that the "luck" factor (or the "confidence" factor) in the outcome changes, because of the inserted randomness. So if I were to keep two separate ratings, one for grass and one for clay, the grass court rating would be essentially useless: assuming players play an equal number of games on both court types, you would ONLY ever want to look at the clay rating, because it would be more accurate. Furthermore, the clay rating would miss out on all of the good input from the grass courts, because even though the grass court games are noisy, each one still gives the more skilled player a 60-65% chance of winning.

So hopefully you can see now that I do not want to have separate ratings, and I don't want to just discard thousands of grass court games that would help "refine" the leaderboard to make it more accurate. Is this just a specific type of situation that TrueSkill is not equipped to handle?

Finally, let me ask: would it be a good idea to feed in the results for clay courts three times, and grass courts only once? (on the assumption that the more skilled player is three times as likely to win on clay versus grass) I'm not sure if that would distort how TrueSkill is supposed to work, or messes up the math, or whatnot.

sublee commented 6 years ago

Q1. What's your purpose? Are you trying to use TrueSkill ratings as scores on a leaderboard? Would your players see and manage their score?

Q2. Why do you think a worse player has more win probability on a grass court?

Zamiell commented 6 years ago

1) Yes, I am trying to use TrueSkill to generate a leaderboard from thousands of matches. Yes, my players would be able to see their score. I don't know what you mean by "manage", though.

2) The example in my initial post is hypothetical. I'm not actually claiming that in real life, grass courts make the ball bounce randomly. The whole point is that I'm trying to frame my real use case in a hopefully-easy-to-understand situation. For the purposes of this discussion, you should take it as a given that a worse player has a greater probability of winning on a grass court. (In my real use case, there is a similar random mechanic that I have no control over.)

sublee commented 6 years ago

I don't know what you mean by "manage" though.

I meant they should try to increase their scores.

Actually, I don't believe a TrueSkill rating should be a score on a public leaderboard. Its goal is to measure a player's skill as a number from as few samples as possible, because it is designed for matchmaking, not leaderboards. To implement an accurate matchmaker, it's better to hide ratings from players to keep them pure.

Of course, if you aren't implementing a matchmaker, you can use TrueSkill ratings as a score. But it will probably be hard to account for the difference between the courts. If you need a score for a public leaderboard, consider designing your own scoring system on top of the per-court ratings.

Zamiell commented 6 years ago

1)

it's better to hide ratings from players to keep them pure.

Can you explain why? Would there be some situation where a player would want to deliberately lose after seeing what their point value is, for example? In the context of this conversation, we should assume that players want to get as high on the leaderboard as possible by default, so I'm confused as to why this should matter.

2)

Would it be a good idea to feed in the results for clay courts three times, and grass courts only once? (on the assumption that the more skilled player is three times as likely to win on clay versus grass) I'm not sure if that would distort how TrueSkill is supposed to work, or messes up the math, or whatnot.

Can you answer this question?

3)

From the trueskill.org documentation:

expose(rating)
    Returns the value of the rating exposure. It starts from 0 and converges to the mean. Use this as a sort key in a leaderboard:

        leaderboard = sorted(ratings, key=env.expose, reverse=True)

    New in version 0.4.

If TrueSkill is not meant to be used in a leaderboard, why is this in the documentation?

sublee commented 6 years ago

Good point. Actually, I had forgotten about expose() and misunderstood TrueSkill's original purpose. The paper mentions the leaderboard use case directly. So yes, expose() is designed for leaderboards.
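For context, the exposure the library reports is, as far as I can tell, roughly mu minus k standard deviations, with k chosen so that a brand-new rating exposes to exactly 0. A pure-Python sketch of that idea:

```python
MU0, SIGMA0 = 25.0, 25.0 / 3  # TrueSkill's default rating parameters

def expose(mu, sigma, k=MU0 / SIGMA0):
    """Conservative leaderboard score: mean minus k standard deviations.

    With k = mu0/sigma0 (= 3 here), a brand-new, maximally uncertain
    rating exposes to 0, and the score converges to mu as sigma
    shrinks over more games.
    """
    return mu - k * sigma

players = {'alice': (30.0, 1.5), 'bob': (32.0, 7.0)}  # (mu, sigma)
leaderboard = sorted(players, key=lambda p: expose(*players[p]), reverse=True)
print(leaderboard)  # ['alice', 'bob']: bob's high sigma ranks him below alice
```

This is why exposure works as a leaderboard key: a player with few games (high sigma) is ranked conservatively low until the system is confident about them.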

Anyway, I still believe TrueSkill should not be used for leaderboards. My perspective is based on my online game experience. When players can see their ratings, they often try to cheat to achieve overestimated ratings. For example, less skilled players who group with a single highly skilled player get more chances to win against more skilled players. Or, more simply, a more skilled player plays on a less skilled player's account to bring it up. I think manipulated ratings degrade matchmaking quality. But maybe that is an unfounded fear.

Before I answer, I have to check my understanding.

would it be a good idea to feed in the results for clay courts three times, and grass courts only once? (on the assumption that the more skilled player is three times as likely to win on clay versus grass)

In pseudo code:

if court == 'clay':
    rate_game_result()
elif court == 'grass':
    if random() < 1/3:
        rate_game_result()

Did you mean to sample only 1/3 of grass court games while processing every clay court game?

Also, is only the more skilled player's win probability multiplied 3 times on clay courts compared to grass courts? If so, do less skilled players have more chance to win on grass courts? Something like:

>>> win_probability(better_player, worse_player, 'clay')
0.9
>>> win_probability(worse_player, better_player, 'clay')
0.1
>>> win_probability(better_player, worse_player, 'grass')
0.3
>>> win_probability(worse_player, better_player, 'grass')
0.7

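For reference, TrueSkill's 1v1 win probability can be computed from the two skill Gaussians; the court parameter in the snippet above is purely hypothetical and not part of TrueSkill. A sketch of the standard formula, assuming the default beta:

```python
import math
from statistics import NormalDist

BETA = 25.0 / 6  # TrueSkill's default performance noise

def win_probability(mu1, sigma1, mu2, sigma2, beta=BETA):
    """P(player 1 beats player 2) under TrueSkill's Gaussian skill model.

    The difference of the two performance distributions is itself
    Gaussian, so the win probability is the CDF of that difference at 0.
    """
    delta_mu = mu1 - mu2
    denom = math.sqrt(2 * beta**2 + sigma1**2 + sigma2**2)
    return NormalDist().cdf(delta_mu / denom)

print(win_probability(25, 8.3, 25, 8.3))  # equal skills -> exactly 0.5
print(win_probability(35, 2.0, 15, 2.0))  # large mu gap -> near-certain favorite
```

Note that under this model the higher-mu player always has probability above 0.5, which is exactly why a court where the worse player is favored cannot be expressed inside one TrueSkill rating.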
Zamiell commented 6 years ago

My perspective is based on my online game experience. When players can see their ratings, they often try to cheat to achieve overestimated ratings. For example, less skilled players who group with a single highly skilled player get more chances to win against more skilled players. Or, more simply, a more skilled player plays on a less skilled player's account to bring it up. I think manipulated ratings degrade matchmaking quality. But maybe that is an unfounded fear.

For my purpose (tennis), there are only 1v1 matchups. Since there is no team-based play, there is no possible way to cheat in the manner you describe here.

Did you mean to sample only 1/3 of grass court games while processing every clay court game?

No. This is what I had in mind, but I have no idea if it is a good solution or not:

for game in game_list:
    if game['court'] == 'clay':
        # feed clay results three times to weight them more heavily
        for i in range(3):
            calculate_new_trueskill(game['winner'], game['loser'])
    elif game['court'] == 'grass':
        calculate_new_trueskill(game['winner'], game['loser'])

I was hoping there would be some more elegant way to handle this problem provided directly by TrueSkill. =(
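One way to see why triple-feeding distorts the math (a toy Elo-style update standing in for TrueSkill here): repeating a result three times is not the same as one result with triple weight, because each repetition starts from the already-moved ratings.

```python
def elo_update(r_winner, r_loser, k=32):
    """One Elo-style rating update; a stand-in for a generic rating system."""
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected)
    return r_winner + delta, r_loser - delta

w, l = 1500.0, 1500.0
single_delta = elo_update(w, l)[0] - w  # gain from rating the game once

# Feed the same win three times: each pass sees a larger rating gap,
# so each update is smaller, and the total gain is less than 3x one update.
for _ in range(3):
    w, l = elo_update(w, l)
print(w - 1500.0 < 3 * single_delta)  # True
```

With TrueSkill specifically there is a second distortion: every repeated feed also shrinks both players' sigma, so the system becomes overconfident about ratings that were only inflated by duplication.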

sublee commented 6 years ago

Can you also answer the questions about the 3x win probability?

Zamiell commented 6 years ago

Is only the more skilled player's win probability multiplied 3 times on clay courts compared to grass courts?

Not sure what this means. For the purposes of this example scenario, we can arbitrarily say that a more skilled player has a 3x chance of winning on a clay court. However, in real life, this wouldn't be the case. The numerical multiplier would shift depending on the skill disparity between the two players. If they were very close in skill, it might only be a 0.25x multiplier, for example. And then it would shift larger and larger as the skill disparity increases, all the way up to 3x, or something along those lines. Hopefully that makes sense. ><

If so, do less skilled players have more chance to win on grass courts?

Yes. When facing a stronger opponent, it is always advantageous for a less skilled player to play on a grass court (by definition).

sublee commented 6 years ago

In your game, one TrueSkill rating would mean different things on clay courts and grass courts. In TrueSkill, a more skilled player should always have a higher chance of winning against a less skilled player. Your attempt is like sharing one skill detector between a sprint race and a car race.

The "3x more sampling" you asked about feels wrong to me. In my opinion, you should keep a separate rating for each court, and then design an aggregation algorithm that calculates a leaderboard score from the two ratings.

Zamiell commented 6 years ago

That makes sense, thank you!

So would this be a good example of an aggregate rating?

aggregate_trueskill = (3 * clay_trueskill + grass_trueskill) / 4

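A slightly more general form of that aggregate (the 3:1 weighting is this thread's working assumption, not something TrueSkill prescribes):

```python
def aggregate_rating(clay_rating, grass_rating, clay_weight=3.0, grass_weight=1.0):
    """Weighted mean of the per-court ratings; weights are assumptions.

    Each input would typically be the exposed rating for that court,
    so more reliable courts can be weighted more heavily.
    """
    total = clay_weight + grass_weight
    return (clay_weight * clay_rating + grass_weight * grass_rating) / total

print(aggregate_rating(28.0, 20.0))  # 26.0
```

Keeping the weights as parameters makes it easy to tune the clay/grass balance later, once you can measure how predictive each court's rating actually is.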
sublee commented 6 years ago

It depends on your game design. But that's what I imagined.

Zamiell commented 6 years ago

Ok. Thank you so much for taking the time to answer these questions. Hopefully this thread will be useful to someone else (via a Google search) in the future.