online-go / goratings

This repository contains the (future) official rating and ranking system for online-go.com, as well as analysis code and data to develop that system and compare it to other reference systems.

Ratings should consider komi on small boards #45

Open dexonsmith opened 11 months ago

dexonsmith commented 11 months ago

Ratings currently don't consider komi on small boards, but should, since usually it's the komi that changes (not the number of handicap stones) as handicap increases.

dexonsmith commented 11 months ago

I've started adding this to RatingsMath in a1486cb9b2fd27d9ccaa125896e2f7d54ff948ac, but the individual scripts still need an update to send the board size, ruleset, and komi into calculate_handicap, so I can't test it out yet.
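
For reference, here's a minimal sketch of the komi-aware adjustment I have in mind (illustrative only; the per-size constants and the exact `calculate_handicap` signature are assumptions for demonstration, not the actual RatingsMath code):

    # Illustrative sketch only: these constants are assumptions for
    # demonstration, not the values used in RatingsMath.
    FAIR_KOMI = {"japanese": 6.5, "aga": 7.5}           # assumed fair komi
    POINTS_PER_RANK = {9: 2.0, 13: 6.0, 19: 13.0}       # assumed conversion
    STONE_VALUE_IN_RANKS = {9: 6.0, 13: 3.0, 19: 1.0}   # assumed stone value

    def calculate_handicap(size: int, ruleset: str, komi: float, handicap: int) -> float:
        # Komi below the fair value is extra points for black; convert the
        # surplus into fractional ranks for this board size.
        komi_bonus = (FAIR_KOMI[ruleset] - komi) / POINTS_PER_RANK[size]
        # On small boards, each physical stone is worth several ranks.
        return handicap * STONE_VALUE_IN_RANKS[size] + komi_bonus

With these made-up constants, an extreme game like an 8-stone 9x9 game at komi -2.5 would come out around 52 effective ranks, which is the kind of value that should clearly not be rated as-is.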

This also relates to #7, since the 19x19 version is essentially equivalent (at least, to the intent of #7, not sure about what's currently in the repo).

dexonsmith commented 11 months ago

(The commit there uses the math from this WIP proposal: https://github.com/dexonsmith/online-go.com/blob/27bcbdc7699cf4fedc072336a4c36ab40897c876/doc/proposal-redesign-small-board-handicap-komi.md ... this is homework for the ratings part of the proposal.)

dexonsmith commented 11 months ago

@anoek, the "ruleset" seems to be missing from the historical ratings database.

(See also the PR #46.)

Also, there are some rated games in there with massive handicaps. E.g., this 8-stone 9x9 game tripped an `adjusted_handicap < 50` assertion:

    Processing OGS data
         243,876 /   15,123,682 games processed.   92.3s remaining
                 size = 9
                 komi = -2.5
             handicap = 8
           komi_bonus = 0
    adjusted_handicap = 52.25

Indeed, an 8-stone handicap on a 9x9 board is a big advantage. Seems unnecessary to rate this game at all...

BHydden commented 11 months ago

anoek doesn't change the ratings calculations lightly. He also usually handles that himself and does a lot of simming before pushing anything. Just a heads up before you put too much work into this ❤️ love the work you've been doing recently, wouldn't want you to get discouraged 😛

dexonsmith commented 11 months ago

> anoek doesn't change the ratings calculations lightly. He also usually handles that himself and does a lot of simming before pushing anything. Just a heads up before you put too much work into this ❤️ love the work you've been doing recently, wouldn't want you to get discouraged 😛

Thanks for the heads up :). Already chatted with him about this, and we need to do something for small boards. Ratings adjustments currently treat (and perhaps have always treated?) small-board handicaps as "1 stone == 1 rank" (as if they were 19x19), which is completely haywire.

BHydden commented 11 months ago

Cool sounds good 👍 good luck ❤️ I agree 1 stone per rank on small boards is bonkers haha

dexonsmith commented 11 months ago

Some data from running `./analysis/analyze_glicko2_one_game_at_a_time.py` (hardcoding "japanese"):

- compute-handicap-via-komi-baseline.txt
- compute-handicap-via-komi-small.txt
- compute-handicap-via-komi-19x19.txt
- compute-handicap-via-komi-small+19x19.txt

Haven't looked closely, since I'm not sure I'll know how to interpret it.

Always dies for me with this traceback:

    Traceback (most recent call last):
      File "/Users/dexonsmith/Repos/online-go/goratings/analysis/analyze_glicko2_one_game_at_a_time.py", line 116, in <module>
        tally.print()
      File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 140, in print
        self.print_self_reported_stats()
      File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 285, in print_self_reported_stats
        stats = self.get_self_reported_stats()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 339, in get_self_reported_stats
        raise Exception('Failed to find self_repoted_account_links json file')
    Exception: Failed to find self_repoted_account_links json file

dexonsmith commented 11 months ago

Here are the compact stats:

| Algorithm name | Stronger wins |    h0 |    h1 |    h2 | Options  |
|:---------------|--------------:|------:|------:|------:|:---------|
| glicko2_one_ga |         68.3% | 68.5% | 67.4% | 67.0% | baseline |
| glicko2_one_ga |         68.4% | 68.6% | 67.5% | 67.4% | small    |
| glicko2_one_ga |         69.0% | 68.7% | 67.9% | 71.9% | 19x19    |
| glicko2_one_ga |         69.0% | 68.7% | 67.9% | 70.9% | both     |

Interesting to see a modest improvement in the compact data for 19x19 but not much for small boards... could be the new math isn't quite right, or maybe for some (or all?) of the data the "handicap" value is storing a "handicap rank difference" (not stones) after all.

dexonsmith commented 11 months ago

The other scripts are updated as of 330bba9e053455f7c47ea3162833d0b732da3967 (I didn't test them, but I think the updates are correct).

Also thought of another two possibilities:

dexonsmith commented 11 months ago

After 080b385, "both" gets:

| Algorithm name | Stronger wins | h0 | h1 | h2 |
|:---------------|--------------:|---:|---:|--------------:|
| glicko2_one_ga |         69.0% | 68.7% | 67.8% | 70.8% |

Pretty similar.

One thing I'd like to do is "skip" some games as unrateable (i.e., they would not affect the ratings), say if the rank adjustment is bigger than 20. Or 9. Or something. Not sure there's an easy way to do that right now, though.

dexonsmith commented 11 months ago

> One thing I'd like to do is "skip" some games as unrateable (i.e., they would not affect the ratings), say if the rank adjustment is bigger than 20. Or 9. Or something. Not sure there's an easy way to do that right now, though.

I guess the way to do this is to skip them in the caller.
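
Roughly what I have in mind (a sketch; `MAX_EFFECTIVE_HANDICAP`, the `rateable_games` helper, and the loop shape are placeholders, not actual goratings code):

    # Sketch: filter unrateable games in the caller before they reach the
    # rating update. MAX_EFFECTIVE_HANDICAP is a placeholder cutoff.
    MAX_EFFECTIVE_HANDICAP = 9

    def rateable_games(games):
        skipped = 0
        for game in games:
            effective = calculate_handicap(
                game.size, game.ruleset, game.komi, game.handicap
            )
            if abs(effective) > MAX_EFFECTIVE_HANDICAP:
                skipped += 1  # unrateable: leave both players' ratings alone
                continue
            yield game
        print("skipped %d unrateable games" % skipped)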

I'd want to skip these games both for the purposes of:

@anoek, I haven't looked yet at how to do this (hopefully I'll pry myself away and won't get to it for a few days), but I'm curious if you have thoughts on (a) how this should be structured in the goratings code and (b) whether the number of skipped games is worth printing stats on.

dexonsmith commented 11 months ago

Might also be interesting to see what happens to ratings if ALL small board handicap games are skipped... not the right end result, but a useful baseline.

anoek commented 11 months ago

Yeah, given how bad the handicaps are for 9x9 and 13x13, I too am tempted to just throw those away. Might be worth the experiment to consider them like you're proposing, to see if it can be useful, but it might just be a detriment.

dexonsmith commented 11 months ago

Looks like the EGF and AGA datasets just have 19x19 games. Here are the compact results from running on them:

| Algorithm name | Stronger wins |    h0 |    h1 |    h2 | Dataset / options |
|:---------------|--------------:|------:|------:|------:|:------------------|
| glicko2_one_ga |         68.5% | 69.1% | 68.2% | 66.5% | aga baseline      |
| glicko2_one_ga |         70.0% | 69.7% | 68.7% | 71.1% | aga this branch   |
| glicko2_one_ga |         69.5% | 68.7% | 68.6% | 72.0% | egf baseline      |
| glicko2_one_ga |         67.9% | 67.8% | 68.4% | 69.6% | egf this branch   |

Again, assuming Japanese rules for all of them.

What rules does EGF use?

anoek commented 11 months ago

AGA uses AGA rules, EGF I think uses Japanese? There might be some flexibility for both organizations, I'm not entirely sure.

dexonsmith commented 11 months ago

Do you trust the komi values in the AGA and EGF datasets?

Note that the following assertion passes for all games in both datasets:

    assert game.komi == 0

(Makes sense for handicap games, but I imagine they use komi for even games?)

anoek commented 11 months ago

Yep I would trust them

dexonsmith commented 11 months ago

Interesting. There are a couple of reasons for games to be ignored in the tallies:

        if (result.black_deviation > PROVISIONAL_DEVIATION_CUTOFF or
            result.white_deviation > PROVISIONAL_DEVIATION_CUTOFF):
            self.games_ignored += 1
            return

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return

If I comment that out, I get these results for `--aga` (hardcoding AGA rules and komi=7.5 for even games):

| Algorithm name | Stronger wins |    h0 |    h1 |    h2 | Options     |
|:---------------|--------------:|------:|------:|------:|:------------|
| glicko2_one_ga |         68.6% | 68.2% | 68.2% | 69.8% | baseline    |
| glicko2_one_ga |         79.6% | 72.3% | 74.5% | 83.0% | this branch |

git-blame tells me it has been that way since the initial commit in 68d0fff9. Do you remember why we're ignoring those games?

dexonsmith commented 11 months ago

> Yep I would trust them

But don't AGA rules say they use komi for even games? Do they skip that in tournaments?

anoek commented 11 months ago

Yep, I might be confused here. Are you seeing zeros for komi? That'd be weird. I thought your prior assertion ensured there were no zeros. If there's a value in there, I have no reason not to trust the komi values they provided; as I understand it, these are database dumps directly from each organization's databases. If there are zeros, then that's not accurate, I'd wager, but if they provide a komi I reckon that's as accurate as it can be, given that there are probably humans filling out a lot of those values based on whatever tournaments the games came from.

dexonsmith commented 11 months ago

Okay, this is:

- `--ogs` (hardcoding Japanese, trusting komi)
- ignoring PROVISIONAL_DEVIATION_CUTOFF (as in HEAD)
- NOT ignoring effective handicap bigger than 1

EDIT: actually, I lost track of which data this is. Re-running.

dexonsmith commented 11 months ago

> Yep, I might be confused here. Are you seeing zeros for komi? That'd be weird. I thought your prior assertion ensured there were no zeros. If there's a value in there, I have no reason not to trust the komi values they provided; as I understand it, these are database dumps directly from each organization's databases. If there are zeros, then that's not accurate, I'd wager, but if they provide a komi I reckon that's as accurate as it can be, given that there are probably humans filling out a lot of those values based on whatever tournaments the games came from.

Yeah, both --aga and --egf have all zeroes for komi.

dexonsmith commented 11 months ago

> Okay, this is:
>
> - `--ogs` (hardcoding Japanese, trusting komi)
> - ignoring PROVISIONAL_DEVIATION_CUTOFF (as in HEAD)
> - NOT ignoring effective handicap bigger than 1
>
> EDIT: actually, I lost track of which data this is. Re-running.

Yeah, I had the wrong code commented out when I was just doing one of them, and got it backwards.

Still interested in why these games are being ignored.

(For looking at improvements to the small-board analysis, I definitely need to look at effective handicaps bigger than 1.)

dexonsmith commented 11 months ago

Update: you can ignore all my "baseline" numbers above :/. In my first commit on the branch, I somehow (???) corrupted the get_handicap_adjustment that the baseline measurements use. Reverted that mistake in 7435f266b7b60f906f6bac97a405b67725da1404. Haven't rerun numbers yet.

dexonsmith commented 11 months ago

> Haven't rerun numbers yet.

As of 608e551, games with effective handicaps bigger than 9 are skipped for rating purposes.

New numbers:

| Algorithm name | Stronger wins |    h0 |    h1 |    h2 | Dataset | Options     |
|:---------------|--------------:|------:|------:|------:|:--------|:------------|
| glicko2_one_ga |         68.8% | 68.6% | 68.7% | 69.3% | ogs     | baseline    |
| glicko2_one_ga |         69.0% | 68.7% | 67.8% | 71.2% | ogs     | this branch |

anoek commented 11 months ago

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return

This code, only used for the analytics part, says

If this is a game between two people with an appropriate handicap (as judged by the outcome of our choice of rating parameters this run) - then include it in the stats. Otherwise discard the result for the purpose of our stats.

In other words, if in our game history we had a 2-stone handicap game between a 5d and a 1kyu, we don't want to tally this into our stats because it'll skew the results. The purpose of these stats is to judge how good our curve fit between rating and ranking is, so to do that we're optimizing on minimizing the difference in win rates between our handicap games and our non-handicap games when the appropriate handicap is used in a game.

dexonsmith commented 11 months ago
>         if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
>             self.games_ignored += 1
>             return
>
> This code, only used for the analytics part, says
>
> If this is a game between two people with an appropriate handicap (as judged by the outcome of our choice of rating parameters this run) - then include it in the stats. Otherwise discard the result for the purpose of our stats.
>
> In other words, if in our game history we had a 2-stone handicap game between a 5d and a 1kyu, we don't want to tally this into our stats because it'll skew the results. The purpose of these stats is to judge how good our curve fit between rating and ranking is, so to do that we're optimizing on minimizing the difference in win rates between our handicap games and our non-handicap games when the appropriate handicap is used in a game.

Okay, that makes sense. I see the value in seeing the curve fit with that data excluded.

But, if that data is just ignored (and we don't look at it anywhere), I feel like it can hide problems.

E.g., when I comment that check out on --ogs (i.e., include those games), it LOWERS the win rate of the stronger player. Wouldn't we expect including that data to increase the win rate? (Or maybe I'm not understanding what the "stronger player" metric is.)

dexonsmith commented 11 months ago

Here's the data (re-run, since I don't trust the stuff I printed before finding the weird bug I inserted in the baseline):

| Algorithm name | Stronger wins |    h0 |    h1 |    h2 | Options     | Ignore? | Judgement       |
|:---------------|--------------:|------:|------:|------:|:------------|:--------|:----------------|
| glicko2_one_ga |         68.8% | 68.6% | 68.7% | 69.3% | baseline    | ignore  | stones          |
| glicko2_one_ga |         68.8% | 68.6% | 68.7% | 69.3% | baseline    | ignore  | rank difference |
| glicko2_one_ga |         69.0% | 68.7% | 67.8% | 71.2% | this branch | ignore  | stones          |
| glicko2_one_ga |         68.9% | 68.7% | 68.5% | 70.4% | this branch | ignore  | rank difference |
| glicko2_one_ga |         61.8% | 60.4% | 67.6% | 68.6% | baseline    | include | stones          |
| glicko2_one_ga |         61.8% | 60.4% | 67.6% | 68.6% | baseline    | include | rank difference |
| glicko2_one_ga |         61.9% | 60.4% | 66.8% | 70.3% | this branch | include | stones          |
| glicko2_one_ga |         61.9% | 60.4% | 66.8% | 70.3% | this branch | include | rank difference |

Although, come to think of it, maybe the problem is the 9x9 and 13x13 games. Let me see what happens if I ignore based on computed handicap_rank_difference and report back.

EDIT: updated with the data, where the judgement is based on handicap_rank_difference (6b5a96b). Not much difference.

dexonsmith commented 11 months ago

Also tried adding --size=19 to the command line, just to totally exclude small boards. Still seeing LOWER rates for "stronger wins" when including badly mismatched games. This just doesn't make sense to me. When games are badly mismatched, the stronger player should almost always win.

dexonsmith commented 11 months ago

Interestingly, also found that compact stats are completely ignoring small boards:

        prediction = (
            self.prediction_cost[19][ALL][ALL][ALL] / max(1, self.count[19][ALL][ALL][ALL])
        )
        prediction_h0 = (
            self.prediction_cost[19][ALL][ALL][0] / max(1, self.count[19][ALL][ALL][0])
        )
        prediction_h1 = (
            self.prediction_cost[19][ALL][ALL][1] / max(1, self.count[19][ALL][ALL][1])
        )
        prediction_h2 = (
            self.prediction_cost[19][ALL][ALL][2] / max(1, self.count[19][ALL][ALL][2])
        )
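
For comparison, an all-sizes variant might look something like this (a sketch mirroring the excerpt above, not actual repo code; it assumes the same dict nesting and that 9, 13, and 19 are the tallied sizes):

    # Sketch: include small boards by summing over board sizes explicitly.
    sizes = [9, 13, 19]
    prediction = sum(
        self.prediction_cost[size][ALL][ALL][ALL] for size in sizes
    ) / max(1, sum(self.count[size][ALL][ALL][ALL] for size in sizes))
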
dexonsmith commented 11 months ago

> Interestingly, also found that compact stats are completely ignoring small boards:

This goes back to the initial commit in 51c291abe65ae2239468b6eb4afa60977d8ac846.

dexonsmith commented 10 months ago

Trying to page this back in after a few weeks away.

#51 landed, which adds the basic options. It fails to run because of some very large effective-handicap games; see the comments there on what to do about it. (Some of which were implemented in the abandoned #46 that it replaced.)

I'm still super curious about the following: including the badly mismatched games LOWERS the "stronger wins" rate with the OGS data, when intuitively it should raise it.

That suggests to me some sort of fundamental problem/weirdness with the OGS data.

I feel like this discrepancy is important to understand...

anoek commented 10 months ago

Just to note, we're talking about the statistics we use to gauge how well the parameters we've chosen for our rating system are performing, in particular our rating-to-ranking curve, since those are the parameters we really twiddle.

Here's my logic for discarding those other values:

If you have two players that are equal strength, then depending on komi, we expect black to win somewhere in the 50-56% range.

Ideally, if you have a player that is "3 ranks higher", aka "3 stones stronger" than their opponent, and they give their opponent a 3 stone handicap, then we'd expect black to win that same 50-56% of the time.

The primary goal of this repo is to tune the parameters used to fit the rating-to-ranking curve such that we minimize the divergence of the black win rate for handicap games from that of even games. That is to say, for all of our handicap 1, 2, 3, etc. games where the ranks were 1, 2, 3 apart with white being the stronger player, black should win that same ~50-56% of the time. If we have a skew, say black winning 70% of the time or something, we know our ranks are not the right distance apart. For example, if black was winning too much, it would mean the ranks were too close together, because on average white is giving too much handicap for how strong their rank claims they are.

Now say we have a handicap 2 game but the rank difference is 5 with white being the stronger player, then we'd expect black to win with some small chance way less than 50%, hence why we don't include them in our statistics that we are using to measure how well our ranking system is at determining good handicap values - it'd just skew the results.
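
To put a rough number on that, here's a toy Elo-style logistic model; the 100-points-per-rank scale is an assumption for illustration, not the actual goratings curve:

    def expected_win(rank_gap: float, points_per_rank: float = 100.0) -> float:
        # Toy logistic model: probability the favored side wins, assuming
        # one rank is worth `points_per_rank` Elo-like points.
        return 1.0 / (1.0 + 10.0 ** (-rank_gap * points_per_rank / 400.0))

    # Properly handicapped game: ~0 residual rank gap, so black wins ~50%.
    print(expected_win(0.0))        # 0.5
    # 5-rank gap with only a 2-stone handicap: ~3 ranks uncompensated,
    # so black wins well under 50%.
    print(1.0 - expected_win(3.0))  # ~0.15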

I could see an argument that if the distribution of games with mismatched rank/handicap combinations were somewhat normal, then including those values would still improve the average. But I've been operating under the assumption that it's not normal and has a bit of structure to it, since we have years of data of games played with different rating and ranking systems (so you've got bias coming from those systems), and you also have manually specified handicap games which, anecdotally, I suspect are biased towards not providing enough handicap. Hence, throwing all those values out for our purposes here.

dexonsmith commented 10 months ago

> Now say we have a handicap 2 game but the rank difference is 5 with white being the stronger player, then we'd expect black to win with some small chance way less than 50%

I agree that is what we'd expect, but it's not what I'm seeing with OGS data. (With EGF and AGA data, that's indeed what we see...)

Instead, with OGS data, including those results makes the stronger player win less often than when excluding them. See the "stronger wins" column from the 8-row table above with an "ignore?" column.

That's what I'm puzzled about. The "ignore?=include" rows should have stronger winning 80% (or something), not 62%.

dexonsmith commented 10 months ago

When I get a chance (maybe not until the weekend?) I'll reproduce and post a patch which adds a command-line option to include those results, so you can review and reproduce yourself.

anoek commented 10 months ago

Ok, as per usual I'm a little out of sync and adding to the confusion. There are a few stats. The one that I've pretty much exclusively cared about for optimizing is the handicap performance, the first set of numbers. Excluding mismatched rank/handicap games from that is important. But the code you are talking about has nothing to do with this, so what I wrote above is moot; sorry for the confusion.

You're looking at the "stronger player wins" stat. For that stat we're not targeting ~50%; I think we'd optimally target consistency, so 62% vs. 69% isn't important, what matters is that it's 62% across the board or 69% across the board. That said, I'm pretty sure I wasn't optimizing on that; it was more of a curiosity and benchmarking thing, I think.

But on to the actual question: what was the purpose of this block of code, should it remain there, and why, when we remove it, does our stronger-win stat go down and not up?

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return

As for why it's there, it's probably there because I wanted those individual numbers to be comparable to one another and exclude some of the inherent bias that might exist of inadvertent or purposeful sandbagging or helium bagging.

There's also a published EGF stat (https://en.wikipedia.org/wiki/Go_ranks_and_ratings) noting that in their rating system a 1k has a 71.3% chance of winning an even game against a 2k. It's not directly comparable, because we're using fractional rankings here and looking at 0 <= R < 1 as opposed to something like round(Black) + handicap - round(White) == 1, but it's probably something of a sanity check I was using, noting that the value is somewhat close.

HOWEVER, the real elephant in the room, the thing that doesn't pass the smell test, the bug: why, when you remove that, does our win rate for stronger players not go up? Pretty sure it's that the value being displayed is not in fact the stronger win rate at all like it should be, but rather some number we get out of prediction_cost, so all those values are wrong. Specifically, at https://github.com/online-go/goratings/blob/master/analysis/util/TallyGameAnalytics.py#L144-L155 we're reading from prediction_cost instead of what I think should be predicted_outcome.
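
For the first of the four expressions, that substitution would look like this (a sketch; it assumes predicted_outcome has the same nesting as prediction_cost, and the h0/h1/h2 variants would change the same way):

    # Sketch of the fix described above: read predicted_outcome (the
    # tally of correct predictions) rather than prediction_cost.
    prediction = (
        self.predicted_outcome[19][ALL][ALL][ALL]
        / max(1, self.count[19][ALL][ALL][ALL])
    )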

Changing that produces some results that align more with our intuition, also the expected values seem better.

As for if that condition should remain or not, I think that just comes down to what we're hoping to learn from that stat in the first place. It's interesting, but unclear if it's particularly useful for tuning.

dexonsmith commented 10 months ago

Thanks, that's helpful!

dexonsmith commented 10 months ago

A few updates.

2-3 more pull requests out:

I had a look at "black wins" results with those merged.

I think that's good? We expect handicaps to be most accurate for strong players, who play most consistently.

Here's the baseline (after applying those patches): baseline.txt

Here's the result with --handicap-rank-difference-{19x19,small} (after applying those patches): options.txt