dexonsmith commented 11 months ago

Ratings currently don't consider komi on small boards, but should, since usually it's the komi that changes (not the number of handicap stones) as handicap increases.

dexonsmith commented 11 months ago

I've started adding this to RatingsMath in a1486cb9b2fd27d9ccaa125896e2f7d54ff948ac, but the individual scripts still need an update to send the board size, ruleset, and komi into calculate_handicap, so I can't test it out yet.

This also relates to #7, since the 19x19 version is essentially equivalent (at least, to the intent of #7, not sure about what's currently in the repo).

dexonsmith commented 11 months ago

(The commit there uses the math from this WIP proposal: https://github.com/dexonsmith/online-go.com/blob/27bcbdc7699cf4fedc072336a4c36ab40897c876/doc/proposal-redesign-small-board-handicap-komi.md ... this is homework for the ratings part of the proposal.)

dexonsmith commented 11 months ago

@anoek, the "ruleset" seems to be missing from the historical ratings database.

Is this possible to fix / get access to?
Does the komi in this database incorporate the handicap komi that AGA and Chinese rules add?

(See also the PR #46.)

Also, there are some rated games in there with massive handicaps. E.g., this 8-stone 9x9 game raised an adjusted_handicap < 50 assertion:

Processing OGS data
     243,876 /   15,123,682 games processed.   92.3s remaining
             size = 9
             komi = -2.5
         handicap = 8
       komi_bonus = 0
adjusted_handicap = 52.25

Indeed, an 8-stone handicap on a 9x9 board is a big advantage. Seems unnecessary to rate this game at all...

BHydden commented 11 months ago

anoek doesn't change the ratings calculations lightly. He also usually handles that himself and does a lot of simming before pushing anything. Just a heads up before you put too much work into this ❤️ love the work you've been doing recently, wouldn't want you to get discouraged 😛

dexonsmith commented 11 months ago

anoek doesn't change the ratings calculations lightly. He also usually handles that himself and does a lot of simming before pushing anything. Just a heads up before you put too much work into this ❤️ love the work you've been doing recently, wouldn't want you to get discouraged 😛

Thanks for the heads up :). Already chatted with him about this, and we need to do something for small boards. Ratings adjustments are currently (and perhaps have always?) treating the small board handicaps as "1-stone == 1-rank" (as-if 19x19), which is completely haywire.

BHydden commented 11 months ago

Cool sounds good 👍 good luck ❤️ I agree 1 stone per rank on small boards is bonkers haha

dexonsmith commented 11 months ago

Some data from running ./analysis/analyze_glicko2_one_game_at_a_time.py (hardcoding "japanese"): compute-handicap-via-komi-baseline.txt compute-handicap-via-komi-small.txt compute-handicap-via-komi-19x19.txt compute-handicap-via-komi-small+19x19.txt

Haven't looked closely, since I'm not sure I'll know how to interpret it.

Always dies for me with this traceback:

Traceback (most recent call last):
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/analyze_glicko2_one_game_at_a_time.py", line 116, in <module>
    tally.print()
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 140, in print
    self.print_self_reported_stats()
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 285, in print_self_reported_stats
    stats = self.get_self_reported_stats()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 339, in get_self_reported_stats
    raise Exception('Failed to find self_repoted_account_links json file')
Exception: Failed to find self_repoted_account_links json file

dexonsmith commented 11 months ago

Here are the compact stats:

| Algorithm name | Stronger wins | h0 | h1 | h2 |
|:---------------|--------------:|---:|---:|--------------:|
| glicko2_one_ga |         68.3% | 68.5% | 67.4% | 67.0% | baseline
| glicko2_one_ga |         68.4% | 68.6% | 67.5% | 67.4% | small
| glicko2_one_ga |         69.0% | 68.7% | 67.9% | 71.9% | 19x19
| glicko2_one_ga |         69.0% | 68.7% | 67.9% | 70.9% | both

Interesting to see a modest improvement in the compact data for 19x19 but not much for small boards... could be the new math isn't quite right, or maybe for some (or all?) of the data the "handicap" value is storing a "handicap rank difference" (not stones) after all.

dexonsmith commented 11 months ago

Other scripts available as of 330bba9e053455f7c47ea3162833d0b732da3967 (I didn't test them but think the updates are correct).

Also thought of another two possibilities:

Might be very few games with small board handicaps, so the data doesn't affect the total much
The dataset might have a mix of games from OGS and other sources. The other sources may be storing "handicap rank difference" in the handicap field, even if OGS is storing "handicap stones".

dexonsmith commented 11 months ago

After 080b385, "both" gets:

| Algorithm name | Stronger wins | h0 | h1 | h2 |
|:---------------|--------------:|---:|---:|--------------:|
| glicko2_one_ga |         69.0% | 68.7% | 67.8% | 70.8% |

Pretty similar.

One thing I'd like to do is "skip" some games as unrate-able (i.e., does not affect the rating), say if the rank adjustment is bigger than 20. Or 9. Or something. Not sure there's an easy way to do that right now though.

dexonsmith commented 11 months ago

One thing I'd like to do is "skip" some games as unrate-able (i.e., does not affect the rating), say if the rank adjustment is bigger than 20. Or 9. Or something. Not sure there's an easy way to do that right now though.

I guess the way to do this is to skip them in the caller.

For 19x19, skip if handicap > 9 (shouldn't be any in the DB...)
For 13x13, skip if handicap > 5 (effective rank diff of 10-15, assuming scaling factor of 2.5-3x)
For 9x9, skip if handicap > 3 (effective rank diff of 12-18, assuming scaling factor of 4-6x... maybe even > 2 would be better)

I'd want to skip these games both for the purposes of:

Adjusting ratings
Computing stats on how effective the ratings are

@anoek, I haven't looked yet at how to do this yet (hopefully I'll pry myself away and won't get to it a few days), but curious if you have thoughts on (a) how this should be structured in the goratings code and (b) whether the number of skipped games is worth printing stats on.

dexonsmith commented 11 months ago

Might also be interesting to see what happens to ratings if ALL small board handicap games are skipped... not the right end result, but a useful baseline.

anoek commented 11 months ago

Yeah given how bad the handicaps are for 9x9 and 13x13, I too am tempted to just throw those away too. Might be worth the experiment to consider them like you're proposing too to see if it can be useful, but it might just be a detriment.

dexonsmith commented 11 months ago

Looks like EGF and AGA datasets just have 19x19 games. Here are the compact results from running on them:

Algorithm name	Stronger wins	h0	h1	h2	dataset / options
glicko2_one_ga	68.5%	69.1%	68.2%	66.5%	aga baseline
glicko2_one_ga	70.0%	69.7%	68.7%	71.1%	aga this branch
glicko2_one_ga	69.5%	68.7%	68.6%	72.0%	egf baseline
glicko2_one_ga	67.9%	67.8%	68.4%	69.6%	egf this branch

Again, assuming Japanese rules for all of them.

What rules does EGF use?

anoek commented 11 months ago

AGA uses AGA rules, EGF I think uses Japanese? There might be some flexibility for both organizations, I'm not entirely sure.

dexonsmith commented 11 months ago

Do you trust the komi values in the AGA and EGF datasets?

Note that the following assertion passes for all games in both datasets:

assert game.komi == 0

(Makes sense for handicap games, but I imagine they use komi for even games?)

anoek commented 11 months ago

Yep I would trust them

dexonsmith commented 11 months ago

Interesting. There are a couple of reasons for games to be ignored in the tallies

        if (result.black_deviation > PROVISIONAL_DEVIATION_CUTOFF or
            result.white_deviation > PROVISIONAL_DEVIATION_CUTOFF):
            self.games_ignored += 1
            return

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return

If I comment that out, I get these results for `--aga` (hardcoding AGA rules and komi=7.5 for even)	Algorithm name	Stronger wins	h0	h1	h2	options
glicko2_one_ga	68.6%	68.2%	68.2%	69.8%	baseline
glicko2_one_ga	79.6%	72.3%	74.5%	83.0%	this branch

git-blame tells me it has been that way since the initial commit in 68d0fff9. Do you remember why we're ignoring those games?

dexonsmith commented 11 months ago

Yep I would trust them

But don't AGA rules say they use komi for even games? Do they skip that in tournaments?

anoek commented 11 months ago

Yep, I might be confused here, are you seeing zeros for komi? That'd be weird. I thought your prior assertion ensured there were no zeros, if there's a value in there I have no reason not to trust them the komi values they provided, as I understand it these are database dumps directly from each organizations datrabases. If there are zeros, then that's not accurate I'd wager, but if they provide a komi I reckon that's as accurate as it can be given that there are probably humans filling out a lot of those values based on whatever tournaments the game came from.

dexonsmith commented 11 months ago

Okay, this is:

--ogs (hardcoding Japanese, trusting komi)
ignoring PROVISIONAL_DEVIATION_CUTOFF (as in HEAD)
NOT ignoring effective handicap bigger than 1

EDIT: actually, I lost track of which data this is. Re-running.

dexonsmith commented 11 months ago

Yep, I might be confused here, are you seeing zeros for komi? That'd be weird. I thought your prior assertion ensured there were no zeros, if there's a value in there I have no reason not to trust them the komi values they provided, as I understand it these are database dumps directly from each organizations datrabases. If there are zeros, then that's not accurate I'd wager, but if they provide a komi I reckon that's as accurate as it can be given that there are probably humans filling out a lot of those values based on whatever tournaments the game came from.

Yeah, both --aga and --egf have all zeroes for komi.

dexonsmith commented 11 months ago

Okay, this is:

--ogs (hardcoding Japanese, trusting komi)

ignoring PROVISIONAL_DEVIATION_CUTOFF (as in HEAD)

NOT ignoring effective handicap bigger than 1

EDIT: actually, I lost track of which data this is. Re-running.

Yeah, I had the wrong code commented out when I was just doing one of them, and got it backwards.

Ignoring PROVISIONAL_DEVIATION_CUTOFF has very little effect.
Ignoring effective handicap bigger than 1 has a huge effect.

Still interesting in why these are being ignored.

(For looking at improvements to the small board analysis, I definitely need to look at effective handicap bigger than 1)

dexonsmith commented 11 months ago

Update: you can ignore all my "baseline" numbers above :/. In my first commit on the branch, I somehow (???) corrupted the get_handicap_adjustment that the baseline measurements use. Reverted that mistake in 7435f266b7b60f906f6bac97a405b67725da1404. Haven't rerun numbers yet.

dexonsmith commented 11 months ago

Haven't rerun numbers yet.

As of 608e551, skipping rating games with effective handicaps bigger than 9.

New numbers:

Algorithm name	Stronger wins	h0	h1	h2	dataset	options
glicko2_one_ga	68.8%	68.6%	68.7%	69.3%	ogs	baseline
glicko2_one_ga	69.0%	68.7%	67.8%	71.2%	ogs	this branch

anoek commented 11 months ago

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return

This code, only used for the analytics part, says

If this is a game between two people with an appropriate handicap (as judged by the outcome of our choice of rating parameters this run) - then include it in the stats. Otherwise discard the result for the purpose of our stats.

In other words, if in our game history we had a 2 stone handicap game between a 5d and a 1kyu, we don't want to tally this into our stats because it'll skew the results. The purpose of these stats is to judge how good our curve fit between rating and ranking is, so to do that we're optimizing on minimizing then difference in win rates between our handicap games and our non handicap games when the appropriate handicap is used in a game.

dexonsmith commented 11 months ago

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return
This code, only used for the analytics part, says

If this is a game between two people with an appropriate handicap (as judged by the outcome of our choice of rating parameters this run) - then include it in the stats. Otherwise discard the result for the purpose of our stats.

In other words, if in our game history we had a 2 stone handicap game between a 5d and a 1kyu, we don't want to tally this into our stats because it'll skew the results. The purpose of these stats is to judge how good our curve fit between rating and ranking is, so to do that we're optimizing on minimizing then difference in win rates between our handicap games and our non handicap games when the appropriate handicap is used in a game.

Okay, that makes sense. I see the value in seeing the curve fit with that data excluded.

But, if that data is just ignored (and we don't look at it anywhere), I feel like it can hide problems.

E.g., when I uncomment it on --ogs, it LOWERS the win rate of the stronger player. Wouldn't we expect including that data to increase the win rate? (Or maybe I'm not understanding what the "stronger player" metric is.)

dexonsmith commented 11 months ago

Here's the data (re-run, since I don't trust the stuff I printed before finding the weird bug I inserted in the baseline):

Algorithm name	Stronger wins	h0	h1	h2	options	ignore?	judgement
glicko2_one_ga	68.8%	68.6%	68.7%	69.3%	baseline	ignore	stones
glicko2_one_ga	68.8%	68.6%	68.7%	69.3%	baseline	ignore	rank difference
glicko2_one_ga	69.0%	68.7%	67.8%	71.2%	this branch	ignore	stones
glicko2_one_ga	68.9%	68.7%	68.5%	70.4%	this branch	ignore	rank difference
glicko2_one_ga	61.8%	60.4%	67.6%	68.6%	baseline	include	stones
glicko2_one_ga	61.8%	60.4%	67.6%	68.6%	baseline	include	rank difference
glicko2_one_ga	61.9%	60.4%	66.8%	70.3%	this branch	include	stones
glicko2_one_ga	61.9%	60.4%	66.8%	70.3%	this branch	include	rank difference

Although, come to think of it, maybe the problem is the 9x9 and 13x13 games. Let me see what happens if I ignore based on computed handicap_rank_difference and report back.

EDIT: updated with the data, where the judgement is based on handicap_rank_difference (6b5a96b). Not much difference.

dexonsmith commented 11 months ago

Also tried adding --size=19 to the command line, just to totally exclude small boards. Still seeing LOWER rates for "stronger wins" when including badly mismatched games. This just doesn't make sense to me. When games are badly mismatched, the stronger player should almost always win.

dexonsmith commented 11 months ago

Interestingly, also found that compact stats are completely ignoring small boards:

        prediction = (
            self.prediction_cost[19][ALL][ALL][ALL] / max(1, self.count[19][ALL][ALL][ALL])
        )
        prediction_h0 = (
            self.prediction_cost[19][ALL][ALL][0] / max(1, self.count[19][ALL][ALL][0])
        )
        prediction_h1 = (
            self.prediction_cost[19][ALL][ALL][1] / max(1, self.count[19][ALL][ALL][1])
        )
        prediction_h2 = (
            self.prediction_cost[19][ALL][ALL][2] / max(1, self.count[19][ALL][ALL][2])
        )

dexonsmith commented 11 months ago

Interestingly, also found that compact stats are completely ignoring small boards:

This goes back to the initial commit in 51c291abe65ae2239468b6eb4afa60977d8ac846.

dexonsmith commented 10 months ago

Trying to page this back in after a few weeks away.

51 landed, which is the basic options. It fails to run because of some very large effective handicap games. See the comments there on what to do about it. (Some of which were implemented in the abandoned #46 that it replaced.)

I'm still super curious about the following:

When collating statistics on the rating quality, we skip games that are "mismatched" — where the effective rank difference between players (after incorporating handicap) is bigger than 1.
I'd expect that including mismatched games in the stats would make it MORE likely that the stronger player wins.
This is what happens on EGF and AGA data. If you include all games in statistics, we get way better numbers for "stronger player wins" (vs. skipping mismatched games).
This is NOT what happens on OGS data. If you include all games in statistics, we get way worse numbers for "stronger player wins" (vs. skipping mismatched games).
IIRC, this broken expectation happens if you only run --size=19... but I should double-check, since it has been a few weeks.

That suggests to me some sort of fundamental problem/weirdness with the OGS data.

Is it dominated by sandbaggers? (unlikely)
Does capping the rank at 25k give us incorrect data on who is stronger?
Are there lots of games where the handicap is recorded incorrectly?
...?

I feel like this discrepancy is important to understand...

anoek commented 10 months ago

Just to note, we're talking about the statistics we use to gauge how well the parameters we've chosen for our rating system, in particular our rating to ranking curve since those are the ones we really twiddle, are performing.

Here's my logic for discarding those other values:

If you have two players that are equal strength, then depending on komi, we expect black to win somewhere in the 50-56% range.

Ideally, if you have a player that is "3 ranks higher", aka "3 stones stronger" than their opponent, and they give their opponent a 3 stone handicap, then we'd expect black to win that same 50-56% of the time.

The primary goal of this repo is to tune the parameters used to fit the rating to ranking curve such that we minimize the divergence of the black win rate for handicap games from that of even games. That is to say, the win rates for all of our handicap 1,2 , 3 etc games where the ranks were 1, 2, 3 apart with white being the stronger player, those should all have black winning ~50-56% of the time. If we have a skew, say black winning 70% of the time or something, we know our ranks are not the right distance apart. For example, if black was winning too much, it would mean the ranks were too close because on average white is giving too much handicap for how strong their rank claims.

Now say we have a handicap 2 game but the rank difference is 5 with white being the stronger player, then we'd expect black to win with some small chance way less than 50%, hence why we don't include them in our statistics that we are using to measure how well our ranking system is at determining good handicap values - it'd just skew the results.

I could see an argument made that if the distribution of games where you had mismatched rank/handicap combinations was somewhat normal, then including those values would still improve the average, but I've been operating under the assumption that it's not normal and has a bit of structure to it since we have years of data of games played with different rating and ranking systems, so you've got bias coming from those systems, then you also have manually specified handicap games which, anecdotally, I suspect are bias towards not providing enough handicap. Hence, throwing all those values out for our purposes here.

dexonsmith commented 10 months ago

Now say we have a handicap 2 game but the rank difference is 5 with white being the stronger player, then we'd expect black to win with some small chance way less than 50%

I agree that is what we'd expect, but it's not what I'm seeing with OGS data. (With EGF and AGA data, that's indeed what we see...)

Instead, with OGS data, including those results makes the stronger player win less often than when excluding them. See the "stronger wins" column from the 8-row table above with an "ignore?" column.

"ignore?=ignore" rows exclude games like you mention. Stronger wins 69% of the time. (Code in-tree.)
"ignore?=include" rows include games like you mention. Stronger wins 62% of the time. (Commented out the "ignore" code.)

That's what I'm puzzled about. The "ignore?=include" rows should have stronger winning 80% (or something), not 62%.

dexonsmith commented 10 months ago

When I get a chance (maybe not until the weekend?) I'll reproduce and post a patch which adds a command-line option to include those results, so you can review and reproduce yourself.

anoek commented 10 months ago

Ok, as per usual I'm a little out of sync and adding to the confusion. There's a few stats. The one that I've pretty much exclusively cared about for optimizing on is the handicap performance, the first set of numbers. Excluding mismatched rank/handicap games from that is important. But the code you are talking about has nothing to do with this, so what I wrote above is moot, sorry for the confusion.

You're looking at the stronger player wins stat. That stat we're not targeting ~50%, I think we'd optimally target consistency, so 62% vs 69% isn't important, it's that it's 62% across the board or 69% across the board. That said, pretty sure I wasn't optimizing on that, it was more just a curiosity and bench marking thing I think.

But on to the actual question, what was the purpose of this block of code and should it remain there, and why when we remove it does our stronger win stat go down and not up?

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return

As for why it's there, it's probably there because I wanted those individual numbers to be comparable to one another and exclude some of the inherent bias that might exist of inadvertent or purposeful sandbagging or helium bagging.

There's also a published EGF stat https://en.wikipedia.org/wiki/Go_ranks_and_ratings that notes that in their rating system a 1k vs 2k in an even game has a 71.3% chance of winning, so while not directly comparable because we're using fractional rankings here and looking at 0 <= R < 1 as opposed to something like round(Black) + handicap - round(White) == 1, it's probably something of a sanity check I was using, noting that the value is somewhat close.

HOWEVER The real elephant in the room, the thing that doesn't pass the smell test, the bug - why when you remove that does our win rate for stronger players not go up? Pretty sure it's that the value being displayed is not in fact the stronger win rate at all like it should be but rather some number we get out of prediction_cost, so all those values are wrong. Specifically https://github.com/online-go/goratings/blob/master/analysis/util/TallyGameAnalytics.py#L144-L155 we're reading from prediction_cost instead of what I think should be predicted_outcome.

Changing that produces some results that align more with our intuition, also the expected values seem better.

As for if that condition should remain or not, I think that just comes down to what we're hoping to learn from that stat in the first place. It's interesting, but unclear if it's particularly useful for tuning.

dexonsmith commented 10 months ago

Thanks, that's helpful!

dexonsmith commented 10 months ago

A few updates.

2-3 more pull requests out:

56 fixes how tallying filters games to be based on effective rank difference (accounting for board size)
57 caps the way tallying uses the expected_win_probability for prediction cost to avoid the math domain error
Optional for this bug: #58 updates the math in expected_win_probability to Glicko-2

I had a look at "black wins" results with those merged.

Stats get closer to 50% for strong players.
Stats drift away a little for weaker players.

I think that's good? We expect handicaps to be most accurate for strong players, who play most consistently.

here's the baseline (after applying those patches): baseline.txt

here's the result with --handicap-rank-difference-{19x19,small} (after applying those patches): options.txt

online-go / goratings

Ratings should consider komi on small boards #45

51 landed, which is the basic options. It fails to run because of some very large effective handicap games. See the comments there on what to do about it. (Some of which were implemented in the abandoned #46 that it replaced.)

56 fixes how tallying filters games to be based on effective rank difference (accounting for board size)

57 caps the way tallying uses the `expected_win_probability` for prediction cost to avoid the math domain error

online-go / goratings

Ratings should consider komi on small boards #45

51 landed, which is the basic options. It fails to run because of some very large effective handicap games. See the comments there on what to do about it. (Some of which were implemented in the abandoned #46 that it replaced.)

56 fixes how tallying filters games to be based on effective rank difference (accounting for board size)

57 caps the way tallying uses the expected_win_probability for prediction cost to avoid the math domain error

57 caps the way tallying uses the `expected_win_probability` for prediction cost to avoid the math domain error