dexonsmith opened 11 months ago
I've started adding this to RatingsMath in a1486cb9b2fd27d9ccaa125896e2f7d54ff948ac, but the individual scripts still need an update to send the board size, ruleset, and komi into `calculate_handicap`, so I can't test it out yet.
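Roughly, the shape of the update each script needs; the parameter names here are illustrative guesses, not necessarily the final RatingsMath signature:

```python
# Illustrative only -- the final calculate_handicap signature in RatingsMath
# may differ; the point is that board size, ruleset, and komi all need to
# flow through from each analysis script.
def calculate_handicap(handicap: int, size: int, ruleset: str, komi: float) -> float:
    """Stand-in for the RatingsMath implementation."""
    raise NotImplementedError

# Each script would then call something like:
#   adjusted = calculate_handicap(game.handicap, game.size, game.rules, game.komi)
```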
This also relates to #7, since the 19x19 version is essentially equivalent (at least, to the intent of #7, not sure about what's currently in the repo).
(The commit there uses the math from this WIP proposal: https://github.com/dexonsmith/online-go.com/blob/27bcbdc7699cf4fedc072336a4c36ab40897c876/doc/proposal-redesign-small-board-handicap-komi.md ... this is homework for the ratings part of the proposal.)
@anoek, the "ruleset" seems to be missing from the historical ratings database. Also, does the `komi` in this database incorporate the handicap komi that AGA and Chinese rules add? (See also PR #46.)
Also, there are some rated games in there with massive handicaps. E.g., this 8-stone 9x9 game tripped an `adjusted_handicap < 50` assertion:
```
Processing OGS data
243,876 / 15,123,682 games processed. 92.3s remaining
size = 9
komi = -2.5
handicap = 8
komi_bonus = 0
adjusted_handicap = 52.25
```
Indeed, an 8-stone handicap on a 9x9 board is a big advantage. Seems unnecessary to rate this game at all...
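For what it's worth, one way to read the 52.25 above; the constants here are reverse-engineered guesses that happen to reproduce the number, not values confirmed against the code:

```python
# Guess at the arithmetic behind the log above; ranks-per-stone and
# ranks-per-point are reverse-engineered, not confirmed.
ranks_per_stone_9x9 = 6.0    # one 9x9 stone worth ~6 ranks
ranks_per_point_9x9 = 1.7    # one point of komi worth ~1.7 ranks on 9x9
adjusted = 8 * ranks_per_stone_9x9 + 2.5 * ranks_per_point_9x9  # komi was -2.5
assert abs(adjusted - 52.25) < 1e-9  # matches the value that tripped the assert
```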
anoek doesn't change the ratings calculations lightly. He also usually handles that himself and does a lot of simming before pushing anything. Just a heads up before you put too much work into this ❤️ love the work you've been doing recently, wouldn't want you to get discouraged 😛
Thanks for the heads up :). Already chatted with him about this, and we need to do something for small boards. Rating adjustments currently (and perhaps always have?) treat small-board handicaps as "1 stone == 1 rank" (as if 19x19), which is completely haywire.
Cool sounds good 👍 good luck ❤️ I agree 1 stone per rank on small boards is bonkers haha
Some data from running `./analysis/analyze_glicko2_one_game_at_a_time.py` (hardcoding `"japanese"`):
- compute-handicap-via-komi-baseline.txt
- compute-handicap-via-komi-small.txt
- compute-handicap-via-komi-19x19.txt
- compute-handicap-via-komi-small+19x19.txt
Haven't looked closely, since I'm not sure I'll know how to interpret it.
Always dies for me with this traceback:
```
Traceback (most recent call last):
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/analyze_glicko2_one_game_at_a_time.py", line 116, in <module>
    tally.print()
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 140, in print
    self.print_self_reported_stats()
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 285, in print_self_reported_stats
    stats = self.get_self_reported_stats()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 339, in get_self_reported_stats
    raise Exception('Failed to find self_repoted_account_links json file')
Exception: Failed to find self_repoted_account_links json file
```
Here are the compact stats:
| Algorithm name | Stronger wins | h0 | h1 | h2 | options |
|:---------------|--------------:|---:|---:|---:|:--------|
| glicko2_one_ga | 68.3% | 68.5% | 67.4% | 67.0% | baseline |
| glicko2_one_ga | 68.4% | 68.6% | 67.5% | 67.4% | small |
| glicko2_one_ga | 69.0% | 68.7% | 67.9% | 71.9% | 19x19 |
| glicko2_one_ga | 69.0% | 68.7% | 67.9% | 70.9% | both |
Interesting to see a modest improvement in the compact data for 19x19 but not much for small boards... could be the new math isn't quite right, or maybe for some (or all?) of the data the "handicap" value is storing a "handicap rank difference" (not stones) after all.
Updates to the other scripts are available as of 330bba9e053455f7c47ea3162833d0b732da3967 (I didn't test them, but I think the updates are correct).
Also thought of another two possibilities:
After 080b385, "both" gets:
| Algorithm name | Stronger wins | h0 | h1 | h2 |
|:---------------|--------------:|---:|---:|---:|
| glicko2_one_ga | 69.0% | 68.7% | 67.8% | 70.8% |
Pretty similar.
One thing I'd like to do is "skip" some games as unrate-able (i.e., does not affect the rating), say if the rank adjustment is bigger than 20. Or 9. Or something. Not sure there's an easy way to do that right now though.
I guess the way to do this is to skip them in the caller.
- `handicap > 9` (shouldn't be any in the DB...)
- `handicap > 5` (effective rank diff of 10-15, assuming scaling factor of 2.5-3x)
- `handicap > 3` (effective rank diff of 12-18, assuming scaling factor of 4-6x... maybe even `> 2` would be better)

I'd want to skip these games both for the purposes of:
@anoek, I haven't looked yet at how to do this (hopefully I'll pry myself away and won't get to it for a few days), but curious if you have thoughts on (a) how this should be structured in the goratings code and (b) whether the number of skipped games is worth printing stats on.
Might also be interesting to see what happens to ratings if ALL small board handicap games are skipped... not the right end result, but a useful baseline.
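A minimal sketch of the caller-side skip, pairing the thresholds above with board sizes per the scaling factors mentioned (my pairing; the list above doesn't label sizes, and none of this is matched to the actual goratings loop):

```python
# Hypothetical caller-side filter; thresholds are the ones floated above.
MAX_RATEABLE_HANDICAP = {9: 3, 13: 5, 19: 9}

def should_rate(size: int, handicap: int) -> bool:
    """Treat games with outsized handicaps as unrate-able."""
    return handicap <= MAX_RATEABLE_HANDICAP.get(size, 9)

# In the processing loop, before feeding a game to Glicko-2:
#   if not should_rate(game.size, game.handicap):
#       games_skipped += 1
#       continue
```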
Yeah, given how bad the handicaps are for 9x9 and 13x13, I'm tempted to just throw those away too. Might be worth the experiment to consider them like you're proposing, to see if it can be useful, but it might just be a detriment.
Looks like the EGF and AGA datasets have just 19x19 games. Here are the compact results from running on them:

| Algorithm name | Stronger wins | h0 | h1 | h2 | dataset / options |
|:---------------|--------------:|---:|---:|---:|:------------------|
| glicko2_one_ga | 68.5% | 69.1% | 68.2% | 66.5% | aga baseline |
| glicko2_one_ga | 70.0% | 69.7% | 68.7% | 71.1% | aga this branch |
| glicko2_one_ga | 69.5% | 68.7% | 68.6% | 72.0% | egf baseline |
| glicko2_one_ga | 67.9% | 67.8% | 68.4% | 69.6% | egf this branch |
Again, assuming Japanese rules for all of them.
What rules does EGF use?
AGA uses AGA rules, EGF I think uses Japanese? There might be some flexibility for both organizations, I'm not entirely sure.
Do you trust the komi values in the AGA and EGF datasets?
Note that the following assertion passes for all games in both datasets:

```python
assert game.komi == 0
```

(Makes sense for handicap games, but I imagine they use komi for even games?)
Yep I would trust them
Interesting. There are a couple of reasons for games to be ignored in the tallies:

```python
# Reason 1: either player's rating is still provisional.
if (result.black_deviation > PROVISIONAL_DEVIATION_CUTOFF or
        result.white_deviation > PROVISIONAL_DEVIATION_CUTOFF):
    self.games_ignored += 1
    return

# Reason 2: the handicap doesn't match the players' rank difference.
if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
    self.games_ignored += 1
    return
```
If I comment that out, I get these results for `--aga` (hardcoding AGA rules and komi=7.5 for even):

| Algorithm name | Stronger wins | h0 | h1 | h2 | options |
|:---------------|--------------:|---:|---:|---:|:--------|
| glicko2_one_ga | 68.6% | 68.2% | 68.2% | 69.8% | baseline |
| glicko2_one_ga | 79.6% | 72.3% | 74.5% | 83.0% | this branch |
git-blame tells me it has been that way since the initial commit in 68d0fff9. Do you remember why we're ignoring those games?
> Yep I would trust them
But don't AGA rules say they use komi for even games? Do they skip that in tournaments?
Yep, I might be confused here. Are you seeing zeros for komi? That'd be weird; I thought your prior assertion ensured there were no zeros. If there's a value in there, I have no reason not to trust the komi values they provided; as I understand it, these are dumps directly from each organization's database. If there are zeros, that's not accurate, I'd wager. But if they provide a komi, I reckon it's as accurate as it can be, given that there are probably humans filling out a lot of those values based on whatever tournament the game came from.
Okay, this is:

- `--ogs` (hardcoding Japanese, trusting komi)
- ignoring `PROVISIONAL_DEVIATION_CUTOFF` (as in HEAD)

EDIT: actually, I lost track of which data this is. Re-running.
> are you seeing zeros for komi? That'd be weird.
Yeah, both `--aga` and `--egf` have all zeroes for komi.
> Okay, this is:
>
> - `--ogs` (hardcoding Japanese, trusting komi)
> - ignoring `PROVISIONAL_DEVIATION_CUTOFF` (as in HEAD)
> - NOT ignoring effective handicap bigger than 1
>
> EDIT: actually, I lost track of which data this is. Re-running.
Yeah, I had the wrong code commented out when I was just doing one of them, and got it backwards. `PROVISIONAL_DEVIATION_CUTOFF` has very little effect. Still interested in why these games are being ignored.
(For looking at improvements to the small board analysis, I definitely need to look at effective handicap bigger than 1)
Update: you can ignore all my "baseline" numbers above :/. In my first commit on the branch, I somehow (???) corrupted the `get_handicap_adjustment` that the baseline measurements use. Reverted that mistake in 7435f266b7b60f906f6bac97a405b67725da1404. Haven't rerun the numbers yet.
As of 608e551, skipping rating games with effective handicaps bigger than 9.
New numbers:
| Algorithm name | Stronger wins | h0 | h1 | h2 | dataset | options |
|:---------------|--------------:|---:|---:|---:|:--------|:--------|
| glicko2_one_ga | 68.8% | 68.6% | 68.7% | 69.3% | ogs | baseline |
| glicko2_one_ga | 69.0% | 68.7% | 67.8% | 71.2% | ogs | this branch |
```python
if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
    self.games_ignored += 1
    return
```

This code, only used for the analytics part, says:

> If this is a game between two people with an appropriate handicap (as judged by the outcome of our choice of rating parameters this run), then include it in the stats. Otherwise discard the result for the purpose of our stats.

In other words, if in our game history we had a 2-stone handicap game between a 5d and a 1k, we don't want to tally it into our stats because it'll skew the results. The purpose of these stats is to judge how good our curve fit between rating and ranking is, so to do that we're optimizing on minimizing the difference in win rates between our handicap games and our non-handicap games when the appropriate handicap is used in a game.
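A quick worked example of that filter, assuming the usual integer rank encoding where 30 is 1d (so 29 is 1k and 34 is 5d; my reading of the convention, not verified against the repo):

```python
# 1k (29) taking 2 stones from a 5d (34): the "effective" gap is 3 ranks,
# so the game is excluded from the tally.
black_rank, white_rank, handicap = 29, 34, 2
effective_gap = abs(black_rank + handicap - white_rank)  # |29 + 2 - 34| == 3
assert effective_gap > 1  # mismatched handicap: ignored for stats purposes
```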
Okay, that makes sense. I see the value in seeing the curve fit with that data excluded.
But, if that data is just ignored (and we don't look at it anywhere), I feel like it can hide problems.
E.g., when I disable that check for `--ogs` (i.e., include those games), it LOWERS the win rate of the stronger player. Wouldn't we expect including that data to increase the win rate? (Or maybe I'm not understanding what the "stronger player" metric is.)
Here's the data (re-run, since I don't trust the stuff I printed before finding the weird bug I inserted in the baseline):
| Algorithm name | Stronger wins | h0 | h1 | h2 | options | ignore? | judgement |
|:---------------|--------------:|---:|---:|---:|:--------|:--------|:----------|
| glicko2_one_ga | 68.8% | 68.6% | 68.7% | 69.3% | baseline | ignore | stones |
| glicko2_one_ga | 68.8% | 68.6% | 68.7% | 69.3% | baseline | ignore | rank difference |
| glicko2_one_ga | 69.0% | 68.7% | 67.8% | 71.2% | this branch | ignore | stones |
| glicko2_one_ga | 68.9% | 68.7% | 68.5% | 70.4% | this branch | ignore | rank difference |
| glicko2_one_ga | 61.8% | 60.4% | 67.6% | 68.6% | baseline | include | stones |
| glicko2_one_ga | 61.8% | 60.4% | 67.6% | 68.6% | baseline | include | rank difference |
| glicko2_one_ga | 61.9% | 60.4% | 66.8% | 70.3% | this branch | include | stones |
| glicko2_one_ga | 61.9% | 60.4% | 66.8% | 70.3% | this branch | include | rank difference |
Although, come to think of it, maybe the problem is the 9x9 and 13x13 games. Let me see what happens if I ignore based on the computed `handicap_rank_difference` and report back.

EDIT: updated with the data, where the judgement is based on `handicap_rank_difference` (6b5a96b). Not much difference.
Also tried adding `--size=19` to the command line, just to totally exclude small boards. Still seeing LOWER rates for "stronger wins" when including badly mismatched games. This just doesn't make sense to me. When games are badly mismatched, the stronger player should almost always win.
Interestingly, also found that compact stats are completely ignoring small boards:
```python
prediction = (
    self.prediction_cost[19][ALL][ALL][ALL] / max(1, self.count[19][ALL][ALL][ALL])
)
prediction_h0 = (
    self.prediction_cost[19][ALL][ALL][0] / max(1, self.count[19][ALL][ALL][0])
)
prediction_h1 = (
    self.prediction_cost[19][ALL][ALL][1] / max(1, self.count[19][ALL][ALL][1])
)
prediction_h2 = (
    self.prediction_cost[19][ALL][ALL][2] / max(1, self.count[19][ALL][ALL][2])
)
```
This goes back to the initial commit in 51c291abe65ae2239468b6eb4afa60977d8ac846.
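If the intent were to include small boards, I'd guess the fix looks something like using the catch-all bucket for the size index too (assuming, as the surrounding code suggests, that the first index is board size and `ALL` is the aggregate key):

```python
# Hypothetical: aggregate over all board sizes rather than hardcoding 19.
# Assumes the first index is board size and ALL is the catch-all bucket.
prediction = (
    self.prediction_cost[ALL][ALL][ALL][ALL] / max(1, self.count[ALL][ALL][ALL][ALL])
)
```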
Trying to page this back in after a few weeks away.
I'm still super curious about the following: including badly mismatched games LOWERS the "stronger wins" rate on the OGS data, even with `--size=19`... but I should double-check, since it has been a few weeks. That suggests to me some sort of fundamental problem/weirdness with the OGS data. I feel like this discrepancy is important to understand...
Just to note, we're talking about the statistics we use to gauge how well the parameters we've chosen for our rating system are performing, in particular the rating-to-ranking curve, since those are the parameters we really twiddle.
Here's my logic for discarding those other values:
If you have two players that are equal strength, then depending on komi, we expect black to win somewhere in the 50-56% range.
Ideally, if you have a player that is "3 ranks higher", aka "3 stones stronger" than their opponent, and they give their opponent a 3 stone handicap, then we'd expect black to win that same 50-56% of the time.
The primary goal of this repo is to tune the parameters used to fit the rating-to-ranking curve such that we minimize the divergence of the black win rate for handicap games from that of even games. That is to say, for all of our handicap 1, 2, 3, etc. games where the ranks were 1, 2, 3 apart with white being the stronger player, black should win that same ~50-56% of the time. If we see a skew, say black winning 70% of the time, we know our ranks are not the right distance apart. For example, if black was winning too much, it would mean the ranks were too close together, because on average white is giving too much handicap for how strong their rank claims they are.
Now say we have a handicap 2 game but the rank difference is 5 with white being the stronger player, then we'd expect black to win with some small chance way less than 50%, hence why we don't include them in the statistics we're using to measure how good our ranking system is at determining good handicap values - they'd just skew the results.
I could see an argument made that if the distribution of games with mismatched rank/handicap combinations were somewhat normal, then including those values would still improve the average. But I've been operating under the assumption that it's not normal and has a bit of structure to it, since we have years of data of games played with different rating and ranking systems, so you've got bias coming from those systems; you also have manually specified handicap games which, anecdotally, I suspect are biased towards not providing enough handicap. Hence, throwing all those values out for our purposes here.
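To illustrate the tuning objective with made-up numbers: if even games give black ~53%, the optimizer wants the handicap-n/gap-n buckets to sit near that same figure, so a divergence like the following should be driven toward zero:

```python
# All numbers here are invented, purely to illustrate the objective described
# above: handicap-n games at rank gap n should match the even-game black win rate.
even_black_winrate = 0.53
handicap_black_winrate = {1: 0.55, 2: 0.58, 3: 0.61}  # hypothetical observations
divergence = sum(abs(w - even_black_winrate) for w in handicap_black_winrate.values())
# A good rating-to-ranking curve fit keeps this divergence small.
```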
> Now say we have a handicap 2 game but the rank difference is 5 with white being the stronger player, then we'd expect black to win with some small chance way less than 50%
I agree that is what we'd expect, but it's not what I'm seeing with OGS data. (With EGF and AGA data, that's indeed what we see...)
Instead, with OGS data, including those results makes the stronger player win less often than when excluding them. See the "stronger wins" column from the 8-row table above with an "ignore?" column.
That's what I'm puzzled about. The "ignore?=include" rows should have stronger winning 80% (or something), not 62%.
When I get a chance (maybe not until the weekend?) I'll reproduce and post a patch which adds a command-line option to include those results, so you can review and reproduce yourself.
Ok, as per usual I'm a little out of sync and adding to the confusion. There are a few stats here. The one that I've pretty much exclusively cared about for optimizing is the handicap performance, the first set of numbers. Excluding mismatched rank/handicap games from that is important. But the code you are talking about has nothing to do with this, so what I wrote above is moot; sorry for the confusion.
You're looking at the "stronger player wins" stat. For that stat we're not targeting ~50%; I think we'd optimally target consistency, so 62% vs 69% isn't important in itself, what matters is that it's 62% across the board or 69% across the board. That said, pretty sure I wasn't optimizing on that; it was more of a curiosity and benchmarking thing, I think.
But on to the actual question: what was the purpose of this block of code, should it remain there, and why does our stronger-win stat go down rather than up when we remove it?

```python
if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
    self.games_ignored += 1
    return
```
As for why it's there, it's probably there because I wanted those individual numbers to be comparable to one another and to exclude some of the inherent bias that might come from inadvertent or purposeful sandbagging or helium bagging.
There's also a published EGF stat (https://en.wikipedia.org/wiki/Go_ranks_and_ratings) noting that in their rating system, a 1k playing a 2k in an even game has a 71.3% chance of winning. That's not directly comparable, because we're using fractional rankings here and looking at `0 <= R < 1` as opposed to something like `round(Black) + handicap - round(White) == 1`, but it's probably something of a sanity check I was using, noting that the value is somewhat close.
HOWEVER: the real elephant in the room, the thing that doesn't pass the smell test, the bug. Why, when you remove that check, does our win rate for stronger players not go up? Pretty sure it's that the value being displayed is not in fact the stronger win rate at all like it should be, but rather some number we get out of `prediction_cost`, so all those values are wrong. Specifically, in https://github.com/online-go/goratings/blob/master/analysis/util/TallyGameAnalytics.py#L144-L155 we're reading from `prediction_cost` instead of what I think should be `predicted_outcome`.
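A sketch of that change, assuming `predicted_outcome` is tallied with the same four-level index structure as `prediction_cost` (only the source array changes):

```python
# Sketch of the fix: read the "stronger wins" figures from predicted_outcome
# rather than prediction_cost. Index structure assumed to match the
# prediction_cost tallies shown earlier.
prediction = (
    self.predicted_outcome[19][ALL][ALL][ALL] / max(1, self.count[19][ALL][ALL][ALL])
)
```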
Changing that produces results that align better with our intuition; the expected values seem better too.

As for whether that condition should remain, I think that just comes down to what we're hoping to learn from that stat in the first place. It's interesting, but unclear if it's particularly useful for tuning.
Thanks, that's helpful!
A few updates.
2-3 more pull requests out:

- `expected_win_probability` for prediction cost, to avoid the math domain error
- `expected_win_probability` added to Glicko-2

I had a look at "black wins" results with those merged.
I think that's good? We expect handicaps to be most accurate for strong players, who play most consistently.
Here's the baseline (after applying those patches): baseline.txt

Here's the result with `--handicap-rank-difference-{19x19,small}` (after applying those patches): options.txt
Ratings currently don't consider komi on small boards, but should, since usually it's the komi that changes (not the number of handicap stones) as handicap increases.
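A minimal sketch of what folding komi in might look like; every constant below is an illustrative guess (loosely based on the scaling factors discussed above), not the proposal's actual values:

```python
# Illustrative sketch: convert stones plus komi deviation into an effective
# rank handicap. All constants are guesses, not the proposal's values.
RANKS_PER_STONE = {9: 5.0, 13: 2.75, 19: 1.0}   # from the ~4-6x / ~2.5-3x factors above
POINTS_PER_RANK = {9: 1.5, 13: 3.0, 19: 12.0}   # rough guesses at a rank's point value

def effective_rank_handicap(size: int, stones: int, komi: float, even_komi: float = 6.5) -> float:
    """Stones plus the rank value of any komi deviation from an even game."""
    rank_from_stones = stones * RANKS_PER_STONE[size]
    rank_from_komi = (even_komi - komi) / POINTS_PER_RANK[size]
    return rank_from_stones + rank_from_komi
```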