
Winrate: KataGo vs. AlphaGo #740

ashinpan opened this issue 1 year ago

This is not an actual issue, but a question that I am curious about.

On the AlphaGo Teach web app, the initial winrate for Black is 46.8%, which means White's initial winrate is 53.2%.

KataGo, on the other hand, gives White an initial winrate of around 61% (as seen in KaTrain).

So there is a large discrepancy between the two engines' evaluations. Why? Is it because they use different algorithms?

lightvector commented 1 year ago

Thanks, yes this is an interesting observation!

So first, let me answer a related question: why is White's winrate generally so large at all?

The current belief among many experts and many people who have worked on computer Go is that 7.5 komi is slightly too large under area scoring, and gives an advantage to White. A lot of this is guessed from computer winrates, but it isn't solely computer play that provides this evidence; there are also heuristic arguments.

Under area scoring, the komi that would result in a draw under optimal play is expected to be an odd integer (the margin can normally be even only if there is a seki with an odd number of dame, and such games are a clear minority). If this guess is correct, then the fair komi is very likely 7. 5 is too small: it conflicts with historical statistics from professional games indicating that even 5.5 komi is too low, and it is extremely hard to reconcile with miai-counting theory, because it would imply that many opening moves have implausibly small values. And 9 is too large: it exceeds current komis in pro games that appear to be fair, and, again via miai-counting theory, it would imply that some opening moves are worth more, relative to moves with known quantifiable values, than most pros or bots would judge.
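
To spell out the parity argument, here is a quick arithmetic sketch; nothing in it beyond the 361 points of the 19x19 board:

```latex
% Area scoring with no seki: every point is eventually owned by
% Black or White, so B + W = 361 and the margin is
\[
  M \;=\; B - W \;=\; 2B - 361,
\]
% which is odd, since 361 is odd. A seki leaving d shared dame
% unowned gives B + W = 361 - d, hence
\[
  M \;=\; 2B - 361 + d,
\]
% which is even exactly when d is odd. So a komi that permits a
% draw under optimal play is almost certainly an odd integer.
```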

Anyways, if it is true that 7 is fair, then over time as computer play improves, you would expect winrates for komi 7.5 to increasingly favor White, as the quality of play rises and makes the half-point advantage more likely to be decisive. And we see that. KataGo's reported winrate for White under komi 7.5 has very slowly increased over time. In earlier parts of distributed training, it was closer to 60% than 61%. Before distributed training started, it was around 59%. And around when KataGo was transitioning to 40-block nets and starting to surpass Leela Zero, it was around 58%.

Okay, now the second question: why is AlphaGo Teach's winrate for White not similarly large?

One answer is probably that very small differences in hyperparameters and self-play settings can affect this value, in a way that doesn't exactly track strength. For example, suppose a bot's self-play uses much higher noise and/or exploration in the early opening, increasing the chance of opening moves that give up that tiny advantage. With appropriate and sufficient training, the bot could still be equally strong in matches with the extra noise turned off; but since the neural net learns its evaluations from self-play, and in self-play the noisy moves are likely to give up the slight advantage, the learned winrate would be much closer to 50%. The opposite could happen if parameters are tuned the other way. Winrate sharpness can also be affected by other things (resign thresholds, search parameters, data weighting).
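
To make that effect concrete, here is a toy Monte Carlo model. This is not KataGo's training pipeline; the 61% clean-play baseline and the blunder probabilities are made-up numbers purely for illustration:

```python
import random

def selfplay_opening_value(p_noise_blunder, n_games=100_000, w_base=0.61):
    """Toy model: under clean play, White wins with probability w_base.
    With probability p_noise_blunder, exploration noise in the opening
    makes White give up the half-point edge, flipping the expectation.
    The value net's target for the empty board is the empirical White
    winrate over these self-play games."""
    white_wins = 0
    for _ in range(n_games):
        w = w_base
        if random.random() < p_noise_blunder:
            w = 1.0 - w  # noisy opening hands the half-point edge to Black
        white_wins += random.random() < w
    return white_wins / n_games

print(selfplay_opening_value(0.05))  # ~0.60: low noise, value stays sharp
print(selfplay_opening_value(0.40))  # ~0.52: heavy noise drags it toward 50%
```

An equally strong net trained on the noisier games would honestly report a value much nearer 50% for the empty board, because that is what its own games actually showed.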

One bot that comes to mind is ELF OpenGo, which in the past was notorious for having very sharp evaluations, with winrates that swung a lot on even tiny advantages and disadvantages, while stronger bots would give much more relaxed evaluations. So winrate sharpness definitely varies between bots for reasons other than just strength.

However, I think there's one more factor. If you play around with the AlphaGo Teach website, you find that the winrates are not very consistent between moves. For example, tengen on the first move is 41.5%, but if you click on it, White's two best listed replies score 39.6% and 40.0%. Now, it's normal for bots to change their judgment between moves: one move later, they search deeper, so their judgment updates. But this is on the large end of what you would expect ten million playouts of search to do for very early opening positions, where there are no sharp tactics yet that could swing the judgment much.

Indeed, it seems pretty common that when you click on a move, its winrate sits well between the winrate of the opponent's best reply and those of worse replies. So I'd guess that when they generated this data, they also used a fairly high exploration parameter in the MCTS. By exploring a lot, sending many playouts down moves that aren't as good, and using those playouts as part of the winrate average, the search would "blur" the evaluations a bit, and one would also expect this to pull the winrates slightly closer to 50% on average.
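
Here is a toy sketch of that blurring: a bare-bones PUCT-style visit allocation (not AlphaGo's or KataGo's actual search; the child winrates and priors are invented) showing how a larger exploration constant drags the visit-weighted root average toward the weaker moves:

```python
import math

def root_value(child_q, child_prior, n_total, c_puct):
    """Toy PUCT allocation: hand out n_total playouts greedily by the
    PUCT score, pretending each playout returns the child's true value,
    then report the visit-weighted average -- the number the search
    reports as the root winrate."""
    n = len(child_q)
    visits = [0] * n
    value_sum = [0.0] * n
    for _ in range(n_total):
        sqrt_total = math.sqrt(1 + sum(visits))
        scores = [
            (value_sum[i] / visits[i] if visits[i] else child_q[i])
            + c_puct * child_prior[i] * sqrt_total / (1 + visits[i])
            for i in range(n)
        ]
        best = max(range(n), key=scores.__getitem__)
        visits[best] += 1
        value_sum[best] += child_q[best]
    return sum(value_sum) / n_total

# Invented children: one best move for White, progressively weaker ones.
q = [0.61, 0.50, 0.45, 0.40]
prior = [0.4, 0.3, 0.2, 0.1]
print(root_value(q, prior, 10_000, c_puct=0.5))  # ~0.61: visits concentrate
print(root_value(q, prior, 10_000, c_puct=5.0))  # ~0.58: blurred toward 50%
```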

Anyways, I'd say those are the main factors involved. And, all else being equal, if the guess about 7 komi being fair is right, we should expect White's winrate under 7.5 komi for KataGo to keep growing in the future, very very slowly.

ashinpan commented 1 year ago

@lightvector Thanks a lot for your patient explanation. Let us wait and see, then.

hwj-111 commented 1 year ago

I also thought about this question a long time ago (during the Leela Zero era). Here is my post: https://github.com/leela-zero/leela-zero/issues/2486