lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

E(score) policy head #354

Open lukaszlew opened 3 years ago

lukaszlew commented 3 years ago

I know ideas are cheap, but maybe this one is useful, so I'll leave it here. Self-play games aimed at P(win) are noisy at the end, and yose and territory prediction are not precise. Perhaps training the network for E(score) is not a bad idea; here are possible benefits:

Perhaps this should be a separate policy head, while sharing the territory prediction head.

lightvector commented 3 years ago

Just checking - are you aware that KataGo already trains the network to predict the expected score, and gives partial weight to maximizing the score in addition to maximizing chance to win?

In other words, are you merely suggesting to switch MCTS to be 100% score-maximizing and 0% winrate-maximizing? Or are you proposing something else?

And there is a score-related component of the utility function that is adjusted dynamically, but it is not necessary and KataGo still plays fine in arbitrary komi games without it.
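
To make the "partial weight" above concrete, the blend is conceptually something like the sketch below. The function name, constants, and shaping are illustrative only, not KataGo's actual code or config values:

```python
import math

def blended_utility(winrate, expected_score_lead,
                    score_weight=0.3, score_scale=20.0):
    """Blend P(win) and E(score) into a single MCTS utility (illustrative only).

    winrate:             P(win) in [0, 1] from the value head.
    expected_score_lead: E(score difference) in points from the score head.
    score_weight, score_scale: made-up constants controlling how much a
    point of score is worth relative to winning.
    """
    win_utility = 2.0 * winrate - 1.0  # map [0, 1] -> [-1, 1]
    # Squash the score so that enormous leads stop adding extra utility.
    score_utility = math.tanh(expected_score_lead / score_scale)
    return win_utility + score_weight * score_utility
```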

lightvector commented 3 years ago

Oh, I see the title of the thread - I guess you want to train a policy head that suggests score-maximizing moves. (Sorry, this wasn't obvious from the contents of the post alone - I missed reading the title.)

Training such a policy head means having to generate data for it, which comes at a cost. The same MCTS search probably can't easily be used for both utility functions at the same time (score only, vs winrate blended with score), so for each MCTS search you have to pick only one policy head to generate data for. You probably don't want to mix data within the same game either. It might be a bit more awkward to write the self-play training loop; see the sketch below.
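
For concreteness, "a separate policy head sharing the trunk" would look roughly like this. Layer sizes, the input plane count, and all names here are invented for illustration, not KataGo's actual training code - the hard part is not the architecture but generating the second head's training data:

```python
import torch
import torch.nn as nn

class TwoPolicyNet(nn.Module):
    """Shared trunk with two policy heads (illustration only).

    The 'blended' head would train on normal self-play visit distributions;
    the 'scoremax' head would need its own data from score-maximizing
    searches, which is exactly the extra generation cost described above.
    The channel count and 18 input planes are made-up placeholder values.
    """

    def __init__(self, in_planes=18, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        # Per-point move logits for each head, plus a pass logit from pooled features.
        self.policy_blended = nn.Conv2d(channels, 1, 1)
        self.policy_scoremax = nn.Conv2d(channels, 1, 1)
        self.pass_blended = nn.Linear(channels, 1)
        self.pass_scoremax = nn.Linear(channels, 1)

    def forward(self, x):
        f = self.trunk(x)            # (N, C, 19, 19)
        pooled = f.mean(dim=(2, 3))  # (N, C)
        blended = torch.cat(
            [self.policy_blended(f).flatten(1), self.pass_blended(pooled)], dim=1)
        scoremax = torch.cat(
            [self.policy_scoremax(f).flatten(1), self.pass_scoremax(pooled)], dim=1)
        return blended, scoremax     # two sets of move logits, (N, 361 + 1) each
```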

samecos commented 3 years ago

If we do this, we could run two search methods at the same time while playing the game: one uses UCT and the other uses a selection algorithm similar to epsilon-greedy, with both sets of data based on the same tree structure. Or we could use the two algorithms to search separately, allocating computing resources between them according to set parameters; when the search is over, we combine the two search trees and then select the most-visited node to place the stone. That feels more adventurous.

lightvector commented 3 years ago

Well, indeed talk is cheap. :)

I don't plan to do any experimentation along these lines for now, unless there is a more clearly promising idea, or someone demonstrates that it works on, say, 13x13. Regarding a score-maximizing policy - even if everything works exactly as desired, the best outcome I can reasonably see from dropping the win-loss utility is that KataGo becomes better at maximizing score and worse at winning. (Because if you tell it to care about score more and about winning less... well... it will probably care about score more and about winning less.) And since in real life games, tournaments, and everything else are judged far more on winning than on score, in real life this is undesirable.

And actually, in practice KataGo already doesn't seem to give away many points while ahead or play senseless moves while behind the way purely winrate-trained bots do, so it seems to me there is not so much value to be gained. Many of the benefits listed are already implemented - for example, we already finish games and get precise territory prediction data, and we already don't need dynamic komi. And some of the other benefits are not obvious - for example, I would imagine variance goes up just as much as it goes down: every time expected score makes you shy away from a risky fight where you could lose a lot or win a little, reducing variance, it makes the other player actively pursue that risky fight, increasing variance.

For running multiple search methods - yes, implementing different exploration formulas or criteria might be interesting. It seems incredibly messy to do separate searches and try to merge them, though - how about just doing it in the same search tree, but adding heuristic ways of walking down the tree on some playouts with different modifications to the normal PUCT formula?
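
As a rough illustration of the "same tree, different walk" idea: a toy selection rule where a small fraction of playouts descend using a purely score-based value inside an otherwise normal PUCT formula. Everything here (names, the 10% fraction, the constants) is illustrative, not KataGo's search code:

```python
import math
import random
from dataclasses import dataclass

@dataclass
class Child:
    prior: float            # policy prior for this move from the net
    visits: int = 0
    blended_q: float = 0.0  # mean utility under the normal winrate+score blend
    score_q: float = 0.0    # mean utility under pure expected score

def select_child(children, score_max_fraction=0.1, c_puct=1.5):
    """One step of tree descent over a single shared tree.

    Most playouts rank children with the normal blended utility; a small
    fraction instead use the purely score-based value in the same PUCT
    formula, so both "searches" share visits and statistics.
    """
    use_score_only = random.random() < score_max_fraction
    total_visits = sum(c.visits for c in children)

    def puct(c):
        q = c.score_q if use_score_only else c.blended_q
        u = c_puct * c.prior * math.sqrt(total_visits + 1) / (1 + c.visits)
        return q + u

    return max(children, key=puct)
```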

I do want to take a look at some alternative heuristics one could apply for exploration. Most of them will need help from the neural net though, via training new targets or integrating the net into the search in better ways. Anything "unsophisticated" like epsilon-greedy is likely to be terribly inefficient and harm the strength or performance significantly - it's easy to underestimate how disastrously bad "random" moves are, or moves driven by any unsophisticated hardcoded criteria - pre-neural-net bots were full of these approaches and most of them are useless in comparison to the neural net now.

lukaszlew commented 3 years ago

I'm sorry about not replying earlier - I forgot I had set up a Gmail filter for KataGo notifications.

I actually meant that I would love to have an engine that maximizes expected score completely, ignores P(win), and therefore also ignores komi. Is there a way to set up the training for that? If so, I can dedicate my GPU (and even buy a new one) to it.

I'm not sure what the relevance of epsilon-greedy is, but there is a very good method to explore moves in an E(score) setup - just add normal noise to the expected score and choose the highest. But UCB should also just work.
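
Concretely, I mean something like this toy sketch - perturb each move's predicted score with Gaussian noise and play the argmax (the names and the noise scale are just for illustration):

```python
import random

def pick_move_noisy_score(expected_scores, sigma=2.0):
    """Pick a move for a pure E(score) agent by noising the predicted scores.

    expected_scores: dict mapping each legal move to its predicted score lead.
    sigma: standard deviation of the noise, in points (illustrative value).
    Near-equal moves get explored; clearly worse moves almost never do.
    """
    return max(expected_scores,
               key=lambda m: expected_scores[m] + random.gauss(0.0, sigma))
```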

In real life people don't use KataGo to play that much (except in bot vs bot competitions); they use KataGo to analyze their games. There are even commercial services: ai-sensei.com, zbaduk.com.
For analysis and teaching, P(win) is not that important I think, except for the endgames of very close games like Sai vs Toya Koyo :)

lightvector commented 3 years ago

Maximizing score is actually not very good for analysis either. I'm aware of both ai-sensei and zbaduk, and it is easy to find users sometimes being misled by an over-focus on score when they use those tools.

If you have experience with bot analysis, you will very quickly find it to be a mistake to focus only on the moves that lost you the most points - because they will often come at times when (even at a human level) the game might have been close to over, already won or lost. And they will also often be fairly non-teachable moments, because they involve crazy tactics that are hard to understand and just mean that you need to practice reading. Whereas the simple shape mistake that you make three times per game, and that only loses you 1 point each time, might be an "oh duh, even I recognize that's a more efficient shape in retrospect" moment that is easy to learn and integrate into your play.

I actually think you get better bot analysis by maximizing a blend of winrate and score, like KataGo does now. And potentially, for some cases, by using a slightly earlier/weaker network that is a little less sharp at score maximization. When you are ahead, you get to see the bot demonstrate ways to play solidly and securely while still giving up hardly any points - so still very good style and no big risk of bad habits - but also not deliberately seeking horrible messy fights. When you are behind, you get to see the bot demonstrate ways of complicating, playing a bit faster, etc. - ways you could have created difficulties for the opponent.

Even better would probably be to find some new metrics, or to blend the analysis of different networks together - perhaps you could get the intuitions and move suggestions of a network less far beyond your level, but validated for quality and soundness by the stronger network. That would be fun to experiment with. But anyway, I don't think pure score maximization plays a large part in that.