lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

Combinatorial Game Theory - NN temperature head proposal. #363

Open lukaszlew opened 3 years ago

lukaszlew commented 3 years ago

Endgame in Go often decomposes into independent regions. Each region can be assigned a mean and a temperature, as defined by CGT (Combinatorial Game Theory). Playing in a region with maximal temperature is an excellent strategy. It would be great if we could ask NNs to learn to predict temperatures, since this is a modular metric (separate subgames do not influence each other).

There is a simple RL setup, known as the 'Economist view', for generating data with temperatures. I can define it if there is interest in evaluating and possibly pursuing this direction. We would have two networks returning real numbers. The policy head could be replaced by, or combined with, a temperature head.

Papers on CGT tend to focus on getting the very last point, with a fairly scary focus on theory and scary notation. But the idea and definition of temperature (and more generally of the thermograph) are very practical.


lightvector commented 3 years ago

Is your thought that you would artificially construct a generator for positions that have the required decomposition into independent pieces? Otherwise you have the major problem of telling when a position has decomposed or not, into how many pieces, and which spots on the board belong to which of those pieces. And in a nontrivial number of games, they don't decompose for a very long time, even far into the deep endgame, because of ko. (Yes, I know miai-counting theory and CGT can handle kos in many cases, I've read Bill Spight's posts, but with a potential future ko hot enough to make the position 'hyperactive', you often really do get genuine global non-independence.)

lukaszlew commented 3 years ago

Sorry for the late reply. I've now disabled my gmail filter.

I did not mean to decompose the position precisely at all. I meant to assume that the game can be approximated by a sum of games, but to calculate (as training data) only the temperature of the sum of the games - the whole board.

This would involve running a global search on every training position both with Black and with White to move, and treating the difference as the CGT temperature (label). The average ownership could be the CGT mean ownership. This is sufficient for positions dominated by gote moves, and maybe not a bad first approximation. I believe there is a nice way to take sente into account (globally).
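A minimal sketch of how these labels could be produced with KataGo's JSON analysis engine (the query/response fields used below - `initialStones`, `initialPlayer`, `rootInfo.scoreLead`, `ownership` - exist in that protocol, but the exact plumbing here is illustrative, not tested):

```python
import json
import subprocess

# Start KataGo's analysis engine (model/config paths are placeholders).
katago = subprocess.Popen(
    ["katago", "analysis", "-model", "model.bin.gz", "-config", "analysis.cfg"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

def query(stones, to_move):
    """One whole-board search with the given side to move.
    `stones` is a list of [player, location] pairs, e.g. [["B", "Q16"]]."""
    q = {"id": to_move, "initialStones": stones, "initialPlayer": to_move,
         "moves": [], "rules": "japanese", "komi": 6.5,
         "boardXSize": 19, "boardYSize": 19, "includeOwnership": True}
    katago.stdin.write(json.dumps(q) + "\n")
    katago.stdin.flush()
    return json.loads(katago.stdout.readline())

def labels(stones):
    """Proposed training labels: the gap between the Black-first and
    White-first results as the temperature label (for a simple gote
    switch this gap is twice the CGT temperature), and the two-sided
    average ownership as the mean-ownership label."""
    b, w = query(stones, "B"), query(stones, "W")
    temperature = b["rootInfo"]["scoreLead"] - w["rootInfo"]["scoreLead"]
    mean_ownership = [(ob + ow) / 2
                      for ob, ow in zip(b["ownership"], w["ownership"])]
    return temperature, mean_ownership
```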

lukaszlew commented 3 years ago

Also, playing on a few boards simultaneously could be an easy way of providing a signal with better resolution.

lightvector commented 3 years ago

> This would involve running a global search on every training position both with Black and with White to move, and treating the difference as the CGT temperature (label). The average ownership could be the CGT mean ownership. This is sufficient for positions dominated by gote moves, and maybe not a bad first approximation.

This seems dubious to me. The whole point of CGT is to be able to analyze a local subgame tree on its own, and furthermore to make fractional and even infinitesimal distinctions in its value that affect its arithmetic when added with other positions. So this "not a bad first approximation" actually seems to defeat the entire purpose of CGT, right? Such a global target will lose all the rich fractional and infinitesimal information, and the information about distinct local subgames, which is the entire value-added of CGT.

The neural net already has extensive training in estimating the value of a huge variety of whole-board positions with either player to move first, in a way that is global, ignores local decomposition and fractional values and all other rich CGT-like information, and is instead just one numeric value. This is simply the training it already does all the time.

So if you would already be happy with a training output that is global, ignores local decomposition and fractional values and all other rich CGT-like information, and is instead just one numeric value, then I propose you take the existing already-trained KataGo, make two calls to the neural net, one with each side to move, and take the difference. It costs you at most a factor of 2, and you already have it without needing a new run.

lukaszlew commented 3 years ago

The accurate modelling that CGT offers (fractional values, infinitesimals for getting the last move, etc.) is not very useful, for the reasons that you identified.

What I'm proposing is using another part of CGT, applicable to hot games where the value of a move is larger than 1 point. I'm proposing to calculate the thermograph of the position and use it as training data for the neural network. For instance, the mean and temperature are the coordinates of the tip of the thermograph.

You don't need a decomposition into subgames to calculate a thermograph. One can do it on the whole board. Yet strategies based on thermographs compose well: e.g. play the hottest game, add the means, take the maximum of the temperatures, etc. A sketch of these composition rules is below.
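To make the composition rules concrete, here is a minimal sketch for the simplest case only, pure gote switches (real thermographs also handle sente, deeper trees, and - imperfectly - ko):

```python
from dataclasses import dataclass

@dataclass
class Switch:
    """A pure gote switch: left_stop is the score (from Black's view)
    if Black plays here first, right_stop if White plays here first."""
    left_stop: float
    right_stop: float

    @property
    def mean(self) -> float:  # score coordinate of the thermograph's tip
        return (self.left_stop + self.right_stop) / 2

    @property
    def temperature(self) -> float:  # height of the thermograph's tip
        return (self.left_stop - self.right_stop) / 2

# Composition rules for a sum of independent regions:
regions = [Switch(6, 0), Switch(3, -1)]
total_mean = sum(g.mean for g in regions)            # means add
hottest = max(regions, key=lambda g: g.temperature)  # play the hottest region
temp_bound = hottest.temperature                     # temperature of the sum
                                                     # is at most this maximum
```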

Importantly: the mean (per-point ownership) does not depend on who is about to move; the thermograph models both the Black-first and White-first cases. And it takes into account whether a move is sente or not.

I am hoping that if we ask the DNN to learn thermographs, and in particular stable ownership, it will be a slightly easier task than predicting globally best moves.

Also, there is no way to calculate the temperature of a position using normal search right now.

lightvector commented 3 years ago

I suspect that anything like this is a sufficiently imperfect approximation that using it as a replacement will harm the strength significantly. Pure thermography doesn't do very well with tedomari, which is a big source of value in the late midgame and endgame, and ko is also a very big deal if you watch AI games - I would say the majority of games have nontrivial kos affecting the results and creating global interactions.

Your proposal - run search for both sides and take the difference and/or average - also doesn't calculate the thermograph. It calculates a point estimate of the temperature of the whole board. But the thermograph is about more than just the current difference between passing and not passing.

Why can't you calculate it right now? Just calculate it by exactly the same means that you propose for producing the training target. If all you want is the literal difference between passing and not, then run a search in the current position, flip the player to move, run another search, and take the difference in predicted lead. For a slightly better result, do a binary search on komi for each side to move. Yes, this is some work, but it is exactly the work you would be doing anyway to produce this training target, so I don't see why you need the training target at all - just do it with the neural nets right now.
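A sketch of that recipe (here `predicted_lead` is a placeholder standing in for running a KataGo search at a given komi with a given side to move, not a real API):

```python
def fair_komi(predicted_lead, to_move, lo=-150.0, hi=150.0, tol=0.25):
    """Binary search for the komi at which the game is even with
    `to_move` playing first. `predicted_lead(to_move, komi)` should
    return Black's predicted lead from a search; it decreases
    monotonically as komi increases."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if predicted_lead(to_move, mid) > 0:
            lo = mid  # Black still ahead: the fair komi is higher
        else:
            hi = mid
    return (lo + hi) / 2

def temperature_estimate(predicted_lead):
    """Whole-board temperature estimate: half the gap between the fair
    komi with Black to move and the fair komi with White to move."""
    return (fair_komi(predicted_lead, "B") - fair_komi(predicted_lead, "W")) / 2
```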

lukaszlew commented 3 years ago

I see your points. As I understand thermography, I think the value of tedomari can be modeled well (i.e. add the temperature drop to the value of the move), and two-sided search trees are still valid (moves that are locally optimal are globally optimal), but I'm not 100% sure. Ko indeed breaks that up.

Instead of trying to advertise a small step, I'll take a step back and present my context and the big picture, and why I believe this is a good direction. I'm interested in your opinion on this topic, and I'll ask for it at the end.

In my Go research I have two interesting results.

Result 1: I believe that a large part of Go is a game where separation into sub-games is meaningful. I believe that because when I play, this is the way I analyze the game. When I looked at the content of the search tree in my engine (more than 10 years ago, when 3x3 playout patterns were a big thing), I saw that the MCTS global search was researching the same moves, say kikashi A and answer B, in most of the sub-trees. I concluded that it would be a good idea if the main tree were extended with local trees for such kikashi. I attempted to model the global tree search as a sum, say a global tree plus a small two-intersection sub-game. That was literal summing: every move modified the global tree, and maybe the sub-game tree. The value of a position was the sum of the values in the subtrees. This was essentially linear regression with variables stored in tree nodes (with an additional variance parameter on every variable, analogous to a global MCTS visit count).

I could not get it to work. In the end I proved (and was very surprised) that it is impossible: while linear additive models can represent the optimal value function, they will diverge when updated with standard TD learning. I described it here in section 3. Of course you can just decrease the update parameter (say, the learning rate of the linear model), effectively averaging out the oscillation, and hope for the best, but this does not work well at all.

The root of the problem turned out to be that the linear model (sum of many game trees) cannot represent the policy: the order in which the subgames are played out.

I was able to fix it by summing cgt-style games, though. The thermograph approximation I used had 4 parameters: mean, temperature, and the left and right stops. From a current search we can get only the left and right stops.

Since then I believe that thermographs, or approximations of them, are the right way to model positions. I would not be surprised if the net, before some of the global pooling layers, encodes cgt-style information into the activation vectors, having rediscovered all of that.

Result 2 (the first time I discuss it with someone other than Remi Coulom):

Before I had to finish my experiments, I had one nice success: I associated with each 3x3 subarea of the board a separate game tree (there was no global tree in that experiment). In other words, for every location on the board I had a small tree with up to 3^9 nodes carrying some cgt-style stats. This ensemble of trees was able to read out (explore) a ~10-move ladder, propagate the result back to the first move, and choose or avoid chasing a bad ladder. Due to the nature of the trees, moves outside of all the 3x3 patterns used in the ladder did not disrupt the knowledge, and the ladder would be played out perfectly in every playout.

There was no global tree. Please note that the entirety of the playout was using information from sub-trees, not only some prefix of the playout that fits in a global tree.
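To give a flavor of the bookkeeping (a reconstruction for illustration, not the original code): each location owns a table keyed by the local 3x3 state, which is enough to index the "up to 3^9 nodes" of cgt-style stats.

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2
SIZE = 19

def pattern_code(board: np.ndarray, x: int, y: int) -> int:
    """Base-3 code of the 3x3 neighborhood around (x, y). Off-board
    points are folded into EMPTY here for brevity; a real engine would
    use a fourth 'edge' color."""
    code = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            c = board[ny, nx] if 0 <= nx < SIZE and 0 <= ny < SIZE else EMPTY
            code = code * 3 + int(c)
    return code  # in range [0, 3**9)

# One stats table per location: local 3x3 state -> cgt-style stats
# (e.g. visit counts, left/right stops) for the sub-tree rooted there.
local_trees = [[{} for _ in range(SIZE)] for _ in range(SIZE)]
```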

That's my context. I still believe that pursuing the addition of sub-trees can reduce the exponential complexity and inefficiencies of global search. I am convinced that cgt-style values are essential, and while the NN might have rediscovered them internally, they are not available to the tree search.

Perhaps it is better to try to give the general tool of "searching" to the NN (I'm thinking of something like a sequence model, but on trees instead of sequences), but I don't know how to do that. So I thought that exposing cgt-style values to the existing search tree might be easier.

What do you think about these two results? What do you think about future of search in games?

lightvector commented 3 years ago

After thinking about it, I don't feel particularly excited by either of the two results you gave. My feeling is that to focus too much on these is to be fooled that one's simplified model is actually the reality, rather than a highly imperfect model. And one that gets increasingly imperfect the more you try to approach the highest levels, relatively speaking - because as you approach optimal play, the gap between reality and your model, with the heavy simplifying assumptions it makes, becomes a larger and larger part of your error, and it is error you can never eliminate while working within that model.

The thing that I find amazing is how sensitive to context neural nets are, and how much this affects their play at the highest levels. See https://lifein19x19.com/viewtopic.php?p=261738#p261738 for a very simple example of how context affects the play (note: life in 19's SSL cert appears to have expired yesterday, but you can click through the browser warning to see the page).

Regarding your "ladder" result that seems again like an overly specialized case and unrepresentative of the general problem. Try playing with a position like image in Lizzie or similar positions arising out of various other pincer joseki. For the right, Black has options of R13 or S14 or wedge at Q14 or leave it alone. Also sometimes black will need to connect, with either S16 or S17. For the right, white may block S14, or leave it open, that is 2 options as well. For the top, black has Q17 or Q18. For the top, white may slide at R18, or N18, or M18. For the outside, White may jump O13, or not. If white does, black may or may not peep at P12. After white's jump at O13, white may want to attach on the outside at M15, or not, to induce black.

If you play through some variations, you will find that the valuation of these possibilities depends in a highly complicated way on which other moves have been exchanged or not. Every one of the moves I mentioned shows up as best or near-best in some variation, and as bad or inferior in another. Also, the preference between these possibilities changes if you vary who owns the adjacent corners, or change the nature of that ownership (different corner enclosures, or extensions that stretch different lengths down the side towards this position). I don't think local patterns or valuations are sufficient to capture this complexity, thermographs or not - your ladder case is a very special case, not the general case.

Also, almost every game between top bots (like, more than 90% of them) seems to involve ko or the threat of ko, often two or three kos over the course of the game. You can also sometimes see KataGo's move suggestions change if you stick a thousand-year ko in the corner of the board - the policy across the entire rest of the board sometimes shifts and starts emphasizing ko-threat creation and destruction tactics in addition to the normal endgame - again showing some fascinating global context sensitivity.

And often by the time that the position is actually decomposable, a deep MCTS search can already determine the winner with good confidence - the game has already been decided before that, when the position was extremely non-trivial to decompose.

My overall impression is that focusing on *human-engineered* ways of composing and decomposing positions and modeling them as approximately independent is misguided. It feels to me like modern bots are well beyond the level where such methods can work - simplifying away any true global interactions, or making simplifying assumptions about independence that aren't actually true of real midgame positions, already introduces unacceptable levels of error. Or, if not, then they will be beyond that level soon.

lightvector commented 3 years ago

Or the short version: I suspect that given the level that bots are at, and the way that these top-level AI games actually seem to work, approaching search in this way feels like too much focus on analyzing "ideal spherical cow on a frictionless surface in a vacuum" and is too far away from reality to be easily useful.

You see this also with how RAVE, killer-move-like heuristics, and other such heuristics are no longer so useful with these huge, heavy neural nets. The neural nets are already too much better than these heuristics - even when a heuristic captures an essence of something the neural net still fails at, it is still too crude in all other ways to be an overall benefit!

I emphasized "human-engineered" above because I think there is still plenty of room for "non-human-engineered" techniques, i.e. ones that aren't some crude thing a human decided on, like "average result of the same move location", or "3x3 pattern", or "thermograph ignoring ko with XYZ way to approximately decompose positions". I don't know yet what those techniques are, though - that's for future research.

lukaszlew commented 3 years ago

The decomposition of search is indeed not always applicable. Usually one situation is hot enough to grab all the attention of the global tree search. Yet not so rarely there are positions where another part of the board seems hot before searching and is determined to be cold only after searching. Without heuristics (NNs), a large seki or a barely-killed dead group is an excellent example: if it is large enough, there will be a 2-move search sequence searched at every node in the tree (N*M complexity). NNs can remove many such situations, but not all. The main technique in existing engines is a bonus to local moves in MCTS; a sketch of what I mean is below. Does KataGo do that?
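For concreteness, an illustrative sketch of such a locality bonus in PUCT selection (my own illustration, not anything from an actual engine; the `child` fields and constants are assumptions):

```python
import math

def chebyshev(a, b):
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def puct_with_locality(child, parent_visits, last_move,
                       c_puct=1.1, local_bonus=0.10, radius=2):
    """PUCT selection score with a locality bonus: the prior of any move
    within `radius` (Chebyshev distance) of the opponent's last move is
    inflated by `local_bonus`. `child` is assumed to carry .move,
    .prior, .visits, and .total_value."""
    prior = child.prior
    if last_move is not None and chebyshev(child.move, last_move) <= radius:
        prior = min(1.0, prior + local_bonus)
    q = child.total_value / child.visits if child.visits > 0 else 0.0
    u = c_puct * prior * math.sqrt(max(1, parent_visits)) / (1 + child.visits)
    return q + u  # descend into the child maximizing this score
```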

I am convinced that decomposition is the key to much better performance. But our discussion has convinced me that if we are after strength, explicit CGT approaches are not the way to go. A transformer-like attention mechanism probably is, i.e. a way for the net itself to control the search and decomposition is probably better. Such a thing does not exist yet, but I have some ideas.

lightvector commented 3 years ago

Thanks, yes I agree that for at least just strength, capturing repeated structure in the tree is an interesting unsolved problem.

I've actually experimented with various biases toward local moves and other hacks in KataGo, but I found nothing that produced an improvement. I've also tried things inspired by the killer-move heuristic, in terms of adjusting policy priors, but, perhaps surprisingly, again found no improvement. I think again the issue is that even though there are many X for which the neural net + MCTS architecture is clearly incapable of X (e.g. X = generalizing an unexpected good move across branches), very often the net is still so good that hardcoding a heuristic that adds X also adds so much false-positive noise that it fails to produce an improvement. Techniques that used to be good in old MCTS engines before we had AlphaZero nets, despite their false-positive noise, are no longer good enough now that the baseline is so much stronger.

After many failed tries, the single successful thing I've found so far (recently!) is here: https://github.com/lightvector/KataGo/blob/master/docs/KataGoMethods.md#subtree-value-bias-correction But even having tried it myself and finally found one improvement, I'm not generally optimistic about the method as a long-term way to make forward progress - it still feels quite fragile and un-general. Your point about giving the net "control" indeed seems more promising on a fundamental level.

Regarding analysis and aid for human understanding, separately from trying to improve strength and converge more efficiently to optimality - do you think it would still be interesting, before trying to adjust the training method, to just go ahead and see whether evaluating the position and the ownership map with existing nets, once with each side to play, and taking the average, already produces at least a partial step toward what you would find useful? And specifically, to see in practical usage what the biggest remaining flaws are? I don't see any reason why an AI tool or service couldn't already go ahead and send two queries and try this (e.g. katrain, or lizzie, or q5Go, or zbaduk, or ai sensei, whatever).

Edit: Or even, to send more than 2 queries, to do a "two-sided" search recursively down a few levels?

alreadydone commented 3 years ago

I agree that reusing computation across the search tree is the key to improvement in performance. I wonder if you've considered learning a panorama representation of the whole board that can predict the policy distributions (and values) for multiple subsequent positions ("perturbations") with little additional computation. If we manage to do this with reasonable accuracy for 10 subsequent positions among the most likely variations, we could speed up playouts a lot (by 10x along non-branching subtrees of the search tree; if the branching factor is larger, then the variations are shallower, the perturbations smaller, and the predictions likely more accurate): certain nodes in the tree are evaluated as "root" positions and some as perturbations, and a new node is evaluated as a root iff its nearest ancestor evaluated as a root has 10+ visits.[1] Intuitively, one forward pass through the NN should be able to form a grand plan involving several subsequent moves instead of just one move, possibly by implicitly decomposing the board into mostly independent regions. Some savings in computation are definitely possible, since the activations in the first few layers of a CNN don't change far away from the moves.

The way I propose to do it is as follows: the current ResNet produces a [361, c'] tensor right before the policy/value heads. Now let c' = 3c be a multiple of 3, so that every intersection i on the board is represented by three c-dimensional vectors, called e_i, d_i1 and d_i2, where d_i1 and d_i2 correspond to the two other states intersection i could change to. Now, given any subsequent position, we consider its difference with the root position and sum up the corresponding d_i1 or d_i2 over all intersections i that have changed, yielding one c-dim vector d. Then for each i, let a_i = v·tanh(e_i + d), where v is a learnable c-dim vector, and apply softmax over the a_i's to get the policy distribution. The value/score/ownership predictions can be dealt with similarly.

This is a form of attention and is essentially taken from the Pointer Networks paper; in that paper, d is an RNN latent state that encodes all previous moves and their order, but in Go the move history/order (except for a few previous moves in ko situations) is less important than the position on the board, and I think summing up the representations of individual intersections captures the right inductive bias. However, without the tanh this would be too linear and not expressive enough - a 1-dim representation would do as well as a c-dim one. We may pass d through a shallow MLP to make the transformation more expressive, if necessary. I've also considered a_i = e_i·MLP(d) (the attention formula used in transformers) before revisiting the Pointer Networks paper, but have no idea which would work better.
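A sketch of this head in PyTorch (shapes follow the description above; the name `PerturbationPolicyHead` is mine, and this is an untested illustration rather than a worked-out design):

```python
import torch
import torch.nn as nn

class PerturbationPolicyHead(nn.Module):
    """Pointer-network-style policy head over board intersections.
    The trunk emits a [batch, 361, 3c] tensor, split per intersection i
    into e_i, d_i1, d_i2. A perturbed position is encoded by summing
    the d-vectors of the changed intersections; the policy logit for
    playing at i is a_i = v . tanh(e_i + d)."""

    def __init__(self, c: int):
        super().__init__()
        self.v = nn.Parameter(torch.randn(c) / c ** 0.5)  # learnable c-dim vector

    def forward(self, trunk_out: torch.Tensor, changes: torch.Tensor) -> torch.Tensor:
        # trunk_out: [batch, 361, 3c]; changes: [batch, 361, 2] 0-1 tensor
        # selecting d_i1 and/or d_i2 for the intersections that changed.
        e, d1, d2 = trunk_out.chunk(3, dim=-1)  # each [batch, 361, c]
        d = (changes[..., :1] * d1 + changes[..., 1:] * d2).sum(dim=1, keepdim=True)
        logits = torch.tanh(e + d) @ self.v     # [batch, 361]
        return torch.log_softmax(logits, dim=-1)

# For the root position itself, `changes` is all zeros and the head
# reduces to a_i = v . tanh(e_i).
head = PerturbationPolicyHead(c=128)
log_policy = head(torch.randn(4, 361, 384), torch.zeros(4, 361, 2))
```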

The most obvious source of existing training data for such an architecture is positions taken from the same self-play game. The correlation of game results within the same game may be a problem, but KataGo's game branching should alleviate it. For training, the model would take two inputs: the original input planes, plus an [n,361,2] sparse 0-1 tensor representing the indices of the d_i1 or d_i2 that are summed up, where n is the number of perturbed positions. We may train and check the policy loss to see if the idea is viable.

[1] It seems an interesting mathematical problem to determine the asymptotic saving factor in terms of the average branching factor under different probabilistic models of the search tree, but that appears above my pay grade.

Back-of-the-envelope calculation and thought experiments

If we take c'=384, so c=128, we see that the c-dim vector d has the capacity to faithfully represent 128 different "types" of state changes or 128/2=64 "hot spots" on the board, or implicitly decompose the board into 64 regions, which seems more than enough.

The majority of moves are irrelevant to the main contest (they don't affect the grand plan), gain no sente (require no local response), and can be represented by a vector close to zero, with little effect on the policy distribution. Black's and White's policy distributions are likely quite different, but KataGo's architecture already predicts both sides' policy distributions from a single position, though its prediction of the second move isn't conditioned on the first move.

Some irrelevant sente moves that require a local response might be represented by 1 of the 128 features, but once the exchange is done the feature could be cancelled out and the policy distribution may return to the original plan. The features could also represent different sizes of switches (in the CGT sense) and different orders of infinitesimals, where the higher-order ones only become important once the lower-order ones cancel out.

"Equivalent" moves, e.g. in a capturing race or miai, could be given similar representations.

For the most important moves, those relevant to the main contest, several responses and subplans could be hard-memorized in the representations.

lightvector commented 3 years ago

That's a fascinating idea. I think that this sort of thing is unlikely to improve the value function the way actually playing out a position would (otherwise, at least for the best move(s), the value head should have been able to reduce the expected loss further by predicting it in the initial position too), but for accelerating the policy evaluations and the expansion of the tree it's a very cool idea and might be promising to experiment with. I would need to think about it a bit more - there are a lot of messy parameters and hyperparameters to consider (as well as how to make MCTS work with it in a way that is not too painful to implement).