lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

Updated policy after tree search? #293

Open egri-nagy opened 4 years ago

egri-nagy commented 4 years ago

I'm using the Analysis Engine to get the policy for board positions. The documentation says that the policy is an array

> with positive values summing to 1 indicating the neural network's prediction of the best move before any search,

however, the policy I get back is different for maxVisits=1 and maxVisits=10000, which suggests that the tree search changes the policy - which is what I need. I would expect the raw network policy for maxVisits=1, and the search-refined policy for a higher number of visits.

Could you please confirm that this is the actual behaviour? Or is the policy independent of the search as the documentation suggests?
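
For reference, this is roughly how I query the engine (a minimal sketch in Python; the config and model paths are placeholders, and the field names follow the Analysis Engine documentation):

```python
import json
import subprocess

# Placeholder paths - substitute your own analysis config and model file.
engine = subprocess.Popen(
    ["katago", "analysis", "-config", "analysis.cfg", "-model", "model.bin.gz"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

query = {
    "id": "q1",
    "moves": [["B", "Q16"], ["W", "D4"]],  # the position to analyze
    "rules": "tromp-taylor",
    "komi": 7.5,
    "boardXSize": 19,
    "boardYSize": 19,
    "maxVisits": 10000,       # vs. 1, which I expected to give the raw policy
    "includePolicy": True,    # ask for the policy array alongside moveInfos
}
engine.stdin.write(json.dumps(query) + "\n")
engine.stdin.flush()

response = json.loads(engine.stdout.readline())
best = min(response["moveInfos"], key=lambda m: m["order"])
print(best["move"], best["winrate"], len(response["policy"]))
```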

lightvector commented 4 years ago

The search does not change the raw policy. Probably what you are experiencing is the random variation in the raw policy due to symmetries. Every time a position is evaluated, one of the 8 symmetries of the neural net is randomly chosen - the board is randomly rotated or reflected before being fed to the net, then the output of the net is rotated or reflected back.

The neural net may very well return different values when looking at different symmetries of the board - it is essentially as if one didn't perform random symmetries but instead trained 8 separate neural nets independently on the same data. Randomizing evaluations between the "8 neural nets" helps ensemble their overall sentiment about the value of the position, which may improve evaluation accuracy in longer searches. And in actual play, it helps with randomization (so that KataGo doesn't always play the exact same game every time) at probably no loss of strength - since a priori, there's no reason to expect any of these 8 nets to be much weaker or stronger than any of the others.
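
To make the mechanics concrete, here is a toy numpy sketch of what "evaluate under a random symmetry" means (illustration only, not KataGo's actual C++; the pass output is ignored):

```python
import numpy as np

def eval_with_random_symmetry(net, board, rng):
    """Pick one of the 8 dihedral symmetries, transform the board, run the
    net, then map the policy back to the original orientation."""
    k = int(rng.integers(4))       # number of 90-degree rotations
    flip = bool(rng.integers(2))   # whether to also reflect
    b = np.rot90(board, k)
    if flip:
        b = np.fliplr(b)
    policy = net(b)                # toy net: returns a policy shaped like the board
    # Invert the transform on the output: undo the flip, then the rotation.
    if flip:
        policy = np.fliplr(policy)
    return np.rot90(policy, -k)

# A dummy "net" just to make the sketch runnable.
rng = np.random.default_rng(0)
dummy_net = lambda b: np.ones_like(b, dtype=float) / b.size
print(eval_with_random_symmetry(dummy_net, np.zeros((19, 19)), rng).shape)
```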

egri-nagy commented 4 years ago

I see, buy 1 get 7 free. This makes good sense. Thank you very much for the prompt answer and the clarification!

So, back to my original problem: if I want to compare the raw policy with the result of the search, can I take the output of `includePolicy` and compare that with the winrate in `moveInfos`, restricting the comparison to the top n moves? Does that make sense? I'm trying to measure the added value of the search directly, not in a holistic way (e.g. Elo points).
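
Something like this is what I have in mind (a sketch against a single analysis-engine response; I'm assuming each entry of `moveInfos` carries `move`, `prior`, `winrate`, and `order`):

```python
def top_n_comparison(response, n=10):
    """Pair each of the search's top-n moves (ranked by 'order') with its
    raw-net prior and its post-search winrate."""
    infos = sorted(response["moveInfos"], key=lambda m: m["order"])[:n]
    return [(m["move"], m["prior"], m["winrate"]) for m in infos]
```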

lightvector commented 4 years ago

What does "added value" mean?

The move with the highest winrate is not always the move that KataGo believes is best. Sometimes a high winrate can occur simply due to noise, and if the move were explored more it would be revealed as not actually a high-winrate move - so in actual play, "highest winrate" is not the criterion for choosing a move. Also, sometimes KataGo will choose a move with a very slightly lower winrate but that leads to a better final score. Sometimes it will choose a move that leads to a slightly worse final score but that it is more confident about winning with. The actual function for choosing the final move is a bit complicated and you should not attempt to compute it yourself - instead, whichever move has the field "order" equal to 0 is KataGo's top choice.

Also, of course, "winrate" and "policy" are in different units and serve different purposes, so comparing them is sort of a type error. And if we ignore the above detail (transient high winrate due to noise) it may even still be "correct" and "optimal" for them to sort some of the moves beyond the first one in a different order from each other.

You probably want to be clearer about exactly what "added value" means and what your purpose is. :)

egri-nagy commented 4 years ago

Indeed, the term "added value" is vague. So let me describe the intuitive idea I'm looking to implement. This might be nebulous itself, but I'll try.

Ultimately I'm trying to measure how close the network is to perfect play, or at least to the local minimum it dug itself into. One way to do this is to watch the rate of learning drop off - I guess this is what you did when you declared the networks to be final. What I'd like to do is to measure this more directly.

As a rough analogy to human thinking, the raw policy of the network is intuition, while the search is calculation (step-by-step reading, logical calculation). The better intuition you have, the less calculation you need to do. So, if the network gets really good, then the search does not provide any information gain (this is what I meant by added value). Whether this limit case is possible or not, that's another interesting question. I simply want to measure the disparity between the probability distributions of good moves before and after the search.

To implement this, I took the updated policy (this is wrong, as you pointed out), kept its top 10 moves, discarded the rest, and normalized; this is probability distribution Q. Then I looked at the raw policy for those same 10 moves, discarded the rest, and normalized; this is P. Then I computed the Kullback-Leibler divergence of P and Q, i.e. the information gain when using Q instead of P. At that point I realized the problem of not having an updated policy.
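
In code, roughly this (a sketch of what I computed; `post_probs` is the "updated policy" I mistakenly assumed exists, `raw_probs` is the raw network policy, both as dicts from move to probability):

```python
import math

def truncated_kl(post_probs, raw_probs, n=10):
    """KL(Q || P) restricted to the top-n moves of the post-search
    distribution Q, with both Q and P renormalized over those n moves."""
    top = sorted(post_probs, key=post_probs.get, reverse=True)[:n]
    q_norm = sum(post_probs[m] for m in top)
    p_norm = sum(raw_probs[m] for m in top)
    kl = 0.0
    for m in top:
        q = post_probs[m] / q_norm
        p = raw_probs[m] / p_norm
        if q > 0:
            kl += q * math.log(q / p)
    return kl
```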

You say it is complicated to compute the actual function for choosing the moves, but it seems that that is the way to go, so could you please point me to the relevant part of the source code? I need to construct the probability distribution for the top 10 (or some other number of) moves, just as if I wanted to do random sampling over these moves.

Sorry for being vague, I hope this is a bit more clear now.

egri-nagy commented 4 years ago

As a side comment, I am fully (perhaps painfully) aware of the intricacies of choosing moves that you described. I have this new hobby of handcrafting 'perfect' games by choosing moves manually after heavy analysis (often hours long on an RTX 2070, with restarts). I balance the komi in the beginning, then try to choose the best possible moves for each side. An interesting common feature of those games is that the winrate goes to one side in the third part of the game, but the score lead remains balanced. Here is one example: [screenshot from 2020-08-09]. And here is a replay of another game: https://youtu.be/abnQIItZgDg As you can see, it's all about aiming for perfection. :)

lightvector commented 4 years ago

Comparing the policy to the final move chosen by the bot is a good way to get some statistics about "how often does search result in a different move being selected". So it's good for that objective. But it tells you very little if anything about the distance to perfect play, so it's probably pretty bad for this latter objective. The amount that search adds has very little to do with the distance to optimal (which is vastly larger than what is added by the search) and much more to do with just the search itself. Or maybe to rephrase: trying to measure the "added value" of the search is a great idea to quantify how good the search is.... but the "added value of the search" doesn't say much about how much farther it is to optimal beyond that (namely, a lot).

For distance to optimal, you're better off looking for curvature in the relationship between Elo ratings for various komi offsets or for playout doublings - Friday9i did a lot of interesting work here. You can extrapolate by looking at how, as the level of play continues increasing, the actually-measured winning chance (i.e. Elo rating difference) for the same amount of komi handicap grows more and more extreme, or similar things.

Does that make sense?
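
Very roughly, the shape of that kind of extrapolation (the numbers below are made up purely to illustrate the computation, not real measurements):

```python
import numpy as np

# Made-up Elo ratings at successive playout doublings.
log2_visits = np.array([5, 6, 7, 8, 9, 10, 11])
elo = np.array([1500.0, 1700.0, 1860.0, 1990.0, 2095.0, 2180.0, 2250.0])

# Fit a quadratic and look at the curvature: if the gain per doubling is
# shrinking, you can (crudely) extrapolate where the marginal gain vanishes.
a, b, c = np.polyfit(log2_visits, elo, 2)
print("Elo gain per doubling:", 2 * a * log2_visits + b)
if a < 0:
    print("gain extrapolates to ~0 near log2(visits) =", -b / (2 * a))
```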

As for the entire notion of an "updated policy" - one tricky thing that I think you might be underappreciating is that modeling the final result of the search as a probability distribution is actually a little ill-founded. For this kind of purpose, you should probably just treat the result of the search as deterministic - the order 0 move is the result of the search, and that's it. So if you want to compute a KL-divergence, the "updated policy" would just be the one-hot distribution that has 100% of its mass on the move whose reported order is 0. Which amounts to just the average log policy of the order 0 move.
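
So in code, the KL computation collapses to something like this (a sketch; I'm assuming the per-move `prior` field in `moveInfos` as the raw policy of that move):

```python
import math

def neg_log_raw_policy_of_search_move(response):
    """With the 'updated policy' taken as one-hot on the order-0 move,
    KL(one-hot || raw policy) is just -log(raw policy of that move).
    Average this over many positions to get a single number."""
    chosen = next(m for m in response["moveInfos"] if m["order"] == 0)
    return -math.log(chosen["prior"])
```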

Why do I say this? Because the only reason that the bot plays nondeterministically is for reasons that have very little to do with the strength of the bot and more to do with arbitrary engineering choices or practical desires to explicitly introduce randomness. In practice, KataGo does randomize a little according to a probability distribution at the end of search, but whether that probability distribution is extremely sharp or whether it's somewhat more diffuse has almost nothing to do with the strength of the search. It's entirely governed by an arbitrarily chosen parameter "chosenMoveTemperature" whose optimal value is 0 (i.e. choose the order 0 move deterministically), except for the fact that it would make KataGo tend to just play the same game over and over if the opponent also played the same moves. So while 0 is the strongest possible setting if facing that opponent for only one game, or if the opponent could be relied on to provide entropy so that KataGo didn't need to, in practice we set a nonzero temperature so as to introduce a bit of variety into play even if the opponent tries to repeat a game. But the point is that this is an arbitrary post-search choice.
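
For intuition, a generic temperature-sampling sketch over per-move visit counts (just the general shape of what a parameter like `chosenMoveTemperature` controls, not KataGo's exact formula):

```python
import numpy as np

def sample_move(visits, temperature, rng):
    """Sample a move index proportional to visits**(1/temperature).
    temperature -> 0 approaches 'always pick the most-visited move';
    larger temperatures spread the choice over more moves."""
    visits = np.asarray(visits, dtype=float)
    if temperature <= 0:
        return int(np.argmax(visits))
    logits = np.log(visits + 1e-12) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(visits), p=probs))

rng = np.random.default_rng()
print(sample_move([9000, 700, 200, 100], temperature=0.5, rng=rng))
```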

What about the random 8 symmetries, or multithreading? Don't they introduce nondeterminism and make KataGo have a well-defined notion of a post-search distribution, which you can at least sample from by repeatedly re-searching a position? Sort of. But again, these are mostly fragile implementation details. If the rotations were keyed off of the zobrist hash of the position (which I could well have implemented instead of what it does now), the strength would be indistinguishable, but it would be deterministic in this regard. Or if batching operated in coordinated waves (I think ELF OpenGo's native engine actually does something like this), then again the strength and overall "value" contributed by the search would be about the same, but now the result would be deterministic even with respect to batching/threading.
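
For example, something as simple as this hypothetical scheme would keep the "ensembling over symmetries" effect while being fully reproducible per position (again, not what KataGo actually does):

```python
def symmetry_for_position(zobrist_hash: int) -> int:
    """Hypothetical: derive which of the 8 board symmetries to use
    deterministically from the position's hash, instead of at random."""
    return zobrist_hash % 8
```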

Basically - the point of all of the above is mostly that, to the degree that there is any meaningful probability distribution besides just "one hot on the order 0 move", it seems to me that the nature of that distribution is governed entirely by arbitrary choices that are driven by external considerations or by arbitrary implementation details that have nothing to do with the "value added" by the search that you are interested in, and that would be almost indistinguishable if done another way. So you might as well just use "one hot on the order 0 move".

(Edit: I actually do think there is a bit of room for modeling the search result as a distribution. So I'm partly playing devil's advocate here. But even so, it's tricky - because there's not an obvious objective "ground truth" that it's a distribution over - which is the root of some of the above model issues too, at a high level).

egri-nagy commented 4 years ago

Thank you for these insights! I see the danger of implementation details influencing the probability distribution after search, which needs to be considered. Still, I think there is also useful signal in that distribution. Trivially, forced moves have a single peak, while 'commuting' moves (same winning chance, but the order doesn't matter; there are many of these in the endgame) spread the mass over several moves. What interests me is the moment when the bot 'changes its mind': the search finds something that changes the best move, so it goes against the 'intuition', the raw network policy. I see this frequently in ongoing analysis (millions of visits). The thinking goes that if the network is stronger, then we would observe these changes less frequently. Well, I may still need to think about how to express this thinking properly.
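
Concretely, per position I could check something like this (a sketch; it approximates the raw-policy argmax using only the `prior` values of the searched moves):

```python
def changed_mind(response):
    """True if the search's chosen (order-0) move differs from the move the
    raw policy likes best. Approximation: only searched moves are considered,
    since only they appear in moveInfos."""
    infos = response["moveInfos"]
    search_move = next(m["move"] for m in infos if m["order"] == 0)
    policy_move = max(infos, key=lambda m: m["prior"])["move"]
    return search_move != policy_move
```

The frequency of this over many positions would be the "changes its mind" rate I have in mind.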

Certainly the hitrate is less controversial to measure; here is a draft paper where Table 1 has some experimental results: https://arxiv.org/abs/2009.01606

Also, could you please give a pointer to Friday9i's work you mentioned?