glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess ** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0
760 stars 301 forks

0 visit children are not "proportionally" chosen. Laplacian adjustment may be needed. #559

Open Videodr0me opened 6 years ago

Videodr0me commented 6 years ago

Temperature should proportionally affect all moves (UCTNode::randomize_first_proportionally). Currently, 0-visit children are not proportionally adjusted. This might be a minor issue and may not even be relevant, because "they almost always have a visit at root" (at depth 2, according to @crem). Still, it could potentially omit some moves from temperature adjustment and learning. A non-invasive, statistically sound solution would be a Laplacian adjustment of child_visits.

Crude example: change

    accum += std::pow(child->get_visits()/normfactor, 1/tau);

to

    accum += std::pow((child->get_visits() + 1)/(normfactor + number_of_children), 1/tau);

number_of_children is just pseudocode - replace it with the real code (a fuller sketch is given below).

It is also worth rethinking whether

if (pick_scaled < accum_vector[i])

might be better with <= instead of <.
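
For illustration, here is a minimal, self-contained sketch of such a Laplacian-adjusted proportional pick. This is not the actual leela-chess function: the template parameter, the epsilon argument and the RNG plumbing are placeholders, and only get_visits(), tau and the accum/accum_vector names come from the code discussed above.

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Sketch only: pick a child index "proportionally" to visits^(1/tau),
    // with a Laplacian pseudo-count epsilon so that 0-visit children keep
    // a small non-zero slice of the accum vector. A common normalization
    // factor inside the pow would cancel out of the pick, so it is omitted.
    template <typename Child>
    std::size_t pick_proportionally(const std::vector<Child>& children,
                                    double tau, double epsilon,
                                    std::mt19937_64& rng) {
        std::vector<double> accum_vector;
        accum_vector.reserve(children.size());
        double accum = 0.0;
        for (const auto& child : children) {
            accum += std::pow(child.get_visits() + epsilon, 1.0 / tau);
            accum_vector.push_back(accum);
        }
        std::uniform_real_distribution<double> dist(0.0, accum);
        const double pick_scaled = dist(rng);
        for (std::size_t i = 0; i < accum_vector.size(); i++) {
            if (pick_scaled <= accum_vector[i]) {   // "<=" as suggested above
                return i;
            }
        }
        return accum_vector.size() - 1;  // guard against floating-point edge cases
    }

Setting epsilon = 0 reproduces the current behaviour (0-visit children get zero width in the accum vector); the crude +1 above corresponds to epsilon = 1, and any smaller positive value works the same way.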

snoewchen commented 6 years ago

Just to make the relevant code easier to find: in the "next" branch, some of it is found in "UCTNode::calc_proportional".

Tilps commented 6 years ago

I would suggest (weakly) that this shouldn't matter if noise is applied correctly: averaged over many training games, the statistical effect should be more accurate than boosting 0-visit moves.

jkiliani commented 6 years ago

The current code seems to be working fine to me. Dirichlet noise always has a chance to produce visits even for root moves with very low policy priors.

I should mention that there is currently discussion on Leela Zero (e.g. https://github.com/gcp/leela-zero/issues/1355) to do the exact opposite of what is suggested here, and exclude root moves that only get one visit from being selectable by temperature (since those moves tend to produce the worst blunders).

Videodr0me commented 6 years ago

@jkiliani I read that discussion about Leela Zero Go and am not that convinced it applies to Leela Chess:

(1) Go and chess are different games. Reaching a terminal state in chess (e.g. checkmate) can happen within a few ply, and losing a queen (without compensation) in one ply can basically lose the game, so the variance in subtree evaluations is much higher. Pruning decisions in one game might therefore not translate well to the other. In that Go thread they worry about long tactical ladder sequences, which are found faster by pruning. Leela Chess has more of the opposite problem: finding shallow tactics with (judging by the policy) "surprising" moves.

(2) Self-play (training games) vs. normal play. I think not everything that increases Elo in normal play should be applied in training; in fact, it can be detrimental. Training should explore more, while in normal play (e.g. TCEC) all sorts of pruning and reductions can be fine. For training I would argue for a theoretically sound approach (>0 probabilities for all moves). Maybe at a later point in time, when Leela Chess is much stronger, these reductions (and FPU) will make sense in training as well.

(3) But even if you argue against (1) and (2): is this part of the code really the place for a "sneaky" reduction? From a software engineering standpoint I would argue that every function should do what it is supposed to do (according to its description and intent), independently of other functions. So even if "other" noise, generated "somewhere" in the code for a different purpose, reduces the problem somewhat - and it surely does, I take your word for it @Tilps - it somehow... feels itchy ;-)

(4) A Laplacian smoothing/adjustment does not need to be large: +1 was only an example. Any Laplacian term will do to put a non-zero prior on all moves. It can be extremely small - I do not really care how small, as long as every move gets some >0 probability mass. Even if it is only 0.0000000000000000001 or whatever, just to ensure complete exploration as training goes to infinity. I always like it when theoretical properties are ensured, even if only in theory ;-)

Disclaimer: in the grand scheme of things this probably has no impact - but it feels conceptually more correct, and especially given that progress in tactics over millions of games has been less than stellar, it seems like a simple, very targeted change (with no impact on other code).

jkiliani commented 6 years ago

The discussion in Go was actually about something that's exactly comparable with chess. The player who was winning passed in a situation where that pass leads to an immediate loss when answered with another pass, which of course happened with the ELF weights. That is not in any way different from making a move allowing the opponent to checkmate you.

Videodr0me commented 6 years ago

@jkiliani I tested FPU in 10 tactical positions: FPU reduction prevented Leela from finding the tactics, or prolonged the time to find them. The higher the FPU reduction, the more pronounced this effect was. Skimming through the Leela Zero thread, I had the opposite impression: it seemed that in Go more pruning was what solved the "tactical" issue. But granted, I am not a Go player and only know the basic rules, so maybe I am wrong here. In general I am all for using reductions in regular play - not so much in training games, and even less for sneaking them into an unrelated function.

jkiliani commented 6 years ago

Did you use noise when you tested this or not? FPU reduction at a noised root is deactivated for precisely this reason: it prevents low-policy moves from being explored, which interferes with the effectiveness of the Dirichlet noise.

Videodr0me commented 6 years ago

I used the defaults of Leela 0.8 with net 258, just to see how FPU would affect tactical solution time.

jkiliani commented 6 years ago

With noise or without? As I said, this makes a big difference...

Videodr0me commented 6 years ago

defaults of 0.8 are: cfg_noise = false; cfg_randomize = false;

jkiliani commented 6 years ago

OK, then your test results do not apply to self-play games, since those are played with noise on.

Just to be clear: it is very likely beneficial, both for solving tactical puzzles and for play against engines, to raise cfg_puct to a value higher than 0.6 and to set cfg_fpu_reduction to 0. You can even try softmax_temp = 1.2 when giving Leela tactics puzzles. But since you are arguing for changing settings in self-play, you would have to provide data on how your proposed changes affect self-play.

Videodr0me commented 6 years ago

Sure they don't - the discussion you brought up from Leela Zero was about FPU reductions and later morphed into a discussion about pruning single-visit moves. I just took it as an example of how chess and Go might react differently to reductions. The main point of my initial comment is that currently 0-visit nodes are never (!) chosen, no matter what the training temperature is set to. This might be inconsequential, or it might be bad - who knows, ultimately - but it seems unintended at that place in the code. If you advocate other pruning schemes, that's fine too, but at least in my mind they should be clearly separated from this function (or at least marked as such).

killerducky commented 6 years ago

Self-play game parameters can be found here:

http://lczero.org/training_runs
["--randomize", "-n", "-v800"]

This means fpu_reduction is not turned on for the root node, for the reasons you mentioned:

    if (!is_root || !cfg_noise) {
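        // FPU reduction is applied everywhere except at a noised root,
        // so low-policy root moves are not discouraged from getting a first visit.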
        fpu_reduction = cfg_fpu_reduction * std::sqrt(total_visited_policy);
    }   

This, plus noise being on, should help with this issue. Note that there was a related improvement in v0.8; see the release notes. We also disabled tree reuse in the next branch for similar reasons; that will be in v0.9.

I looked at some recent tactical positions and even with all these changes still some winning moves receive 0 visits sometimes. But with noise on they also received 1 or more visits at least some of the time. I think it's probably guaranteed that all moves will be explored some % of the time, given the noise hits them hard enough. I want to get v0.9 released to see if that helps before doing something more.

jkiliani commented 6 years ago

Forcing every root move to have at least one visit would have strange effects when using low visit counts. Dirichlet noise currently ensures that every possible move has at least a certain chance of being explored. That is enough for the training feedback loop to work.

Videodr0me commented 6 years ago

@jkiliani No, I am not forcing every root move to have at least one visit. The adjustment only pertains to the accum vector; no visits are actually changed. Only the spacing of the uniform probabilities is smoothed by the Laplacian adjustment (or, in Bayesian terms, a non-zero prior). As the chosen move is promoted to first child, this has no other effect. If you find an adjustment of +1 too much, any other pseudo-count is just fine - say 0.00001 or whatever. Currently the probability of zero-visit nodes is not scaled proportionally at all. Even if Dirichlet noise mitigates the problem somewhat, the function is not working the way it is supposed to: the probability should be affected by both temperature and Dirichlet noise, not just by one of them.

killerducky commented 6 years ago

I just want to clarify, are you saying the code does not implement what the AZ paper says?

> Moves are selected in proportion to the root visit count

I read this to mean that if a move gets 0 visits at the root, then it should be selected 0% of the time. If you are proposing something different from the paper, that's fine and we can discuss it. But I don't agree that "the probability of zero visit nodes is not scaled proportionally at all": scaling 0 proportionally is still 0. Again, if you're proposing to do something different that's fine, but let's make a clear distinction that you're proposing to do it differently on purpose.

jkiliani commented 6 years ago

You currently get a probability of maybe 0.2 per move for a significant noise hit, i.e. a chance for the noise to raise a zero-policy move enough to get a visit. How is that in effect any different from your proposal? For a zero-policy move to be played is simply a probability conditional on both noise and temperature; why is that bad in any way?

Videodr0me commented 6 years ago

@killerducky I am pretty sure "in proportion to visit count" in the AZ paper actually means in proportion to non-zero visit counts, or to Laplacian-adjusted visit counts. But you are correct: by the letter, one could interpret it as you do. I honestly think, though, that in the MCTS framework it makes no sense never to choose 0-visit nodes, even with some small probability. If this is by design (which I kind of doubt) and not just sloppy language, I would be intrigued by the reasoning behind it - to me as a statistician it just seems unhealthy. The probability doesn't have to be high, but as in all probabilistic processes, there have to be damn good reasons to set it to zero regardless of the exploration coefficient.

Side note: without being nitpicky, "proportional" means that there is a constant k != 0 such that y = k*x; then x and y are proportional. For all (pairwise) visits to be proportional, such a constant must exist. Leela's implementation prohibits a single constant k for all pairwise visits once zero-visit nodes are included. Thus I would say Leela's implementation differs from AZ in this respect.

jkiliani commented 6 years ago

The probability of zero-visit moves actually being played is irrelevant. What matters is the probability of zero-policy moves being played, and that is never zero with both noise and temperature on.

Videodr0me commented 6 years ago

@jkiliani I am not saying anything is majorly wrong. Yes, maybe the effect is mitigated by other noise - no problem. But 0-visit nodes are real, and according to killerducky, "I looked at some recent tactical positions and even with all these changes still some winning moves receive 0 visits sometimes.", so this could be an issue. I am just pointing out what @snoewchen brought to our attention and trying to be constructive. If you guys say "Nah, we don't like changing that stuff", I am fine, and will only remind you in 50 nets' time ;-).

gcp commented 6 years ago

The issue for me is that the proportional selection tries to select moves according to how much the search liked them. But at the root, with or without noise, some moves will get expanded on policy prior alone, and then, no matter how utterly deeply the search disliked them, be stuck with a single visit, and be eligible for being played. This means the learning cycle can't fix those mistakes.

If you'd spend more search effort, those moves would stay pegged at 1 visit and the probability of choosing them would drop. So in effect, you're dealing with the fact that the distribution at that level is quantized to 1/800 (or 1/3200 for LZ) yet those quantization errors have big effects (huge blunders).

I think your argument works the exact opposite way, i.e. you are saying that if a move was just shy of being selected by the search, you want to fix the error that its probability is quantized to 0. But such a move still has a chance of being selected next time if it gets lucky with the noise, and in fact proportionally so by how close it was. So the current algorithm, on average, already behaves correctly for this case, IMHO. Put differently, I think the Dirichlet noise already addresses what you are trying to fix.

gcp commented 6 years ago

"I looked at some recent tactical positions and even with all these changes still some winning moves receive 0 visits sometimes." this could be an issue.

This would be the expected result for any fix that correctly tries to address "rounding" of 0 visit moves as well. It has to quantize down sometimes, right?

snoewchen commented 6 years ago

@gcp - maybe there is a small misunderstanding of what @Videodr0me meant: 0-visit nodes should not be boosted to 1-visit nodes (if that is what you meant by "stay pegged at 1 visit").

The argument is just that 0-visit-nodes should have a small chance of being selected because of temperature - give them a tiny space in the accum vector (used to pick a random move) instead of zilch.

Tilps commented 6 years ago

The proposed logic changes the minimum chance of selection from 0 (vs. 1/800 for 1 visit) to 1/835 in training scenarios (assuming 35 possible moves) - so it is not far off from the prior chance of a 1-visit move.
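
For reference, that figure is just the +1 pseudo-count applied at temperature 1 to an 800-visit search with 35 legal moves:

    (0 + 1) / (800 + 35 * 1) = 1/835,   versus 1/800 for a move with exactly one visit under the current scheme.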

Videodr0me commented 6 years ago

@gcp I still believe your reasoning does not address the shortcomings of the current implementation.

(1) If you advocate a pruning scheme based on visits at the root, then it should most certainly scale with the number of nodes searched. If we did a 10-million-node search, where each root child almost certainly gets a visit, the current (mis-)use of randomize_first_proportionally to prune 0-visit children would not even be effective. For such a scheme to scale, cutting off moves with visits below some number of standard deviations from the mean (or median) might be a more consistent approach. Also note that this issue is about 0 visits and only partially pertains to "fixing" errors in 1-visit nodes.

(2) I argue that in self-play training (not playing against others), any discontinuous pruning decision should be handled with great care and might be detrimental not only to exploration but also to maximizing the cross entropy. To ensure (theoretical) full exploration under the rolling learning regime (one might see it as a Markovian process), probabilities should really be >0, even if really, really small.

(3) Quantization is not a problem with a Laplacian adjustment, as it can be done to any desired level. For example, for an 800-node search with 30 different moves, a zero-visit node would have a 1/830 chance if adjusted by one. If adjusted by 0.01, a zero-visit node would have a chance of 0.01/800.3. In fact this helps with quantization, as the discontinuous jump from 0 to 1 is somewhat mitigated. (A compact formula for the adjusted rule is written out after this comment.)

(4) I am not changing any visit counts. I just adjust the intervals on the uniform probability interval. The change does not affect any other parts of the code. Nothing gets pegged at 1 - it is minimally invasive and only gives 0-visit nodes some space on the probability interval, as opposed to none. Even if you argue that the (un)intended pruning is beneficial and/or that other sources of "randomness" "somewhere" in the code reduce the issue somewhat, it has to be said that, from a software engineering standpoint, randomize_first_proportionally is not really an elegant place to "sneak" this in, nor a reason to let a bug stand just because other parts of the code make it a lesser problem.

(5) Proportionality != multiplication. After rethinking (and rereading the definition of proportionality) KillerDucky's quote from the AZ paper, it is clear (in my mind) that what Leela is doing is not (!) what AZ is doing. For Leela's current "proportional" adjustment, no constant k != 0 exists that satisfies v_new_i = k * v_old_i for all pairwise visits (for all i). In that respect Leela is not a replication.

(6) Chess vs. Go - of course, at various points in learning, pruning schemes can have beneficial effects. I would still advocate pruning to an extremely low probability instead of a hard cut-off (1 in a million, 1 in a trillion, it does not matter as long as it is >0). But at the moment it could be that Leela Chess simply needs a different learning regime than Leela Zero, so that things do not translate very well. In match play (outside training) there is certainly much more room for pruning, including "hard" pruning.

As a conclusion I would say: yes, it is a bug, or at least unintended behaviour. Pruning should be done separately, or at least marked clearly as such, and should scale with the number of nodes searched. Also, Leela is currently not doing what AZ did. The change does not alter any visit counts; it just gives 0-visit nodes a small >0 probability of being chosen (which should be good, at least in training). In the grand scheme of things it will probably not have any substantial impact - nothing that a trillion more games with the current noise wouldn't find eventually - but sometimes it is good that all the little things work, too.
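
Written out (my notation, not anything in the current code), the smoothed rule being argued for here, with root visit count v_i for child i, pseudo-count epsilon and temperature tau, would be:

    $$ p_i = \frac{(v_i + \varepsilon)^{1/\tau}}{\sum_j (v_j + \varepsilon)^{1/\tau}} $$

Setting epsilon = 0 recovers the current behaviour; any epsilon > 0, however small, leaves no legal move with exactly zero selection probability, and the normalization constant from the crude example cancels out.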

jkiliani commented 6 years ago

@Tilps for any single major blunder that calculation looks correct, but if we estimate that among the 35 legal moves on average, 20-25 are major blunders, then this proposed change would make the frequency of blunders in self-play go through the roof. That's why I think this is a bad idea and @gcp merging 1-visit exclusion was the right call. Whether doing the same here would be equally beneficial is hard to say since LZ uses a lot more visits than Leela Chess.

Videodr0me commented 6 years ago

@jkiliani You can scale the adjustment as you like. Just set it to 0.0000000001 and nothing goes "through the roof".

Tilps commented 6 years ago

I think you misinterpreted my comment - I was trying to show the connection between the original proposal and comments about it making 0 visits act like 1 visits. (I don't think the proposal is a good idea in that form.)

Videodr0me commented 6 years ago

@Tilps Sure - if we are now talking about how large the adjustment should be, I am all in.

jkiliani commented 6 years ago

So what is the actual benefit of a scaled-down selection chance equalling the chance of winning the lottery? In most games it will never happen. If any moves with too low a policy are picked, it will still be the effect of noise + temperature. If said "lottery win" ever actually happens, it simply means that you could no longer reliably debug suspicious moves from the training data, since anything could then be played at any time, no matter what the result of the tree search is. To me this sounds like putting something in the code that could easily become a major headache later, for no better reason than a sense of mathematical correctness.

Videodr0me commented 6 years ago

@jkiliani Training is about exploration - so questioning long-held beliefs, even if only rarely, is important for effective learning. I get the impression that you believe learning only takes place from "good" moves, but even if the "bad" moves are really bad, it helps to discriminate between the "badness" of moves, which is just the mirror image of "goodness".

The comment about "reliably debugging suspicious moves from the training data" makes no sense to me - it is a random process, and it should by design play (presumably) bad moves occasionally. It will do so even without changing the function in question, due to the noise applied, and no one complains about debugging those moves. With a fixed seed they are as "debuggable" as anything else in the code.

Every function of a well-engineered program should work independently of the others. It should not be only the combination of noise + temperature that works and ensures full exploration (at least in the limit). Furthermore, if somebody took the code and decided to train with, say, only a 10-node search, the effect of the current implementation might turn out to be a real problem.

About "mathematical correctness", i am not nitpicking about "formality" here - even though i believe no part should break the assumptions which ensures the MCTS convergence properties. - but when has "correctness" ever been a bad thing?

About "winning the lottery". As the level of probability can be chosen by us. I would propose a discussion about what would be a reasonable adjustment, I would argue that during each learning window it should at least be possible to chose a 0-visit move (independent of noise) at least once (on average). What would be a good upper limit? Any thoughts?

jkiliani commented 6 years ago

> I get the impression that you believe learning only takes place from "good" moves, but even if the "bad" moves are really bad, it helps to discriminate between the "badness" of moves, which is just the mirror image of "goodness".

Actually I completely agree with that, which is why I argued along with a number of other people to switch to t=1 in Leela Zero, which before now was only active for the first 30 moves, in AGZ style.

> The comment about "reliably debugging suspicious moves from the training data" makes no sense to me - it is a random process, and it should by design play (presumably) bad moves occasionally.

It already does, and quite frequently. 1-visit moves are usually very bad, since the first visit is chosen by FPU without knowing anything about the evaluation of the position created by this move. This is the whole idea behind excluding moves with only 1 visit from being selected. The comment about debug meant that with the current system, you can for bad blunders always check the training data: If that move got at least 1 visit, it could have been picked by temperature. If it has zero visits, you found an actual bug. Giving moves with zero visits a chance to be selected anyway removes the distinction. Fixed seeds only work if the neural net evaluation is reproducible, which unfortunately it isn't always due to OpenCL errors, and may not even be with cuDNN.

Agreed that training with a 10-node search would not work, but it would also not work with this proposal. The problem is that for training the policy to work, its training target must represent an improvement on the policy, which we get with the visit vector since that is shaped by both the original policy and the board evaluations of all the sub-trees of the child nodes. Out of necessity it is quantized, since the improvement to the policy vector can only come from allowing board evaluations to reshape the policy. The more visits you use, the better and the less quantized this improved policy vector becomes. I think it's likely that several hundred visits is the lower limit where training would still work for this reason. 10 is not enough, and allowing moves with no visit to be played regardless does not change that.

Dorus commented 6 years ago

> The issue for me is that the proportional selection tries to select moves according to how much the search liked them. But at the root, with or without noise, some moves will get expanded on policy prior alone, and then, no matter how utterly deeply the search disliked them, be stuck with a single visit, and be eligible for being played. This means the learning cycle can't fix those mistakes.

@gcp I'm confused. If temperature selects a 1-visit move, the NN will be trained to better (= more accurately) evaluate this move next round. That means the next time it gets 1 visit, that 1 visit will be of higher quality.

If you want the network not to learn these 1-visit moves, you would have to modify how the training server handles them: remove 1 node from every move during training, and these 1-visit nodes will have 0 visits for the trainer.

Also, the NN should be able to learn fractions lower than 1/800 or 1/3200. Certainly, if a move has 1/800 policy odds, it will not be selected once per 800 visits every single time, right? So out of 10 self-play games, this bad move will only get 1 visit in, say, 4 of them. That would be enough for training to move the odds of this move to 1/2000 in the next training cycle.
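
A quick check of that 1/2000 figure, assuming the policy is trained towards the averaged root visit distribution of those games:

    (4 * 1) / (10 * 800) = 4/8000 = 1/2000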

Ishinoshita commented 6 years ago

@Dorus:

> If temperature selects a 1-visit move, the NN will be trained to better (= more accurately) evaluate this move next round. That means the next time it gets 1 visit, that 1 visit will be of higher quality.

I may be misunderstanding here, but the fact that a node is selected in a given position is unrelated to the training data drawn from the search in that position; it only influences the trajectory of the game and the portion of the state space that the self-play explores. A 1-visit child node may be excluded from next-move selection while still accounting for 1/3200th of the probability in the visit vector, possibly pulling up the policy result for that move.

Dorus commented 6 years ago

@Ishinoshita Yes, correct. However, since the game went in that direction, there will also be training data for the next move. Thus the network will be trained on the situation after that move, and so will learn to evaluate it more accurately. If the move turns out to be better than expected, in the next training cycle that 1 visit might get extended to more than 1 visit. If it turns out to be even worse, it will stay at 1 visit (since that is the lowest it can get), but hopefully that 1 visit does not happen every time the search ends up in that position, so the feedback loop can continue to lower the policy prior for that move during training.

The only thing required here is that the MCTS does not make the same search every time, but sometimes skips a low-priority move during self-play. If it only hits that low-priority move every other game, but the move turns out to be good, it will get 2 or more visits and still average over 1 visit per game. If it is a bad move, it will stay at 1 visit in half the games, or 0.5 visits/game, so the policy prior goes down; next time the search will only look at it once every 4 games, lowering it to 0.25 visits/game, and so on.

Videodr0me commented 6 years ago

@jkiliani

> It already does, and quite frequently. 1-visit moves are usually very bad, since the first visit is chosen by FPU without knowing anything about the evaluation of the position created by this move. This is the whole idea behind excluding moves with only 1 visit from being selected.

The original issue is with 0-visit nodes in the function UCTNode::randomize_first_proportionally. Independent of whether a pruning scheme is beneficial or not, pruning should be separated from final move picking - if only for good software engineering practice. Also, the 0-visit or 1-visit pruning seems like a rather ad-hoc scheme: why stop at 1-visit nodes? Why not also prune 2-visit nodes - following your logic they are probably worse than 3-visit nodes, and so on. If such a scheme is viable, it should scale with the number of nodes searched and/or the variance of visits per node. I also believe (but this is a topic for another discussion) that there should be a clear distinction between pruning in training and pruning outside training; I think a good case can be made for making all pruning decisions in training soft (reducing probabilities to a very small but non-zero quantity) and using hard pruning (never searching pruned moves) outside of training. I will write something up about this soon; I am just working out some math and hope to corroborate it with some experiments.

> The comment about debug meant that with the current system, you can for bad blunders always check the training data: If that move got at least 1 visit, it could have been picked by temperature. If it has zero visits, you found an actual bug.

If you feel that this info is useful, I would make it explicit instead of implicit and output it along with the training game data. If data size is an issue, maybe this additional output could be toggled on the server side as needed, and only for a certain percentage of games. Especially in collaborative projects it is good practice not to rely on particular behaviour of the code to obtain debugging information - such knowledge tends to stick with the few people "in the know" and might break when others change the code. If something useful for debugging can be output explicitly, everybody can join in and understand the project more quickly.

> Out of necessity it is quantized, since the improvement to the policy vector can only come from allowing board evaluations to reshape the policy. The more visits you use, the better and the less quantized this improved policy vector becomes.

Here we totally agree: more visits are clearly better. With only 10 visits, learning would take substantially longer and might yield inferior results. The example was only meant to make clear that one never knows what others will do with the code (good or bad), so it should be somewhat robust to edge cases.

But again, this whole issue is "minor"; I would not have guessed that it would start such a long debate.

ASilver commented 6 years ago

"The discussion in Go was actually about something that's exactly comparable with chess. The player who was winning passed in a situation where that pass leads to an immediate loss when answered with another pass, which of course happened with the ELF weights."

Funny story about this: the idea of mentally passing a move, to see what would happen if the other side moved twice in a row, is the basis for the chess algorithm known as the null move.

https://en.wikipedia.org/wiki/Null-move_heuristic

AFAIK the first program to use it was a very early version of Fritz. It led to a massive increase of ~120 Elo or so, and the technique soon became standard.

Anyhow, I would never suggest it for the training games, where I think it is counterproductive, but for the playing version, that is another story, and might merit testing.