gjm11 opened this issue 6 years ago
Yeah, we could probably double the global speed of selfplay games (or alternatively play 40b selfplay games at the speed of 20b selfplay games) just by adjusting the speed of no-resign games! Something should probably be done...
An illustration here. To investigate a little, I have been running autogtp with a single thread (instead of the usual 2 or 3) since yesterday, in order to easily measure the duration of saved games :-). Here are the results. My computer (GTX1080) took only ~7 minutes to play that game to move 153, where White is totally lost (White had already been lost for 20 moves and just let Black kill its big top-left group: the game is more than over at that point). But the game then took an additional 1 hour and 1 minute to reach the end at move 722 (playing many horrible moves, extremely slowly), finishing with that position (which I won't comment on ;-).
During that hour, my computer could have played about 7 to 10 normal games (depending on their length)!
I know we need no-resign games to cleanly learn difficult positions (such as sekis and multiple kos) and to avoid misclassifying won/lost games. But do we really need to play these endgames at full strength? I doubt it, as explained by @gjm11. On the other hand, as long as the resign threshold is not reached, the position is complex and we should not change anything, even if the game is long! E.g. a 585-move game with a seki and double ko in a match game (LZ172 vs LZ171) yesterday: http://zero.sjeng.org/viewmatch/b8a9ae1e2e45f7b2c53b48adbc895e59b620e5539620bc1aeff771dd2f6e7cc5?viewer=wgo. But that's consistent with @gjm11's proposal: reduce visits only if the resign threshold is reached, which was obviously not the case in that game, as it continued for a long time.
One counter-argument goes like this. "The point of no-resign games is not only to anchor winrates to reality, it's also to give LZ some experience in playing highly unbalanced positions. So we want the best play we can have in those positions, and cranking the number of visits way down will impair that and make LZ less good at dealing with those situations."
To which I think there are two things to say.
First: it is clear that LZ-with-1600-visits is not playing well in these highly unbalanced positions. Many of those post-resignation moves are total nonsense. They may actually be worse moves than would be obtained by just following the policy head without any search. I'm sure they're worse moves than would be obtained by taking (say) the best 5 moves from the policy network, playing each and evaluating them with the value head, and picking whichever one comes out best. I suspect that doing ordinary MCTS with, say, 10 visits might be close to this.
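As a rough illustration of that cheap alternative, here is a minimal sketch; `policy`, `value`, and `pos.play` are assumed helpers for the purpose of illustration, not actual Leela Zero APIs:

```python
# Hypothetical sketch of the "top-5 policy moves, scored by the value head"
# fallback described above; policy(pos), value(pos) and pos.play(move) are
# assumed helpers, not real Leela Zero internals.
def cheap_endgame_move(pos, policy, value, k=5):
    # Take the k moves the policy head likes best (policy(pos) -> {move: prior}).
    candidates = sorted(policy(pos).items(), key=lambda mv: mv[1], reverse=True)[:k]
    # value() is assumed to return the winrate for the side to move, so after
    # playing our candidate we want the opponent's winrate to be as low as possible.
    return min(candidates, key=lambda mv: value(pos.play(mv[0])))[0]
```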
Second: it is plausible that this disastrously bad play only starts happening once the position gets more unbalanced than the resignation threshold, in which case (a) cranking the visit count way down immediately might indeed impair learning a bit and (b) it might miss some chances for the player who would have resigned to turn things around. To deal with those, perhaps the condition for rapid-play should be something a bit more complicated: drop the visit count when (1) one player would have resigned and (2) the current position's value-head says (say) either <10% or >90%. Or: drop the visit count 20 moves after resignation would have occurred, if the position is still resignable. Or something along those lines.
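A minimal sketch of the simpler version of that trigger (hypothetical constants and helper, using the visit numbers discussed in this thread):

```python
# Sketch only: drop the visit count once (1) a player would have resigned and
# (2) the root value-head estimate is extreme. Names and numbers are illustrative.
FULL_VISITS = 1600
FAST_VISITS = 200

def visits_for_move(would_have_resigned, root_winrate):
    extreme = root_winrate < 0.10 or root_winrate > 0.90
    return FAST_VISITS if (would_have_resigned and extreme) else FULL_VISITS
```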
(I do still think that the simple policy of dropping the visit count way down as soon as either player would have resigned would probably be highly beneficial, even without those tweaks.)
@gjm11 Agree with your comment:
They may actually be worse moves than would be obtained by just following the policy head without any search.
I actually did the test a while ago on 15b with 1po and with 25v and this resulted in very sensible endgames; see https://github.com/gcp/leela-zero/issues/1656#issuecomment-408098769.
With low playouts we definitely want to turn off Dirichlet noise and possibly randomness as well.
I wonder how a combined approach of https://github.com/gcp/leela-zero/issues/1361 (scaling c_puct in tree search, effectively making winrates more differentiated) and https://github.com/gcp/leela-zero/issues/1610 (use cross-entropy instead of MSE for value loss in neural network training) would fare. From https://github.com/lightvector/GoNN#cross-entropy-vs-l2-value-head-loss-aug-2018:
Theoretically, one would expect cross entropy loss to cause the neural net to "care" more about accurate prediction for tail winning probabilities, and in fact this manifests in quite a significant difference in average "confidence" of the neural net on about 10000 test positions from the validation set. And indeed by manual inspection the cross-entropy neural nets seem to be much more willing to give more "confident" predictions.
@gjm11 You might want to see/check my figures here: https://github.com/gcp/leela-zero/issues/1789#issuecomment-417336159. Hope I'm miscalculating...
Your rough estimates seem pretty much in line with my rough measurements. Mine are a little less pessimistic than yours, but it seems very clear that a very large fraction of LZ contributors' computer time is going to "post-resignation" moves.
I understand that there's a significant cost, but on the other hand we need some of these games to correct the misevaluations. It may very well be that the engine could speed up on extreme winrates (actually, Leela 0.11 has exactly this kind of logic...), but I don't see any good way to quantify the effect on the entire training procedure (a common problem here!), so I'm wary of deviating further from the AGZ baseline.
@gcp Using t=1 all game long is a major deviation from the AGZ paper. It is undisputed that this is all for the good of the policy head, although possibly at a small cost for the value head, as the outcome of the game becomes a bit more noisy. But why keep t=1 after the resignation threshold has been hit?
If, as you just reminded us here above, the goal* of playing a fraction of no-resign games is for the value head to discover some misevaluations, then to achieve this goal we need the position and the rest of the game to be played by both sides as perfectly as possible, i.e. without randomness. Very often, over the tail of a no-resign game, the score drifts further in favor of the winning side. It becomes less and less likely that a misevaluated local situation can reverse the game, not to mention the fact that playing with a lot of randomness (which is what happens at very high/low winrates) may not be the best way to discover the true value of a local situation.
It might be that under this setting, "t=0 after the resignation threshold is first hit", the game would receive a much more reliable evaluation, and incidentally would be shorter (it is much easier to land on two consecutive best-move passes without randomness than with it).
A closely related question: why is the resign analysis done on games played with t=1 even in the part of the game where the resignation threshold analysis takes place? Suppose we want to determine the false-positive rate beyond (i.e. below) a 15% 'best score'. Shouldn't t be turned to 0 once this 15% is reached? If not (with t=1), aren't we generating a fraction of false false-positives? Has this fraction been evaluated?
I would a priori rule out the goal of t=1 after the resignation threshold being for the sake of exploration. Due to the flattening effect of the search, this amounts to an exploration rate equivalent to a much higher level than under ordinary winrates.
@Ishinoshita Seems a good point indeed to turn t=0 after resignation threshold is hit: it should reinforce the value head (and as a side effect, it could also shorten no-resign games, which would be welcome :-). I also think we could lower drastically the number of visits once the resign threshold is hit, to v=200 for example: I don't see why it would have adverse effects, but that cannot be excluded. So it's more speculative...
@Friday9i To be thorough, if we want the best possible play from the resign threshold on, we should in fact remove any sort of randomness, i.e. remove the Dirichlet noise as well. Then, whether 1600v is a real gain in value accuracy vs 200v is another question, which should be answered with the same metric as the one used for fixing the resignation threshold, i.e. by comparing the false-positive rate with 200v starting from the resign-threshold hit vs continuing with 3200v.
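A minimal sketch of that comparison metric, assuming a hypothetical per-game record format (this is not the real training-data layout):

```python
# Sketch only: the false-positive rate is the fraction of no-resign games in
# which the side that first crossed the resign threshold went on to win.
# The dict keys used here are illustrative, not the actual SGF/training format.
def false_positive_rate(games):
    flagged = [g for g in games if g.get("first_below_threshold") is not None]
    flipped = [g for g in flagged if g["winner"] == g["first_below_threshold"]]
    return len(flipped) / len(flagged) if flagged else 0.0
```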
For Chess, it looks like T=1 in endgames has caused Leela Chess Zero to have trouble learning correct endgame value and policy: https://github.com/LeelaChessZero/lc0/issues/237. (Briefly, chess endgames can require many consecutive exact moves, and noise+temperature makes the network incorrectly learn that some positions are winnable because the opponent may randomly blunder.) It seems less likely that Go has these "chess end game problems," but generally, T=1 throughout the game might not have been fully thought through for AlphaZero, or at least it was sufficient for a "good-enough-to-publish" approach rather than one optimized for network strength and/or network progress efficiency.
Switching to T=0 after hitting the normal resign threshold is interesting, and it would be more like AGZ in generating end game training data. @Ishinoshita any particular explanation if this would even help speed up end games? (I suppose, did people run into this extremely slow end game situation before we switched to T=1?)
Or the temperature can be reduced gradually as the number of legal positions decreases, if such changes are too extreme?
@Mardak
any particular explanation if this would even help speed up end games?
A no-resign game will continue until the 722-move limit unless two consecutive pass moves are made. If at some point in the endgame both players' best moves are 'Pass', they will be played with 100% certainty if there is no randomness. With t=1, that opportunity will be missed most of the time because of the very flat policy.
Also, you will note that usually, when a game is played out for scoring, there is a rule against filling the penultimate eye of a group.
With t=1, the losing side lets most of its groups die by failing to make eyes in time or, even worse, by filling its own eyes. This causes big groups to be captured, creating large open areas in which both sides will play again, etc. This just delays the end of the game. Note that even 1po plays a very decent endgame, avoiding these useless (and desperate) series of captures.
@isti2e LC0 is considering temperature decay, but it's not clear to me what the status is.
@Mardak
It seems less likely that Go has these "chess end game problems,"
True overall, agreed. But Go has tactical problems of its own, like some semeais, that need a long tactical line to be correctly solved. Suppose there is an unsettled semeai on the board, with one side winning the semeai (or able to get a seki). At some point in the game, when the adjacent liberties are filled, one side will need to answer all of the opponent's moves in that area in the right way to get the optimal result. A single wrong move can flip the situation.
Suppose such a position remains on the board, unsettled, after the resignation threshold is hit. The policy will be rather flat, given the extreme values. Even if the best move found by the search is the 'right' move, randomness may very well pick another one. That's true for both sides, I must admit. But then this amounts to saying that we are trying to evaluate the semeai with a single, more or less random, rollout ;-).
So I'm inclined to believe that no-randomness endgame past resignation will evaluate the board more accurately. But offered with no proof, just my feeling.
Using t=1 all game long is a major deviation from AGZ paper.
But it's in the AZ paper, and they had a good outcome for Go. So this isn't an argument at all.
But why keep t=1 after the resignation threshold has been hit? If, as you just reminded here above, the goal* of playing a fraction of no-resign games is for the value head to discover some mis-evaluations, then to achieve this goal we need the position and the rest of the game to be played by both sides as perfect as possible, i.e. without randomness.
If the value head is wrong, I'm not sure why you think removing some randomization improves the chances for the network to stumble on the fact that it is. I'd be afraid you'd make things worse because it'd keep repeating the same mistake.
Think about this: if the value head is wrong, why would playing the best move according to this same value head mean the engine "plays both sides as perfect as possible"? I'd expect the exact opposite! Without randomness it may never find out why it's wrong.
Briefly, chess end games can require many consecutive exact moves and noise+temperature makes the network incorrectly learn some positions are winnable due to the opponent randomly blundering.
Remember that we switched to t=1 exactly to avoid the situation where you open at 2-2 and LZ says 0% winrate and suggests you play random moves from now on. This is what a "correct" t=0 should lead to.
In chess, if the position is objectively draw, but much harder to defend for the opponent, do you score it as 0.0 or higher? The correct answer to this may depend on how strong the opponent is (i.e. contempt).
(This is a bit different than the t=0 after resigning discussion. The idea that the program has to play "correctly" there seems rather questionable for the reason I laid out in the previous post)
That's true for both sides, I must admit. But then this amounts to saying that we are trying to evaluate the semeai with a single, more or less random, rollout ;-).
Following this reasoning, the program will end up preferring positions where the semeai is settled as "more winning" than ones where it's left unsettled.
This doesn't even seem wrong to me.
@gcp My mistake. I stand corrected, thank you.
why you think removing some randomization improves the chances for the network to stumble on the fact that it is?
Just (bad, poor human) intuition, I must admit. In endgames, possible misevaluations of the outcome often boil down to misevaluations of group status. As in chess, I would expect a pair of randomized players not to discover a long, precise tactical line, and, on average, to wrongly evaluate such a position.
But your initial point regarding AZ is hard to refute... Let's say there is no technical argument, just a psychological issue, easy to work around ("NEVER launch LeelaWatcher! NEVER open a self-play game!" ;-)
AZ uses 800 playouts while we use 1600 visits, which take about equal time in normal situations, but past resignation the former is probably much faster. Maybe we can switch back to 800 playouts when the resignation threshold is reached, if the goal is just to speed things up. BTW: I'm surprised that the reduced playouts (and increased randomness) in AZ don't require smaller Dirichlet noise to compensate.
@gcp I take your point on AZ as the killer argument, the most convincing one. Just to explain my point, however:
Think about this: if the value head is wrong, why would playing the best move according to this same value head mean the engine "plays both sides as perfect as possible"? I'd expect the exact opposite!
Post-resignation, the losing side usually plays quite badly and the value quickly drops well below 5% (like between 0 and 2%). At this point, the search is no longer guided by the value, which is close to binary. It's mostly driven by the sole component which has not gone wild at this stage, the raw policy, which generalizes quite well from pre-resignation positions and retains good go common sense even in these extremely unbalanced situations, but which gets distracted by the Dirichlet noise and flattened by cpuct over the course of the 1600-visit search (with almost no tree reuse). So the best move is only marginally, if at all, influenced by the value itself. And then this 'best move' will probably not even be selected, because of the temperature...
Without randomness it may never find out why it's wrong.
But with mostly randomness in the move selection (random picking from a flat and very spread-out distribution), I don't see how a complex tactical line can receive a good evaluation on average, such that the value head receives some useful signal over the course of training for that position or local situation. The average outcome will be the Monte Carlo evaluation of the position, no?
There's no dispute that randomness is necessary for self-play learning. I believe that all the discussions here boil down to a matter of proportion. Some people, like me, find it hard to see that there is some useful signal with that proportion of randomness. But intuition is sometimes misleading. And then again, there is the AZ argument... Anyway, thanks for having taken the time to explain. I still plan to upgrade with a GPU ;-)
But it's in the AZ paper, and they had a good outcome for Go.
The AlphaZero paper shows "Performance of AlphaZero in Go, compared to AlphaGo Lee and AlphaGo Zero (20 block / 3 day)". Maybe AZ-Go could eventually get stronger than AlphaGo Master and AlphaGo Zero 40b, but clearly they had those reference Elo numbers and chose not to publish comparisons, which probably would have shown lacking strength relative to the other AlphaGo approaches.
This is why I referred to it as sufficient for a "good-enough-to-publish" approach rather than one optimized for network strength and/or network progress efficiency.
If we're looking to efficiently generate stronger Go networks, I would think the AGZ approach could be more suitable than AZ's.
Sounds like most of these proposed changes need some comparable evidence generated. I hope it isn't too off topic, and I believe it has been mentioned previously, but should we be thinking more seriously about using some of the game generating capacity to try experiments that might be useful for these sorts of things?
In chess, if the position is objectively draw, but much harder to defend for the opponent, do you score it as 0.0 or higher? The correct answer to this may depend on how strong the opponent is (i.e. contempt).
The intent of the Alpha(Go)Zero approaches is for self-play to be against an opponent of similar level: "AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of tree search, resulting in higher quality move selection and stronger self-play in the next iteration."
At least for Chess, noise+temperature leads to mis-evaluating positions as not drawn, because the training data shows the opponent is more likely to blunder on average. This causes the policy to train towards these seemingly winnable positions, which end up drawn anyway; when compared against Stockfish, an AZ without these endgame issues could potentially have won more games instead of drawing.
Playing with noise and temperature is by design, to increase exploration at the short-term cost of "optimal" play in the hope of improving long-term strength, but it seems that the latter might not be obtainable, or might be significantly restricted, in certain situations, e.g., chess endgames.
Remember that we switched to t=1 exactly to avoid the situation where you open at 2-2 and LZ says 0% winrate and suggests you play random moves from now on. This is what a "correct" t=0 should lead to.
Yes, and the project intent there was to make a more handicap-friendly network. People playing handicap games where Leela resigns at 92 moves is not that fun. Objectively, a 2-2 opening move with the opponent playing the best moves probably does lead to close to 0% win rate.
Now that there are approaches such as dynamic komi #1772 to better support handicap play, perhaps T=1 all game long is not as necessary?
If the value head is wrong, I'm not sure why you think removing some randomization improves the chances for the network to stumble on the fact that it is.
If the value head is wrong precisely due to temperature randomization, then increasing the percentage of training data that plays the most-visited move would correct the mistake. In analyzing some Chess positions, the probability of Leela Chess Zero playing the correct move for 50 moves in a row to trigger a draw can be less than 50%, i.e., the "average training data" shows the position as winnable. So if that would normally have been 45% draw and 55% win with T=1, then if 10% of games played that endgame position with T=0, flipping it to 55% draw and 45% win, new networks would at least train closer towards drawing.
From the earlier linked chess issue, dropping temperature from 1.0 to ~0.9 would make the likelihood of playing 50 moves to draw increase from 36% to 50%, so there doesn't even need to be "large" changes to temperature.
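To illustrate why such a small temperature change compounds so strongly over a long forced sequence, here is a rough sketch; the visit counts are made up for illustration, and moves are assumed to be sampled in proportion to N(a)^(1/T):

```python
# Illustrative only: how the chance of playing the most-visited move for 50
# consecutive moves changes with temperature. Visit counts are hypothetical.
def pick_probability(visits, temperature, best=0):
    # Probability of sampling the most-visited move when moves are drawn
    # proportionally to N(a)^(1/T).
    weights = [n ** (1.0 / temperature) for n in visits]
    return weights[best] / sum(weights)

visits = [9800, 100, 50, 30, 20]  # hypothetical root visit counts, best move first
for t in (1.0, 0.9, 0.8):
    p = pick_probability(visits, t)
    print(f"T={t}: per move {p:.4f}, over 50 moves {p ** 50:.2f}")
```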
For example, an "in between" AGZ and AZ approach would be T=1 for first 30 moves and T=0.8 for the remaining moves. (As opposed to AGZ T=1 for first 30 and T=0 for remaining; and AZ T=1 for all moves.)
There are several proposals in this thread that are meaningful but that would need to be tested 'live' on the main project, since they cannot be 'proved' to work by reasoning or offline tests alone, as they concern hyperparameters of the self-play training process itself. Discussions can be endless.
IMHO, taking the time to test at least one of them is as valuable as switching to a 40b network, for instance. We may learn useful tweaks that would benefit the project in the long run and bring a better understanding. I would personally be more inclined to donate computing time for more 'exploration' for a while than just for more strength.
I modified leelaz on BRII and since last Sunday it only randomizes the first 30 moves (i.e. back to the AGZ setting) regardless of what the server/autogtp asks for. This should shorten no-resign games (playing passes earlier and reusing more of the tree due to less randomness) and make self-play games enjoyable again (due to fewer blunders), and it also serves as an experiment to see whether it speeds up progress and improves the endgame (still behind ELF, it seems).
More extreme winrates can be anticipated, but people interested in 40b probably care more about pure strength (e.g. competing with Fine Art) and about more objective winrates for understanding AGZ 40b games, and less about moderate winrates for analyzing human games. (ELF's extreme winrates remain unexplained (https://github.com/gcp/leela-zero/issues/1788); its black winrate should be inflated, but there's an example where black's NN eval is a lot lower than the played-out results, which is also unlikely to be due to MCTS parameter differences.) High-handicap performance now hinges upon monotonicity in komi and not on moderate winrates; even though 40b has been trained with -m999 games, monotonicity has vanished and people are relying on pangafu's move filtering strategy to use 40b in handicap games.
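For clarity, the change being tested amounts to something like the following sketch (hypothetical Python, not the actual leelaz patch, which lives in the C++ engine):

```python
import random

RANDOM_MOVES = 30  # back to the AGZ setting

def sample_proportional_to_visits(children):
    # children: list of (move, visits) pairs; t=1 sampling over visit counts.
    moves, visits = zip(*children)
    return random.choices(moves, weights=visits, k=1)[0]

def select_move(children, move_number):
    # Randomize only the first RANDOM_MOVES moves, regardless of the -m value
    # the server requests; afterwards always play the most-visited move.
    if move_number < RANDOM_MOVES:
        return sample_proportional_to_visits(children)
    return max(children, key=lambda mv: mv[1])[0]
```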
@alreadydone
I modified leelaz on BRII...
What is this Big Red II instance? The server managing game requests? So, shall we understand this is a live test?
Also, just to confirm: as I understand it, there is no way to identify the difference in playing conditions in the training data, is there (except by the games coming from a particular network)?
BRII runs the autogtp client; it receives game requests but does not send them, so the change won't affect other clients. Presently you can't identify the games from the parameters in the sgf, only from the IP or by using Ishinoshita's analysis script. I'll modify autogtp to make the sgf show the correct -m value. (Update: done.)
BTW, I have finally shared my Python script as a Jupyter notebook in my repo, although it's far from finished. It's fresh beginner's code, so I crave your indulgence ;-)
@alreadydone
it receives game requests but does not send them
Sounds like a great first experiment in this matter, although I'm not sure I fully understand the plan. How many -m 30 self-play games do you plan to generate with BRII, to start with? Will you train a forked network yourself with those games, or will you make them available to someone else who will do the training?
Well, the games with -m30 are still sent to the official server and they are not separated from the other games as far as I know, so both @gcp and @bjiyxo will be using them in training. Since games from BRII make up a significant portion of the training data, we can observe the effects in the performance of newly trained networks. That's the plan.
The problem is that a proportion of ELF1 games has been injected recently. So we now have regular LZ -m999 games (with 10% -r0), ELF1 games (-m999 but only -r5), and BRII -m30 self-play games, each in varying proportions in the training window over time. I am a bit concerned about the possibility of drawing firm conclusions. I had imagined your -m30 experiment, or any other trial concerning the randomness level, being done all other things being equal. Let's see!
even though 40b has been trained with -m999 games, monotonicity has vanished
Do you mean the dynamic komi stuff no longer works with the official networks, because adjusting the komi planes produces erratic results? If so that's unfortunate.
relying on pangafu's move filtering strategy to use 40b in handicap games
Which one is that? Only playing top X or top X% policy moves? I'd take pull requests to add this to the main/next branch, especially now that it can be configurable via lz-setoption etc.
@gcp Yes, 40b doesn't look good for dynamic komi. @pangafu's mod isn't currently open-source; he once said he'll open-source it when the code is cleaned up, but I think he's still testing (distributing an exe inside a small community; people joked about 3-meter-long parameter lists like the following).
-g -t 8 -r 0 --batchsize 5 <my parameters: --tg-auto-pn --handicap --min-wr 0.12 --max-wr 0.24 --wr-margin 0.10 --target-komi 7.5 --adj-positions 400> <pangafu's parameters: bias-nonslack --bias-center 0.6 --bias-rate 1.5 --bias-maxwr 0.95 --handicap-filter-step 40 --nolack-max-wr=0.55 --nolack-min-wr=0.15 --nolack-wr-margin=0.1> -w E:\go\lizzie\N_GX65
IIRC it relies on some hard-coded rules with pattern matching, like only allowing approach and never 3-3 invasion to a star-point, and forbidding moves far from the corner for the first move, etc.
There was some discussion in #1681 of this, but that's clearly not the right place for it and I think there were some misunderstandings.
It looks to me as if an alarmingly large fraction of LZ contributors' computer time is spent playing the "tails" of no-resign games, and I suspect that playing those tails faster might be a better tradeoff, achieving perhaps double the overall rate of self-play at what seems likely to be negligible cost in quality of games.
(Pre-emptive clarification: I am not suggesting either abandoning no-resign games or reducing the fraction of games played without resignation.)
The cost
My laptop has spent about 11 hours today on LZ self-play games. Its GPU isn't very fast; it's played 10 games in that time. Two of them (exactly the "correct" proportion) were no-resign games.
Those two games took almost exactly twice as much time as the eight other games put together.
This means that the "tail" of a no-resign game (i.e., everything after when one player would otherwise have resigned) costs about 7x as much as an ordinary game. (I assume that the "head" of a no-resign game, the part before a resignation would have happened, costs the same as an ordinary game does.)
Is it worth it?
There's no question that we need no-resign games, and I expect the fraction of games that should be no-resign has been chosen well. But does playing out their "tails" at full strength buy us enough to justify the cost? Even when (1) by all accounts the actual quality of play in the tails of these games is low, (2) in normal circumstances LZ-with-one-visit plays at a good dan level, and (3) it's credibly conjectured that LZ's bad play in highly unequal positions is at least partly caused by the search?
Maybe it does. But to me it seems rather unlikely.
One concrete proposal
Suppose that in no-resign games LZ switched from 1600 visits to 200 visits at the point when one player would have resigned. (200-visit LZ is an extremely strong player, of course.) Then the "tail" would cost about the same as the "head", and in the time it currently takes to play 8 "resign" and 2 "no-resign" games LZ could play 16 "resign" and 4 "no-resign" games, with a little to spare.
Can the cost of (possibly) slightly lower-quality "post-resignation" play in no-resign games really be enough to outweigh the benefit of twice as many self-play games?
I don't claim that this is the best option. Some other tradeoff between speed and quality might be better. I picked this one just because it makes the issue nice and clear.
How could I be wrong?
If LZ really plays appreciably better post-resignation with 1600 visits than with 200 visits, then the cost of the proposal above may not be negligible (though I would guess it's still small).
If my calculations are wrong somewhere and the cost of no-resign games isn't as large as I think it is, then the benefits of speeding up no-resign tails may be smaller than I think.
If my measurements are terribly atypical (something about my machine makes it extra-bad at no-resign games' tails, or I just got unlucky today), then of course any estimates based on those measurements will be misleading.
If speeding up self-play by 2x wouldn't bring substantial benefits (e.g., if the actual bottleneck is now NN training time on whatever machines GCP uses for that, and if that bottleneck can't easily be widened) then obviously there's not much use in making self-play more efficient.
If the risk of making any change at all is so great that not even a 2x gain in learning speed outweighs that risk, then of course it's better to leave things as they are.
Or of course there might be important things that I haven't thought of at all.
Boring algebra
Just to explain the calculations I left implicit above: Suppose the cost of the tail of a no-resign game is t times that of the head, or equivalently of a normal game. Then two no-resign games cost 2(1+t) normal games. If that's twice as much as eight normal games then 2(1+t) = 2x8 so 1+t=8 or t=7. Reducing from 1600 to 200 visits in the tail would reduce t to a little less than 1. If it became exactly 1 then the cost of 16 normal plus 4 no-resign games would become 20+4 = 24 normal games, the same as the 8+2(1+7) cost of half as many games before the reduction.
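The same arithmetic as a quick sanity check, using only the numbers from this post:

```python
# Costs are in units of one normal (resign) game.
t = 7                              # tail of a no-resign game: ~7 normal games
cost_now = 8 + 2 * (1 + t)         # 8 normal + 2 no-resign games = 24 units
cost_fast_tail = 16 + 4 * (1 + 1)  # with ~1x tails: 16 normal + 4 no-resign = 24 units
print(cost_now, cost_fast_tail)    # 24 24 -- twice as many games for the same cost
```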