ASilver opened this issue 6 years ago
I think we should rely on the Dirichlet noise and temperature to do exploring.
If we do not tune PUCT for best self play match results, how do we tune it?
There is always give and take, so the balance between exploitation and exploration is obviously a complex one. I have done a lot of testing with the PUCT values to better understand their effect on Leela's play. One thing has come out very clearly: a higher PUCT value always leads to better tactics. Sometimes it finds moves it otherwise would not, and the moves it does find are found much faster. The NN is learning which moves to value and which not to, which is the purpose of the training. I think it should be encouraged to seek out more moves, not fewer.
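(For reference, the selection rule being tuned here is the standard AlphaZero-style PUCT formula. The sketch below is a minimal illustration of it, not Leela's actual code; it shows why a larger c_puct weights the prior-driven exploration term more heavily relative to the running value estimate Q.)

```python
# Minimal sketch of AlphaZero-style PUCT selection (standard formulation,
# not Leela's actual source). A larger c_puct weights the prior-driven
# exploration term U more heavily relative to the value estimate Q.
import math

def puct_select(children, c_puct):
    """children: list of dicts with prior 'P', visit count 'N', total value 'W'."""
    total_visits = sum(ch["N"] for ch in children)
    best, best_score = None, -float("inf")
    for ch in children:
        q = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0  # FPU handling omitted
        u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])
        if q + u > best_score:
            best, best_score = ch, q + u
    return best
```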
There is another thing worth adding: in the training games both sides get the same PUCT value, so you aren't really making it stronger than itself; you are making it beat a previous executable that is no longer in use. In other words, regardless of the PUCT value, both sides get the same one, so the lower PUCT value is only stronger against a previous version not in use. The only effect in training is that fewer moves are analyzed.
I think that, given zz's tuning at longer time controls (such as 10 minutes) showed FPU 0 and PUCT 0.7, those values make sense.
It makes sense to tune for the values you want to use in matches, even if that is perhaps 15 Elo weaker in self-play at fast TC.
Well, you can tune for matches or you can tune for learning.
In addition to what's already been said here, I think the recent bugs/regression/overfitting should be considered as well... It's evident that the buggy data had caused the network to learn some really bad policies, and it's not going to have the easiest time unlearning some of that. Until it's clear the network has recovered, I favor basically anything that will increase exploration during training.
I ran a small test to illustrate the difference. I took a set of easy tactics with single solutions and tested id302 on them with v10.
Default settings (10 seconds per move): 91 of 201 matching moves
Default settings (20 seconds per move): 98 of 201 matching moves
PUCT=1.5; FPU=0 (10 seconds per move): 109 of 201 matching moves
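For anyone who wants to reproduce this kind of matching-move count, here is a rough sketch using python-chess; the engine path, suite filename, and time limit are placeholders, and the actual test above may well have used a different harness.

```python
# Rough sketch of a "matching moves" tactics check (assumes an EPD suite
# with "bm" opcodes and a UCI lc0 binary; not the harness used above).
import chess
import chess.engine

def run_suite(epd_path, engine_path="lc0", seconds=10):
    matched = total = 0
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        with open(epd_path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                board, ops = chess.Board.from_epd(line)
                if "bm" not in ops:
                    continue
                total += 1
                result = engine.play(board, chess.engine.Limit(time=seconds))
                if result.move in ops["bm"]:
                    matched += 1
    finally:
        engine.quit()
    return matched, total
```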
We tuned it based on self-play, because our goal is to win chess games. The theory is that if you use the settings that produce the strongest self-play, and run your training loop with those settings, you will get the best feedback to improve the net. LZGo used this process and it worked well.
The problems started with v8, which is the same release that changed PUCT. This makes PUCT a suspect. But the current theory is that PUCT was not wrong, instead it only exacerbated the long standing overfitting problem.
If we cannot tune based on self-play results, then we would need to tune based on ~1-week experiments with hyper-parameters. Maybe we're really in that situation, and it might be worth the time to do that experiment. But I propose we continue with Error's plan and see if things get better.
I think your proposal is to change the tuning process from best self-play to best at solving tactics. But I don't think that's a good way to tune the parameters, because all those positions have solutions that can be found by doing a wider search. You will always end up with parameters that favor a wider search if you tune with this method, and those parameters will not be the best for self-play results. We won't know whether they give the best settings for the training feedback loop without doing ~1-week-long tests.
It's a judgement call which week-long tests we should do. I think this is an interesting test, but there are others that are more interesting first.
I agree with @killerducky's basic reasoning that without tuning for strength, the parameter choices become somewhat arbitrary, but unfortunately I also have the feeling that tuning for tactics to some extent may well be necessary for beating any kind of Alpha-Beta engine, and that this metric is very important for a large part of our support base.
We should probably clarify the project aim in this regard: For the deepest positional play, I'm sure we're on the right path, but if the aim is to compete successfully in the next TCEC and eventually defeat Stockfish, probably not. As the latter goal seems to be very important for many, there should be some sort of vote or at least debate on project objectives.
Maybe another metric we could investigate here is the effectiveness of the Dirichlet noise feedback? If we had statistics on how often noise finds low-policy, high-value moves at different PUCT values, and how effectively the policy on those moves is then raised by feedback, that could provide a sounder rationale for a higher PUCT.
Since I strongly suspect that tactical ability is highly correlated with noise effectiveness, I propose we investigate this aspect some more. I'm not sure if the training data will allow easy extraction of how often moves with low policy priors generated high visit counts, logged debug data should definitely have this information though. If it turns out that this feedback cycle is vastly more efficient at higher PUCT, we should probably raise the value to some sort of balance between self-play strength and noise feedback efficiency.
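For context, this is the root-noise mechanism in question: the priors at the root are mixed with Dirichlet noise before search, so that low-policy moves occasionally receive enough visits to be discovered at all. A minimal numpy sketch of the AlphaZero-style mixing follows (the epsilon and alpha values are illustrative, not necessarily the project's settings):

```python
import numpy as np

def add_root_noise(priors, epsilon=0.25, alpha=0.3, rng=None):
    """Mix Dirichlet noise into root priors, AlphaZero-style.
    priors: 1-D array over legal moves, summing to 1."""
    rng = rng or np.random.default_rng()
    noise = rng.dirichlet([alpha] * len(priors))
    # Higher epsilon (or higher PUCT) gives noisy moves more chance of visits,
    # which is the "noise feedback" discussed above.
    return (1 - epsilon) * priors + epsilon * noise
```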
Idea to consider: tuning for strength might be a valid approach, but rather than tuning for strength at 800 nodes, we tune for strength at a large number of nodes. I have a couple of ideas for why this would be justifiable, neither of which is perfectly convincing, but it still seems like a decent idea.
1) We want the resulting net to scale well; it's possible the tuning conditions should be aspirational, to get the best value out of training towards the goal of scalability.
2) Training reinforcement is largely about making 800 nodes more and more effective, so the convergence over time should be towards that 'large nodes' behaviour; the PUCT that is good for that aligns well with the reinforcement goal.
Based on some other numbers I've seen before, this justification would suggest raising training puct from 0.6 to at least 0.7, possibly 0.8 would be a good idea.
I could be wrong of course, but I see a problem with using only a randomizer to hit on a tactic that it would not find otherwise:
If it doesn't understand the tactic, then on the very next move it could ruin the whole point, and thus the conclusions it draws. Remember, you are forcing it to play a move it did not like at first, and that doesn't mean it will suddenly, magically, play the correct continuation.
After all, on the next move it still doesn't like the direction the game is going, so the correct continuation might be nowhere near its favorite moves, and it will then depend on the Dirichlet lottery to continue correctly.
I am sure the Dirichlet randomizer is a great idea for promoting different positional ideas, but I cannot see it being good for tactics, unless it is fortunate enough to realize on the very next move that it is winning.
WAC (my revised version) is very good at one thing: showing the types of tactics it is almost permanently blind to, no matter how shallow they are, and there are a few.
Dirichlet noise most definitely works to find tactics. The idea is that the network already knows the correct tactical line up to n-1 plies, but it doesn't know the first move yet. Then noise will allow it to eventually recognise the tactic one ply deeper than before. While you can't teach the net very complex tactics all at once, it will improve its ability over time solely due to noise, as long as the noise feedback is strong enough. My concern is that this feedback was stronger when we had higher PUCT, which allowed the net to improve its tactics better. In our regression phase it probably forgot a lot, and it may currently have difficulty relearning the tactics if the feedback is weaker now. So let me assure you, Dirichlet noise has nothing to do with a lottery or randomiser. It will fix the policy head in whichever direction is actually conducive to winning games. That includes tactics as well as positional play.
Ok, thanks for clarifying. I don't do a lot of tactics testing since I favor games, but there is zero question that this is the number one thing holding it back (overall, aside from bugs and the like). Obviously right now it needs to be cured of its suicidal closed game evals, among others, but that is bug-related (ID237 did not have them for example), and for another discussion.
I support an immediate raise to 0.7 PUCT. It's not far above the 0.677 PUCT from the 4-minute-game tuning, which was the highest PUCT tune.
I'd support a revert to 0.85, for the simple reason that we had tactically much stronger nets before introducing the PUCT change, and it would eliminate one more of the suspects for the current weakness. After all, we reverted FPU reduction in training, might as well go all the way now and revert PUCT change.
"I'd support a revert to 0.85, for the simple reason that we had tactically much stronger nets before introducing the PUCT change, and it would eliminate one more of the suspects for the current weakness."
I can't argue with that logic.
It's pretty easy to argue with it. We are progressing rapidly now, so the problem must have been FPU, value head overfitting, or the LR, all of which we fixed (at least partially). There is no evidence that PUCT itself has more than guilt by association. The value head is already recovering fine. I still support using a longer-TC tune for PUCT, but I agree with @killerducky that if you can't use self-play tuning or gauntlet tuning and have to try things for a week, we're pretty lost, since we have so many experiments to try.
I am confused every time I hear somebody mention that the recent regressions and blunder are a result of "overfitting." Instead, I think it's perhaps more accurate to say that the recent regressions and bugs are a result of "fitting" - that is, fitting to bad data that was generated by a buggy engine.
The bugs in the engine were certainly the primary cause of everything that's gone bad, not learning rates, oversampling, etc; those additional factors may have aggravated the blatantly obvious underlying issue, though... My understanding is that PUCT tuning was done with a buggy engine on networks that had been training on bad data that was generated by buggy engines, correct? Those values should be ignored in my opinion.
My comment here is again that I am in favor of any changes, within reason, that will increase exploration -- as I think this is the best and quickest way to recover from Leela "fitting" to bad data.
@so-much-meta please check https://github.com/glinscott/leela-chess/wiki/Project-History for a timeline, and look at the Elo graph. Also https://github.com/glinscott/leela-chess/wiki/Large-Elo-fluctuations-starting-from-ID253 for a summary of the issue.
v10 was released at around ID271, and the graph continued to go down. value_loss_weight was changed around ID287, and the graph immediately started to go up. This plus many other indicators show the main problem was overfitting, not the bugs (rule50, all 1s) or the other params (puct, fpu).
Do you have data that shows the opposite?
We have also redone the PUCT tuning after the bugs and overfitting were fixed, and it shows the current value is still good. We will retune again later when the net has recovered more.
@killerducky I presume by graph you mean the self-play rating. I know the ratings are not directly comparable, since in a direct match NN237 (rated 5544) beats NN303 (rated 5710) by well over 100 Elo.
@jjoshua2 Tuning it to beat itself in self-play is not the same thing as tuning it to learn the most the fastest. In fact, this was proven wrong well over 10 years ago with Chess Tiger 2007. Christophe Theron, its programmer, had predicted a 120 Elo increase on the basis of self-play against previous versions. The reality was... zero, once it faced opponents other than itself.
The self-play ratings are not perfect, but they correlate reasonably well with external tests. They are good enough to show around ID271 things continued to get worse, and from ID 287 things started to get better. External Elo tests agree with this overall trend.
I'm aware that ID237 vs ID303 doesn't match up with the graph, but I don't think that invalidates my point. I expect the new nets will surpass ID237 in all Elo tests soon.
I certainly look forward to it. I have isolated a number of extreme misevaluations in king safety and closed positions (as extreme as +6 when it is more like 0.00) in NN303, which are truly crippling. I keep these and others in a small database for testing.
Do you have a link where we can see your results? Feel free to add a link to your data in the wiki FAQ here!
Actually the most recent tunings were done with gauntlets of other engines that it will face soon at TCEC, and they had a similar result to the self-play tuning. And I can easily provide the counterexample of SF being tuned with self-play and wiping the floor with the competition. But it is a valid point, and worth testing self-play against other methods...
@killerducky I posted them in the forum here:
https://groups.google.com/forum/#!topic/lczero/c1A5ioOv1K4
I also added the NN237 results in the last post of that thread. I can provide games and breakdown of results of course. FENs for positions, you name it.
@killerducky - I have been following this issue very closely since it began, so I am of course well aware of the Elo graph, the wiki summarizing the issue, as well as many discussions about it and my own investigations.
Of course the Elo graph would continue to go down after the (rule50 and all-1s) bugs were fixed. The training window still had bad data. Furthermore, once the training pipeline was filled with data from v10 engines, it makes sense that the Elo graph might continue to slide for a while as the network needed to adjust to the corrected data from the v10 engine. E.g., there are plenty of examples where a large rule50 input hurts evaluation by a correct (v10) engine. So yes, fixing a bug like this can have a short-term negative impact on network strength.
As to asking me for data showing that the primary issues were the bugs... that's simple. We know that engines prior to v10 were generating incorrect policy and target output. We can also see that evaluations with all-ones and rule50 turned off vs. on diverged more and more throughout the regression period (there are a bunch of people reporting results like this; I can prepare some graphs to prove it). The data generated from self-play is what is used as the policy and value targets during training. Therefore, yes, I think we can be confident the bugs were the primary source of the issue.
Furthermore, there's the question of why this wasn't seen sooner, if it was due to the all-ones/rule50 bugs (which had been in the engines for a while). My assumption is that, as a result of regularization, when there were no bugs the network learned to generalize (mostly) correctly to the case where those inputs weren't provided. So the older bugged engines were still able to generate mostly decent training data (but not great; I expect that if there had been no bugs, the Elo graph would have continued to improve very fast after the 10x128 to 15x192 change). This is evident in the divergence between network evaluations with rule50/all-ones on vs. off noted in the paragraph above: older networks did better in buggy engines. But it was inevitable that the network would eventually begin to learn and over-learn these features as the engine kept ignoring them during self-play while training on them. A snowballing performance degradation is the obvious result, and it is what the Elo graph shows.
If we created training data by doing self-play with networks whose inputs were all set to 0, do you think that would cause issues? Extrapolating from this: yes, data generated from networks that have faulty input is bad data.
Now I believe it would be in the project's best interest to move on from the bugs by understanding also that PUCT tunings that happened with buggy engines and their related buggy networks are bad data as well. Those tuning values should be ignored; they are no longer meaningful.
I hear you are interpreting the data differently than me. The v10 bug fixes were in place for 3 days, enough to fill the entire window. The value_loss_weight showed an instant improvement. The rule50/all1s/puct/fpu theory makes this instant change a coincidence, so I think that's less likely than the oversampling theory.
The bugs of rule50/all1s were in for a long time. But the problems started soon after v8. My theory is the puct change in v8 exacerbated the long standing issue of oversampling.
I think my theory (oversampling) explains most of the data we see. Other theories are possible, but IMO are less likely.
Do you see any evidence against the oversampling theory?
I also hear you asking to retune PUCT now that we have less buggy code and better nets. I agree with this part, and people are doing it. So far the new tests show the current value is still good. This issue was opened proposing a different method to tune PUCT. It's harder to prove or disprove how good that proposal is because testing it is much more expensive.
Edit: I was just skimming chat and I saw someone posting some self-play(? or maybe it was vs other engines) results suggesting we should lower fpu_reduction. Again I totally agree we should probably retune PUCT, fpu_reduction, and others using self-play.
A point about reducing the value loss weight... It's well known that reducing the effective learning rate (after a network has plateaued) can result in a sudden and sharp decrease in loss, increase in accuracy/strength, etc., as evidenced by the AlphaZero paper as well as many other places. But that happens as a result of reducing bias in favor of variance, i.e., fitting more to the data, not less... And a reduction in loss weighting is essentially the same as a reduction in learning rate.
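To make that claim concrete, the training loss being discussed is roughly a weighted sum of value, policy, and regularization terms; scaling down the value weight shrinks the value-head gradient contribution much like a smaller learning rate on that head would. A schematic sketch (names are illustrative, not the project's actual training code):

```python
def total_loss(value_loss, policy_loss, reg_loss,
               value_weight=1.0, reg_weight=1e-4):
    # Lowering value_weight scales down the value-head gradients,
    # which is why it behaves like a per-head learning-rate reduction
    # while also raising the *relative* weight of the reg term.
    return value_weight * value_loss + policy_loss + reg_weight * reg_loss
```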
I understand that this idea is countered by the argument that the reduction in value loss weighting causes a relative increase in the reg term weighting, and that increase is obviously driving down the reg term (as evidenced in the TB graphs), thus less fitting and more generalization... But decay in learning rate alone can cause regularization loss to decline, along with overall loss. It doesn't necessarily mean that overfitting has been reduced; it just means that the model is optimizing its loss function on the dataset. E.g., see the graphs posted here: https://stats.stackexchange.com/questions/336080/when-training-a-cnn-why-does-l2-loss-increase-instead-of-decrease-during-traini?rq=1 (this was the quickest example I could find).
The proper way to measure overfitting is to compare performance on the training set against a completely separate validation set that is not used during training. I am unaware of any such metrics here, because test/train data is mixed (as stated in another issue).
Anyway, yes, it's clear that the reduction in value loss weight is related to the sudden increase in Elo. But it's not clear whether the reduction would have had any effect if the training pipeline hadn't also been cleaned up by churning through the rest of the buggy engine data. And it's not clear that this has anything at all to do with a reduction in overfitting. And it seems very very unlikely that "overfitting" alone would cause such a sudden and dramatic decline, without buggy data. Finally, the whole Elo graph is suspect, as it was created with buggy engines.
So to answer your question about evidence: first, I don't think there's any real evidence to suggest that the sudden drop in Elo had anything to do with "overfitting", as opposed to simply "fitting" to bad data. Second, yes, I am trying to build more substantial evidence for the specific interactions between bugs, metaparameter tuning, etc., but working out the best ways to measure things properly, along with the timeline of all the changes and how they may have interacted, is taking time.
And it seems very very unlikely that "overfitting" alone would cause such a sudden and dramatic decline, without buggy data.
What is your theory about what did cause the sudden decline?
Anyway, my gripe with the word "overfitting" is I feel it may be leading decisions in this project astray, when right now it's most important that network recover from issues - and I don't think reduction in value-loss-weighting is the solution to those issues... But I do agree with what's been said about things like oversampling, and a motivation to be more in line with AlphaZero's methodology.
My theory is that there was continued divergence of network output between the network with proper rule50 and all-ones plane input and the network without - that is, a divergence between the way things were being trained, and the way training data was being created. As stated earlier, I think if the network was trained on only good data, then because of regularization it would output mostly correct results with those two planes turned off. Once the bugs were introduced, it took time for divergence to happen, but once it started, it makes sense that it would accelerate in the self-play=>training data=>training feedback loop.
Perhaps this divergence became much more pronounced when the PUCT and FPU changes were implemented, causing more and more divergence until it got to a point that it was overlearning effects from the rule50 and all-1s plane to such a degree that it hurt any positive effects of self-play training.
Since the current CCLS gauntlet shows that tactics were in fact recovered with the current PUCT value, I revised my opinion somewhat. A simple tuning for strength at high visit counts now seems fine as a solution for PUCT.
I have no idea how they test, but right now NN314 is getting slaughtered by Spike 1.4 in the same test that NN237 lost by only 50-60 Elo. To be honest, the only rating lists I trust, because of their rigorous testing methods and settings, are CCRL, CEGT, and Ingo's gauntlets of 3000 games.
So yes, that means all other results carry little or no weight with me unless I am convinced they were conducted under equally rigorous methods: identical executables, openings, and of course computing power. That last point is particularly crucial when testing and comparing Leela versions.
I posted detailed tactics results using a corrected and revised WAC file: https://groups.google.com/forum/#!topic/lczero/rrj71g6aFJw
KillerDucky wrote
If we cannot tune based on self-play results, then we would need to tune based on ~1-week experiments with hyper-parameters. Maybe we're really in that situation
I think we really are in that situation, because so many resources have gone into our pipeline that we should not risk messing it up based on out-of-context match-play test results (or attempts to map DeepMind's words onto our situation). If a change to our precious pipeline is worth implementing, it is worth testing in as valid a context as possible. Moreover, 5500 Euro was raised for new hardware, and LC0 shows great potential, so I imagine that less than one week might be needed.
Currently over 400 clients are feeding self-play games into the same pipeline; I think our project would be more robust if we were to use part of this resource for "experimental mini-pipelines" instead.
I have found it hard to accept that oversampling is mainly to blame, largely because the situation has been so confusing: different engine versions with different features, bug fixes, or parameter values have been used (sometimes concurrently), while various training parameters have also been changed occasionally.
Disclaimer: I do not write code, so I lack insight into how much work it would be for Someone to implement what I am suggesting.
Actually, something new came up today in the Discord channel. A user ran CLOP to optimize settings for LC0, starting from LCZero's default settings. He came up with this:
Of note is that, although tuned for higher NPS and a GTX 1060, the optimal values were PUCT 2.8 and FPU -0.08.
How can I know which y-axis values go with which labels? What is the x-axis?
There are numbers and colors. To keep it all in one graph, values were placed on each side to cover all three parameters: the left axis goes up to 3.0 and applies to both the slowmover and cPUCT values, while the small values on the right are for the FPU reduction. cPUCT is stable at around 2.8, slowmover is at roughly 2.75, and FPU reduction is about -0.08.
I'm currently tuning parameters for LC0. I didn't have experience with CLOP before, but what I found is that CLOP often picks a local maximum at the beginning of a tuning session and in that case never finds the global maximum. I would suggest tuning the parameters one by one first, and also making sure the winrate stays at 0 for the first 10-20 iterations (by choosing a strong enough gauntlet opponent engine) while CLOP is testing parameter values across the complete bounds range, so that the first win/draw is found at the global maximum. The results above look like the tuning is not finished yet, either.
Edit: 10-20 iterations at winrate 0 is too much; let's say at least 8-10. Several restarts can also be considered, to check whether the first wins/draws land in roughly the same parameter range each time. I don't have results for longer time controls yet, but for very short time control it looks like Puct=1.7, FPU=0.1 and ScaleThinkingTime=2.9 are the optimum values. I will post results as soon as they are complete.
@zz4032 The problem, as mentioned in the Discord discussion, is two-fold:
The point of my posting was not that I or anybody else should prefer short-time-control tuning over long time control. I was pointing out a problem that can arise when parameter values differ that much from the standard values, as PUCT does in the example above, and I was really addressing the person who created the graph, or anyone else planning to try CLOP tuning. In several tuning sessions I ran for "Puct" alone (a single parameter, which is difficult enough), half of the time CLOP converged to ~1.5 for LCZero (which would be 3.0 for LC0). Direct match results confirmed that this is a local maximum with far inferior performance.
You're acting like very short time control is not something worth talking about. Top GMs might never have heard of Fishtest, but I think you have. Also, trends in parameter changes can be investigated by doing several tests at increasing time controls and comparing results. Optimum parameter values don't suddenly change by a huge amount at different TCs.
"Optimum parameter values don't suddenly change by a huge amount at different TCs." Actually, they do here. Also, the FPU and PUCT values have similar effects and cannot really be tuned independently. There is a direct synergy.
There are strong indications that optimal values for PUCT and FPU reduction are dependent on node count. For instance, ELF OpenGo was tuned for PUCT 0.4, while Leela Zero always used 0.8. It seems that higher node counts work better with a low PUCT than lower node counts, this is likely the same for Leela Chess.
Except that tunings tend to find higher PUCT for longer TC, and LC0 even higher...
There is another theory that PUCT should change depending on the tree size, and thus you should use a different puct deeper into your tree and later on in your time control. A static PUCT value is likely not optimal.
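One concrete form of that idea is the visit-dependent schedule from the AlphaZero paper (later adopted in Lc0), where c_puct grows slowly with the parent's visit count; the sketch below is for illustration only and uses the paper's published constants, not values tuned for this project.

```python
import math

def cpuct(parent_visits, c_init=1.25, c_base=19652):
    # Exploration widens logarithmically as the subtree grows, rather than
    # using a single static PUCT value for every node and time control.
    return c_init + math.log((parent_visits + c_base + 1) / c_base)
```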
Tuning #1: 12s + 0.05s/move (~2000 nodes/move)
Tuning #2: 48s + 0.2s/move (~5000 nodes/move)
Tuning #3: 192s + 0.8s/move (~25000 nodes/move)
Tuned values were still fluctuating a lot and results should be taken with caution, extrapolations even more. Nevertheless: results from the tuning at ~15s/game were excluded for the reasons mentioned below.
My conclusions for LC0:
1) The tuned Cpuct parameter increases with TC (as in the previous LCZero tuning); recommended change: 1.2 -> 2.0.
2) The tuned FPU parameter decreases with TC (as in the previous LCZero tuning); the standard value of 0.2 looks fine.
3) The tuned ScaleThinkingTime (SlowMover) increases with TC. The standard value was changed from 1.5 to 2.2 in the meantime, which looks fine.
4) Tuning run #1 at ~15s/game TC (~2000 nodes/move) shows much lower values for FPU reduction than expected, which doesn't fit the trendline. Tuning convergence was also reached unusually late -> influence of the "damaged" network value head at ~2000 nodes/move is possible? Tuning LCZero v0.10 Id251 at this TC (even lower nps) didn't show any inconsistent results. Tuning #1 was partly repeated as a test and showed very similar results a second time.
EDIT: I think I have identified the reason for the unstable tuning progression: time losses caused by ScaleThinkingTime.
Trend lines from the LCZero v0.10 Id251 tuning, for comparison. Once more, an important remark about CLOP tuning: as mentioned in my postings above, I observed CLOP sometimes falling into local optima (proven to perform much worse in direct matches against the standard parameter values). It happened a few times at ~15s/game, for Cpuct (converging immediately to about 2.8) and Slowmover (about 3.0). That doesn't mean those values are wrong at long TC, but their performance should be investigated in matches with a reasonable number of games and error bars.
@zz4032 I think your CLOP run was buggy. I had a similar result at first, consistently so, with wins, draws and losses, and even the same values of roughly 2.2 and 0.2. What tipped me off was that I was getting far too many games in the time allotted, and the losses were outweighing the wins by disproportionate numbers, even after many hours. I didn't pay too much attention to this at first, figuring it was the settings being experimented with, but it was too much strangeness for my taste, and I then saw in the task manager that the GPU was not being used at all. I spoke with a few old pros who use CLOP, and they told me the safest approach was to ditch this setup and configure a proper engines.json file, which is what I did. After a new run the results started making sense again, and GPU usage was back to 97-98%. Also, with the exact same engines (LC0 and Wasp 3.0), instead of -80 Elo the result is now positive, and I am getting half the games I did before, which is also a healthy sign.
Edit - one last tip: if you don't add the order command in the .py file, such as order=random, it will repeat the same opening no matter how large the opening suite is. You can double-check how everything is going with the -pgnoutput option (filename and location), and then review the games with depths and evals.
Thanks, but I didn't notice low GPU usage in my tuning runs, and calculation time per move looks fine. I had order=random set for the opening file as well; it looks like CLOP restarts cutechess for each game, which is probably why, with sequential book position order, it always starts from the same (first) position. What I did notice, though, is many losses on time, even some in the 48s+0.2s/move tuning. It also happens in cutechess matches without CLOP. I'm going to post an issue about this in the next few days.
Yes, I had many losses on time before using the .json file, which is why I threw out the results, as they could no longer be considered valid. I cannot say why this would make a difference, mind you. I'm adding my .json file here in case you want it as an example to work with.
Place it in the CLOP folder to use, and the cutechess folder if you want to test with cutechess-cli. Here is what the .py lines look like:
engine = 'conf=lc0-may22'
engine_param_cmd = 'setoption name {name} value {value}'
opponents = [ 'conf=Wasp' ]
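For completeness, the rest of a typical CLOP-to-cutechess connector script looks roughly like the sketch below; the `options` variable name and the specific values are assumptions based on common versions of that script (they are not from this thread), so adjust them to your copy. The `order=random` and PGN-output parts correspond to the tips above.

```python
# Assumed continuation of the connector script (check against your copy):
# these are standard cutechess-cli arguments passed through to each game.
options = ('-each tc=60+0.6 '
           '-openings file=openings.pgn format=pgn order=random '
           '-pgnout clop_games.pgn')
```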
It turned out the reason for the time losses was not the ScaleThinkingTime (Slowmover) parameter, but simply LC0 losing on time at time controls with small increments, missing the clock by a few milliseconds. A recent commit adding move overhead solved the issue completely (MoveTimeOverhead=10). In the meantime I reran the tuning at ~15s/game and got basically the same results.
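If anyone wants to reproduce this through cutechess-cli, UCI options can be passed per engine with its `option.<name>=<value>` syntax. The line below just extends the `engine` line shown earlier with the option name mentioned above, and assumes the connector passes that string straight to `-engine`; treat it as a sketch, not a verified setup.

```python
# Illustrative only: same engine line as above, plus the overhead option.
engine = 'conf=lc0-may22 option.MoveTimeOverhead=10'
```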
GCP published data, which I do not doubt, showing that in head-to-head matches a smaller PUCT value, searching fewer moves but deeper, led to better results. The problem is that this will tend to reinforce its current choices, as opposed to encouraging it to explore moves it might find surprising, such as... tactics.
I'd like to suggest that the PUCT value actually be increased, to allow it to test out moves and situations it has not mastered and learn from them. I do not think seeing fewer moves, but deeper, is the ideal way to evolve and learn.