lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

KataGo parameter tuning #508

Open lightvector opened 3 years ago

lightvector commented 3 years ago

Just opening a thread to have this discussion so that it doesn't get buried and lost in comments on commits:

@sbbdms

Hi! Thanks very much for accepting my conclusion from the previous test!

I noticed that you also changed the default values of cpuctExploration and cpuctExplorationLog. Could you disclose how much Elo is gained from this change? I started testing some other pairs of parameters half a month ago, with a procedure similar to the previous test but with 1000 playouts per move. So far I have finished testing one pair of parameters, and found a pair which seems to be about 25 Elo better than the original one. However, I am using the previous cpuctExploration and cpuctExplorationLog as defaults. The newly changed default values caught me by surprise. I don't know how much the newly changed default values gain, or whether I should give up my new test and restart with the new default values, or just keep going...

Thanks to you too for helping test things out!

The change to cpuctExploration and cpuctExplorationLog was due to some recent experimentation by @fuhaoda. It seems that after the uncertainty parameter changes, the optimal absolute level of cpuct needed to be retuned as well. It is probably a gain of a few Elo, likely less than 10:

[Image: table of Elo values for configurations of cpuctUtilityStdevPrior and cpuctExploration at different numbers of playouts per move]

These are Elo values for configurations of cpuctUtilityStdevPrior and cpuctExploration with different numbers of playouts per move. Each cell contains a few thousand games.

At least according to this test, we can see that the change to increase cpuctUtilityStdevPrior to 0.40 also means that cpuctExploration should rise a little. Each increment of cpuctUtilityStdevPrior, from 0.25 to 0.32 to 0.40, shifts the optimal cpuctExploration to the right: from somewhere between 0.675 and 0.9, then to 0.9, then to somewhere between 0.9 and 1.125. In this test, cpuctUtilityStdevPrior 0.32 was actually a little better than 0.40 at 200 and 600 playouts, but that may partly be noise, and at 1800 playouts it was not clearly better. Combined with your testing, I chose to go with the cpuctUtilityStdevPrior 0.4 that you concluded with, and to adjust cpuctExploration upward, which this data pretty clearly suggests should be a bit better when using cpuctUtilityStdevPrior 0.4.
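To make the pairing concrete, here is a minimal config sketch of the two knobs being discussed. The cpuctExploration value is only an illustrative pick inside the 0.9-1.125 range suggested by the table; the exact released defaults are not restated in this thread.

    # Higher prior on utility standard deviation (the 0.40 you concluded with).
    cpuctUtilityStdevPrior = 0.40
    # With the higher stdev prior, the optimal exploration constant shifts upward.
    # Illustrative value between the tested grid points 0.9 and 1.125; not necessarily the released default.
    cpuctExploration = 1.0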

If you have more suggestions for parameter improvements, with testing data to support them, I'm always open to hearing them! 25 Elo is a lot, and if it is robust, it probably would not be affected too heavily by a change like this that likely only optimizes a few Elo.

Whether it is worth starting a new test from this baseline right away depends on what parameters you're testing; some of them are expected to behave fairly independently while some are probably correlated. And of course, regardless of whether you restart, no data is wasted - even with different baseline parameters, evidence that a parameter change is good still helps us understand the landscape of parameter effectiveness.

sbbdms commented 3 years ago

Thanks for your plot and the extra test!

I am a bit surprised that my previous conclusion from https://github.com/lightvector/KataGo/pull/449 is slightly different from the recent experiment:

There were two parts of my test that compared cpuctUtilityStdevPrior = 0.30 and cpuctUtilityStdevPrior = 0.40. When I used 500 playouts per move and set cpuctExploration = 0.9 (the previous default value), both comparisons showed that 0.40 was better.

However, according to the recent experiment, when using 600 playouts per move and setting cpuctExploration = 0.9, the Elo is 22 when cpuctUtilityStdevPrior = 0.40, and 35 when cpuctUtilityStdevPrior = 0.32, which is very close to 0.30. So in this experiment, 0.40 could well be worse than 0.30 at 600 playouts per move (though it may be better when playouts per move and cpuctExploration increase).

In my opinion, there are some differences between my previous test and the recent one. They might be the reasons for the different conclusions:

(1) The number and strength of opponents in my previous test were quite different. I mentioned above that there were two parts in which cpuctUtilityStdevPrior = 0.40 came out better. In one part, I rated 48 bots with different cpuctUtilityStdevPrior and cpuctUtilityStdevScale (also, all of these bots used uncertaintyExponent = 0.90 and uncertaintyCoeff = 0.40, which placed 1st in the earlier stage of my test, but is different from the 1.00 + 0.25 that we are currently using as defaults; see the config sketch after point (4)). In the other part, I compared cpuctUtilityStdevPrior = 0.30 and cpuctUtilityStdevPrior = 0.40 directly, which means there were only two bots in the rating list. A different number and strength of opponents might affect the performance and Elo of cpuctUtilityStdevPrior = 0.30 and cpuctUtilityStdevPrior = 0.40.

(2) I used different older neural networks (successively b40s721, b40s785, b40s790 and b40s809 across the 4 parts in total).

(3) I had far fewer games played per bot, mostly only 400~500. In the recent experiment, each cell contains a few thousand games... (Strong!)

(4) Noise, which is always unavoidable.
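For reference, here is a minimal sketch of the uncertainty settings that differed between the two setups in point (1), using exactly the values quoted above:

    # Settings used by the 48 bots in the earlier part of my test:
    uncertaintyExponent = 0.90
    uncertaintyCoeff = 0.40
    # Current defaults mentioned above, for comparison:
    # uncertaintyExponent = 1.00
    # uncertaintyCoeff = 0.25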

Because each cell in the recent experiment contains a few thousand games, its conclusion should be more robust and convincing. So please regard my conclusion as a reference only, and go with the results of the more convincing tests!

However, I may continue my new test with the previous default values of cpuctExploration and cpuctExplorationLog, because the 25 Elo gain my recent results show is indeed attractive. I will test whether there are better values for cpuctExploration and cpuctExplorationLog afterwards; however, I am not sure whether this Elo gain depends on their previous default values, so I have decided not to change them immediately while testing other pairs of parameters.

I hope that my new Elo gain is not just noise. If I have a good result after all my new tests, I will upload the data here for you to validate. Thanks again!

sbbdms commented 3 years ago

@lightvector

Hi!

I am uploading the SGFs, logs and descriptions of my latest custom test. Because the file is too large, it had to be split into 4 parts, whose suffixes are .zip, .z01, .z02 and .z03. GitHub doesn't support uploading files with .z01, .z02 or .z03 suffixes, so I manually added ".zip" after those suffixes; please delete it before you unzip them together.

Please point out any mistakes in my test! Also, I am looking for more assessment of my final set of params (especially its overall strength with different strong networks). If its overall strength proves to be weaker in your further assessment, then you can consider assessing the 2nd/3rd/4th or later sets of params. I hope at least one set of params from the test helps!

KataGo params testing (202106-202110).z01.zip KataGo params testing (202106-202110).z02.zip KataGo params testing (202106-202110).z03.zip KataGo params testing (202106-202110).zip

Harder-Run commented 3 years ago

@sbbdms May I ask why some SGFs you uploaded cannot be opened?

sbbdms commented 3 years ago

@dergo853

Did you manually delete the ".zip" suffix from the ".z01.zip", ".z02.zip", ".z03.zip" files?

Because the file is too large, it had to be split into 4 parts, whose suffixes are .zip, .z01, .z02 and .z03. GitHub doesn't support uploading files with .z01, .z02 or .z03 suffixes, so I manually added ".zip" after those suffixes; please delete it before you unzip them together.

If you followed the instructions above and still cannot open some of the SGF files, could you please point out the exact files or directories? I will check whether they are corrupted; however, as far as I remember, there were no exceptions during the matches.

Harder-Run commented 3 years ago

Thanks. I forgot to delete ".zip", now it's OK.

lightvector commented 3 years ago

@sbbdms - Sorry for the long delay in responding to this - I've finally taken a look now and this looks like a very useful set of results! I can definitely believe that subtree value bias factor should be increased. I think this can indeed form the basis of some further experimentation for setting the parameters in a next release. Although the next release may not be particularly soon, I will definitely take your findings into account, just as on an earlier release where we did the same. Thank you!

Thank you also for the comparison with 40b and 60b at the end of step 3 - I agree that switching to 60b seems about time since they are similar. On KataGo training (https://katagotraining.org/) we have been running with a lower 40b learning rate for a little more than a month now, to get a final boost in quality for 40b before we switch to 60b. Unfortunately, since data generation rates are fairly slow, we might stay on this for a month or two more, but we should switch eventually.

Interestingly enough, we did appear to get a strength boost from dropping the LR for 40b, which probably means that 60b is now "behind" 40b by a little. But I don't think this should stop us from switching; we can just run long enough to benefit from the better data for a training window's worth or two, then switch anyway and see what happens.

sbbdms commented 2 years ago

@lightvector

Hi!

Could I ask about your assessment of these pairs of params from the results above?

I just noticed that you have merged the graphsearch branch into master; it seems the new release is nearly ready. However, the params in the configuration files are still the original ones...

lightvector commented 2 years ago

@sbbdms - Yes, there will be a followup change shortly adjusting the defaults.

I would prefer to leave the adjustment to cpuct and the change to LCB undone, because the effect is relatively small and leaving these values slightly higher at their defaults has some benefits for reducing blind spots and encouraging exploration in games outside of self-play, where the positions may be less familiar. In particular, if minVisitPropForLCB is too small, then in more unusual positions you often see KataGo erroneously favor bad moves when too few visits make them appear good.

However, I'm pretty sure we will be adopting the change to subtree value bias from your test. Thanks for the detailed data! It looks good and I can confirm the value works fine in independent tests.

lightvector commented 2 years ago

v1.11.0 is released with the parameter changes for subtree value bias! Thanks!

sbbdms commented 2 years ago

@lightvector

Thank you very much for accepting the subtreeValueBias params as defaults! I may start a new round of testing soon, where another 3 pairs of params will be tested, similar to the previous test. However, I have 2 questions:

(1) Personally, I still don't want to discard the previous LCB and cpuct params when I am running the custom bot. If their effect is relatively small, would you mind if I set these params as defaults in the new test?

(2) Is there any param you think has relatively high potential to be improved, which you would recommend testing?

Thanks!

lightvector commented 2 years ago

Thank you too!

  1. There should be absolutely no problem setting the LCB params as a default for your tests, and I would not expect them to have significant interactions with other parameters, so most results you find with them are likely to hold even for other LCB parameter settings.

The cpuct params may affect the optimal values of other parameters slightly more than the LCB params do. It may be that if you do a lot of parameter tuning with a different cpuct, the optimal values of all the other parameters will end up shifting a bit too. However, the effect is probably not too large for only a 0.05 difference in cpuct, so if you are determined to use your value of cpuct, feel free. I would just recommend that in any final rounds of testing you also include the "base" v1.11.0 configuration as a comparison, so that we can get additional data to compare.

  2. For promising parameters to test, remind me again - do you happen to have the ability to compile a custom version of KataGo on your own? I just now pushed a branch https://github.com/lightvector/KataGo/tree/parentweightbypolicystdev that may interest you. It adds a simple change to make KataGo prefer to explore branches where the raw policy appears to be more wrong, by downweighting the amount that each visit counts, so that the search performs more visits on that branch to produce the same weight.

The experimental parameters that control this feature are:

reduceWeightByPolicyUtilityStdev - should be a number from 0 to maybe around 10 or so. This controls how many visits the reduction is considered to be worth. If it is 0, then it is equivalent to entirely disabling this feature.

reduceWeightByPolicyUtilityStdevBase - should be a number from 0 to maybe around 0.3. This controls the minimum standard deviation of utility needed to trigger this behavior. It's possible that the optimal value is 0, and if that were the case, it would be really nice because then the code would be more elegant.

reduceWeightByPolicyUtilityStdevRate - should be a number probably between 0.2 and 0.8. If you set it all the way down to 0, then it is also equivalent to entirely disabling this feature.

Anyways, if you are interested, and able to compile this branch, this looks like an interesting thing to test.
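If it helps, here is a minimal sketch of what the relevant config lines might look like when trying out this branch. The values are only illustrative picks from the ranges described above, not tuned recommendations or the branch's defaults.

    # Experimental parameters on the parentweightbypolicystdev branch (illustrative values only).
    # How many visits the weight reduction is considered worth; 0 disables the feature (range roughly 0-10).
    reduceWeightByPolicyUtilityStdev = 3.0
    # Minimum utility stdev needed to trigger the behavior; possibly optimal at 0 (range roughly 0-0.3).
    reduceWeightByPolicyUtilityStdevBase = 0.1
    # Probably between 0.2 and 0.8; setting it to 0 also disables the feature.
    reduceWeightByPolicyUtilityStdevRate = 0.5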

Another thing that has not been tested much is how different values of staticScoreUtility and dynamicScoreUtility affect playing strength in even games. These two values are known to be absolutely critical for good handicap game play, but I don't know of good recent tests of their effect on even games or analysis quality.

sbbdms commented 2 years ago

Hi!

(1) Thanks! I am going to use the LCB and cpuct params from the previous test as defaults. After new params come out, I will compare them with the default LCB and cpuct params in v1.11.0, in "Step 3 (Assessment to the winner param)".

(2) Thanks for the explanation of the new branch as well! I am able to compile the code and build a custom version of KataGo; however, currently I prefer to test params in the released version...

IMO a new branch is more likely to be modified a lot, e.g. fixing bugs in the code or modifying/adding new algorithms. Either of those could make the generated games invalid, so the test would need to be restarted.

Since the duration of a complete round of testing is quite long here (the previous test took about 4 months), I would like to test params in the latest released version, which is stable.

However I will surely pay attention to the new branch :)

staticScoreUtility and dynamicScoreUtility seem to be a good choice for the test; I will add them to the plan.

(However, I am a bit surprised by your description that these params are absolutely critical for good handicap game play, but uncertain for even games. In my memory, you used to recommend a much higher value of dynamicScoreUtility in handicap games: https://github.com/lightvector/KataGo/issues/417)

lightvector commented 2 years ago

IMO a new branch is more likely to be modified a lot, e.g. fixing bugs in the code or modifying/adding new algorithms. Either of those could make the generated games invalid, so the test would need to be restarted.

Heh, no problem. Note that a new branch is also where testing is much more valuable, since rather than only adjusting parameters, the results of tests can influence the entire choice of algorithm. The only reason the branch can be modified, accepted, or discarded is because of testing it! But if your timetable for testing is 4 months, then I agree it makes more sense to leave the testing of experimental features to other people.

However, I am a bit surprised by your description that these params are absolutely critical for good handicap game play, but uncertain for even games.

What makes you surprised? Basically, what I mean is that if you didn't have these parameters at all (e.g. you set them to 0), handicap game play quality would almost surely suffer a lot. These parameters are what control KataGo's desire to increase the score, rather than just win or lose, and in handicap games, if you are behind by 6 stones, every move is essentially 100% losing, so you need to focus first on increasing the score in order to catch up, before you can have any chance to win.

But as I said, it is relatively un-tested how these parameters affect even game strength!

sbbdms commented 2 years ago

@lightvector Hi!

Thanks to the TensorRT version, running match games is about 33% faster than last year's match. Now the new match is nearly finished. However, the result is a bit confusing and falls quite short of my expectations. I may have to check what could be wrong in the settings...

My main question: the default value of the param "graphSearchRepBound" is currently 11. If it were lower, like 4-6, or higher, like 17-19, what would theoretically happen, and how would the strength change according to your tests?

lightvector commented 2 years ago

graphSearchRepBound controls the maximum length of a superko cycle that we check for in the algorithm.

If you set graphSearchRepBound higher, it will do absolutely nothing except waste computation power, because superko cycles of length 11 are already extremely rare. If you set it smaller, you will mildly improve the performance of graph search, but setting it too small will cause KataGo to potentially make rare mistakes in ko or superko situations that involve cycles of that length - mistakes that will not be fixed even with infinitely many visits, because by ignoring those cycles you are forcing a flaw into the algorithm itself.
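For reference, the tradeoff above summarized as a config sketch (11 is simply the default value mentioned earlier, shown here unchanged):

    # Maximum superko cycle length that graph search checks for.
    # Higher: no benefit, only wasted computation, since longer cycles are extremely rare.
    # Lower: slightly faster graph search, but risks rare, unfixable mistakes in long ko/superko cycles.
    graphSearchRepBound = 11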