v1.8.2 Contribute Slower Than v1.8.1?

EZonGH commented 3 years ago

The new v1.8.2 seems to run contribute much slower than 1.8.1 on my machine. I am running it on Windows 10 with an Nvidia GTX 1650 SUPER (no tensor cores, no FP16) using the OpenCL version. Moving from 1.8.1 to 1.8.2, I am seeing -33% on plays per second, -30% on nn evals per second, and +61% on seconds per game. These figures come from the log files.

EZonGH commented 3 years ago

When I run the CUDA version on Google Colab, there does not seem to be any significant difference between 1.8.1 and 1.8.2.

lightvector commented 3 years ago

Thanks! Were there rating games being played at that time in one case but not in the other? Rating games will generally result in a slowdown of the overall processing speed, because now your machine is running two or three different nets instead of just one, so it is possible that both are the same speed but one of them was measured in the presence of rating games while the other was not.

If necessary, you can specify an undocumented parameter maxRatingMatches = 0 in your .cfg file to disable them while you try to benchmark.

Even though rating games are costly, they are of course still useful and important in the long run, both to track KataGo's progress and to detect whether at any point a problem causes the quality of training to drop severely.
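For anyone wanting to try the maxRatingMatches = 0 suggestion without touching their normal config, here is a minimal Python sketch that writes a separate benchmarking copy. It assumes the .cfg uses plain `key = value` lines; the helper names and the file paths are hypothetical and not part of KataGo.

```python
import re
from pathlib import Path

# Matches an existing, uncommented maxRatingMatches line (assumed syntax).
_RATING_RE = re.compile(r"^[ \t]*maxRatingMatches[ \t]*=.*$", re.MULTILINE)

def with_rating_games_disabled(cfg_text: str) -> str:
    """Return cfg_text with maxRatingMatches forced to 0.

    An existing uncommented line is replaced; otherwise a new line
    is appended at the end of the file.
    """
    line = "maxRatingMatches = 0"
    if _RATING_RE.search(cfg_text):
        return _RATING_RE.sub(line, cfg_text)
    sep = "" if cfg_text.endswith("\n") or not cfg_text else "\n"
    return cfg_text + sep + line + "\n"

def write_benchmark_cfg(src: str, dst: str) -> None:
    # Hypothetical paths: src is the usual contribute config,
    # dst is a copy used only while benchmarking.
    Path(dst).write_text(with_rating_games_disabled(Path(src).read_text()))
```

Keeping the override in a separate copy makes it easy to return to the normal config, with rating games enabled, once the measurement is done.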

cryptsport commented 3 years ago

I see a decrease of 10-20% in the eigenavx2 version and a 2.5x slowdown in the OpenCL version. I am running it in Sabaki, with two 20b networks playing each other. The default_gtp.cfg files are the same, with 10 playouts.

cryptsport commented 3 years ago

I don't know whether maxRatingMatches = 0 should have any effect when playing, but I tried it and nothing changed.

lightvector commented 3 years ago

Thanks, I'll investigate tonight. There should be no changes at all that affect gtp, so maybe something changed with compiler settings.

lightvector commented 3 years ago

I'm observing no difference, which is what I would have expected prior to this issue, since 1.8.2 does not touch any of the search or engine code or neural net code - there are zero changes to that logic since 1.8.1.

On Linux, running the OpenCL version on a machine with a V100, I get around 1700 visits/sec (40 block network), regardless of 1.8.1 or 1.8.2. On Windows, running the OpenCL version on my laptop which has a weak Intel HD Graphics, I'm getting around 65 visits/sec (20 block network), regardless of 1.8.1 or 1.8.2.

So... @EZonGH @cryptsport - can you try running the benchmark command in a command line window as described here, and report what visits/s you get on each version?

@EZonGH - would you be able to check or reverify whether actually they were the same and your measurement was simply affected by the luck of one of them having a set of rating games going at the time while the other one didn't?

EZonGH commented 3 years ago

I ran benchmark on both 1.8.1 and 1.8.2 with a 40b net and the difference was there. However, the cause appears to be that 1.8.1 detected that my card supports FP16 storage while 1.8.2 did not. I copied the ...x19_y19_c256_mv10.txt tuning file from 1.8.1 to 1.8.2 and the difference disappeared.

I checked all the tuning files on my machine:

version 1.8.1
tune8_gpuGeForceGTX1650SUPER_x19_y19_c16_mv9.txt shouldUseFP16Storage 0
tune8_gpuGeForceGTX1650SUPER_x19_y19_c96_mv8.txt shouldUseFP16Storage 1
tune8_gpuGeForceGTX1650SUPER_x19_y19_c128_mv8.txt shouldUseFP16Storage 1
tune8_gpuGeForceGTX1650SUPER_x19_y19_c192_mv8.txt shouldUseFP16Storage 1
tune8_gpuGeForceGTX1650SUPER_x19_y19_c256_mv8.txt shouldUseFP16Storage 1
tune8_gpuGeForceGTX1650SUPER_x19_y19_c256_mv10.txt shouldUseFP16Storage 1
tune8_gpuGeForceGTX1650SUPER_x19_y19_c320_mv10.txt shouldUseFP16Storage 1

version 1.8.2
tune8_gpuGeForceGTX1650SUPER_x19_y19_c16_mv9.txt shouldUseFP16Storage 0
tune8_gpuGeForceGTX1650SUPER_x19_y19_c96_mv8.txt shouldUseFP16Storage 1
tune8_gpuGeForceGTX1650SUPER_x19_y19_c128_mv8.txt shouldUseFP16Storage 0
tune8_gpuGeForceGTX1650SUPER_x19_y19_c192_mv8.txt shouldUseFP16Storage 1
tune8_gpuGeForceGTX1650SUPER_x19_y19_c256_mv8.txt shouldUseFP16Storage 1
tune8_gpuGeForceGTX1650SUPER_x19_y19_c256_mv10.txt shouldUseFP16Storage 0
tune8_gpuGeForceGTX1650SUPER_x19_y19_c320_mv10.txt shouldUseFP16Storage 1

To my eye the inconsistent results for the 1.8.2 files look odd, but of course I don't really have a clue what they mean. :-) The result is clear though.
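A check like the one above can be scripted. The following is a minimal Python sketch that lists the shouldUseFP16Storage value for every tune*.txt file in a directory. It assumes only that the flag name is followed, on the same line or the next, by a 0 or a 1; the actual file layout may differ, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the flag name is followed (same line or next line) by 0 or 1.
FP16_RE = re.compile(r"shouldUseFP16Storage\D*?([01])\b")

def fp16_storage_flag(text: str):
    """Return True/False for the flag, or None if it is not found."""
    m = FP16_RE.search(text)
    return None if m is None else m.group(1) == "1"

def report(tuning_dir: str) -> dict:
    """Map each tune*.txt file name in tuning_dir to its flag value."""
    return {
        p.name: fp16_storage_flag(p.read_text(errors="replace"))
        for p in sorted(Path(tuning_dir).glob("tune*.txt"))
    }
```

Calling `report()` on the tuning directories of two installs and comparing the resulting dictionaries would show at a glance which files disagree.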

I cannot remember exactly what I did when I updated to 1.8.1 but I am sure that I used genconfig with 1.8.2 when I switched. In the past I always ran benchmark and hand-built my config files, but at some point in the recent past I switched to genconfig. Was that after 1.8.1, during 1.8.1, hmmm...?

Attached is the log of running benchmark on 182, then 181, then 182 again with the tuning file from 181.

Benchmark 182 then 181 with 40b net copy tuning from 181 to 182 and rerun 182.txt

lightvector commented 3 years ago

@EZonGH - Cool, that explains it. The tuning file difference or inconsistencies have absolutely nothing to do with the version - as I mentioned above, exactly zero of the engine or neural net or GPU code has any difference between 1.8.1 or 1.8.2, and all tuning files and such are 100% compatible between them. It seems that on either version, the tuning sometimes fails to detect a large enough difference between using FP16 storage or not to realize that using it is better, for your GPU. It's merely luck/unluck that you saw more instances of this when you retuned 1.8.2 (which could have just used 1.8.1's files since again, all the GPU code is identical), than you did when originally tuning 1.8.1.

Which is interesting! I'll think about how to get something more reliable. The tuning runs only a small and quick test of different configurations, otherwise it would take much longer than it already does. Presumably, the problem is that this test is small enough to give a noisy result, so on your GPU non-FP16 storage occasionally appears to do better in that quick test purely due to noise, and the tuner picks non-FP16 storage.
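The effect of a short, noisy test can be illustrated with a toy simulation. This is not KataGo's tuner: the 5% true speed gain, the 10% timing noise, and the function name are all made-up assumptions chosen only to show how often a quick comparison picks the slower option, and how averaging more timings helps.

```python
import random

def pick_fp16_rate(true_gain, noise_sd, samples, trials=2000, seed=0):
    """Fraction of simulated tuning runs that pick FP16 storage.

    Each run times both options `samples` times and keeps the one with
    the lower mean time. FP16 is truly faster by `true_gain` (relative),
    and every timing carries Gaussian noise with relative sd `noise_sd`.
    All numbers are illustrative, not measurements.
    """
    rng = random.Random(seed)
    picked = 0
    for _ in range(trials):
        fp32 = sum(rng.gauss(1.0, noise_sd) for _ in range(samples)) / samples
        fp16 = sum(rng.gauss(1.0 - true_gain, noise_sd)
                   for _ in range(samples)) / samples
        picked += fp16 < fp32
    return picked / trials

for n in (1, 4, 16):
    print(n, pick_fp16_rate(true_gain=0.05, noise_sd=0.10, samples=n))
```

In this toy model the better option wins more often as the number of timings per option grows, at the cost of a proportionally longer test.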

EZonGH commented 3 years ago

Glad it was an easy fix!

It would be nice if you could make it clear in each release whether we need to retune or not. Also whether we need to worry about changes in the structure of the config files and so on. I generally go through the whole process for each of the net sizes that I may want to use (basically all of them!), which gets a bit tedious. :-)

lightvector commented 3 years ago

The reason why no release ever mentions retuning is because if retuning is needed, there is code that will automatically disregard the old tuning files, without needing help from you. The "8" in the "tune8" in the file name is a tuning-file version, so invalidating old tuning files is as simple as bumping the version number that the code looks for from 8 to 9, in which case it will fail to find any files that start with "tune9" and generate all new ones.

The same for config files. New versions are backwards-compatible with old configs unless explicitly stated otherwise. Although here the guarantee is weaker - we can maintain compatibility, but it is not always the case that the still-compatible old config will be equally good as a newly-generated config, if the new config introduces better parameters that the old config doesn't use.

I'm not sure what fix you're thanking me for - I haven't fixed anything yet, and we're still left with the open issue of how to improve the tuning code so that on GPUs like yours, it more-reliably detects FP16 storage as good when it actually is good. :)
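The "tune8" versioning described above can be sketched in a few lines of Python. This is only an illustration of the idea, not KataGo's actual code: the file name pattern is copied from the names posted in this thread, and the function names and the meaning given to each argument are hypothetical.

```python
from pathlib import Path

TUNING_VERSION = 8  # the "8" in "tune8"; bumping it orphans older files

def tuning_file_name(version, gpu, x, y, c, mv):
    # Mirrors the names seen in this thread, e.g.
    # tune8_gpuGeForceGTX1650SUPER_x19_y19_c256_mv10.txt
    return f"tune{version}_gpu{gpu}_x{x}_y{y}_c{c}_mv{mv}.txt"

def find_tuning_file(tuning_dir, version, gpu, x, y, c, mv):
    """Return the matching file path, or None if a retune is needed."""
    path = Path(tuning_dir) / tuning_file_name(version, gpu, x, y, c, mv)
    return path if path.is_file() else None
```

With this scheme, a lookup for version 9 never matches a file saved as tune8, so old results are ignored without anyone having to delete them.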

cryptsport commented 3 years ago

I found that what is happening for me is not related to the different versions. For some unknown reason, KataGo thinks about 2 times longer when playing white!

q5go-2.1.1-win

182 black 8.7 8.2 7.8 7.0 5.9 5.5 (sec)
181 white 13.6 13.1 12.6 8.9 12.9 12.0

182 white 15.9 13.8 12.7 8.4 12.6 15.1
181 black 8.8 8.0 7.4 6.5 5.0 7.3

Sabaki (2 programs)

182 black 9.3 9.3 8.1 9.1 4.7 8.5
181 white 16.4 14.2 14.1 10.6 12.3 12.9
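The timings above can be averaged by color and by version to see which factor matters. A minimal Python sketch, using only the numbers posted above (the grouping helper is hypothetical):

```python
from collections import defaultdict
from statistics import mean

# (GUI, version, color, times in seconds) as posted above.
RUNS = [
    ("q5go", "1.8.2", "black", [8.7, 8.2, 7.8, 7.0, 5.9, 5.5]),
    ("q5go", "1.8.1", "white", [13.6, 13.1, 12.6, 8.9, 12.9, 12.0]),
    ("q5go", "1.8.2", "white", [15.9, 13.8, 12.7, 8.4, 12.6, 15.1]),
    ("q5go", "1.8.1", "black", [8.8, 8.0, 7.4, 6.5, 5.0, 7.3]),
    ("Sabaki", "1.8.2", "black", [9.3, 9.3, 8.1, 9.1, 4.7, 8.5]),
    ("Sabaki", "1.8.1", "white", [16.4, 14.2, 14.1, 10.6, 12.3, 12.9]),
]

def mean_by(key_index):
    """Average all times after grouping runs by one tuple field."""
    groups = defaultdict(list)
    for run in RUNS:
        groups[run[key_index]].extend(run[3])
    return {k: round(mean(v), 2) for k, v in groups.items()}

by_color = mean_by(2)    # group by "black" / "white"
by_version = mean_by(1)  # group by "1.8.1" / "1.8.2"
print(by_color)
print(by_version)
```

Grouping by color separates the fast and slow runs cleanly, while the per-version averages mostly reflect how often each version happened to play white.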

EZonGH commented 3 years ago

I was playing with the tuner command.

When I deleted the 40b tuning file and ran tuner 10 times in a row (deleting the file before running each time), it used FP16 storage 6 times and FP32 storage 4 times.
When I kept the 20b tuning file and reran tuner on the resulting file 6 times, it switched each time (from FP16 storage to FP32 storage and then from FP32 back to FP16).
On both the 40b and 20b experiments, the other parameters were frequently different. This was especially true for the FP32 storage results. I am not sure that any of the resulting files were exactly the same (too lazy to check all the details). For a while I thought that the FP16 storage files were always the same. But sure enough, I found some differences cropping up there as well.

Overall, the noise level seems high. At least on my machine the results seem close to random. Is it possible to write a more intensive tuning procedure? LZ has the --full-tuner command. All I know about it is that it takes a long time. :-) I don't know whether something similar would be more effective here.

lightvector commented 3 years ago

It's been a while, but as I'm preparing the next release, I'm taking a look at this again.

@EZonGH - If you're still around to test things, I wonder how the following executables behave in terms of consistency of tuning on your hardware? In particular, for each one:

What fraction of the time does it detect FP16 storage as better than plain FP32?
How long does the whole tuning take overall?

katagotuning2.zip katagotuning4.zip katagotuning8.zip

Each zip file has the executable alone. The existing DLL files and config files you have for the 1.8.2 release should work with these experimental exes. The 2, 4, and 8 refer to the value of a batch parameter used in the tuning, which might affect the accuracy of tuning but also the time required for tuning.

EZonGH commented 3 years ago

Will do!

EZonGH commented 3 years ago

@lightvector All of the below was done with a 40-block net.
I ran benchmark on the existing katago.exe to confirm nothing had changed (there was at least one NVIDIA driver update in the meantime): 5 runs, 3 FPStorage true, 2 FPStorage false, so the problem still exists.
I ran benchmark 20 times on the katagotuning2 exe file: 14 FPStorage true, 6 FPStorage false, so reject this one.
I ran benchmark 20 times on the katagotuning8 exe file: 18 FPStorage true, 2 FPStorage false, so reject this one also.
I ran benchmark 30 times on the katagotuning4 exe file: 30 FPStorage true, 0 FPStorage false, so this looks like a keeper.

I forgot to log the times on the 1.8.2 exe but I know that the first time took about 4 minutes. The katagotuning4 exe averaged about 4 minutes 10 seconds across the 30 tests so very similar. It was the fastest of the three test executables on my machine.

One caveat. All the versions create different parameters in the tuning files every time they are run. There were too many parameters for me to check in detail. But, running file compare showed that very few files are identical. I tried running contribute using a couple of randomly selected tuning files and observed that one was about 1% slower than the other (based on reported nn evals). I do not know if some of the other files would be substantially better or worse.
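To get a feel for how much the run counts above can say, here is a minimal Python sketch that puts an approximate 95% Wilson score interval around each observed FP16-selection rate. The counts are the ones reported above; the choice of the Wilson interval is an assumption made for illustration, not something used in the thread.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# (runs that chose FP16 storage, total runs) as reported above.
results = {
    "1.8.2": (3, 5),
    "katagotuning2": (14, 20),
    "katagotuning8": (18, 20),
    "katagotuning4": (30, 30),
}
for name, (k, n) in results.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n}, 95% interval {lo:.2f} to {hi:.2f}")
```

With only 5 to 30 runs per executable the intervals are wide, so small differences between builds are hard to tell apart from chance.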

EZonGH commented 3 years ago

@lightvector I am so careless that I did not realize until after posting above that I had been using tuner the last time around. I redid the katagotuning4 and katagotuning8 runs using tuner instead of benchmark just in case there was a difference (it is only 8 times faster :-) ). The results are the same: the 4 version works while the 8 version does not. Tuner runs around 30 seconds on my machine.

lightvector commented 3 years ago

Thanks!

I would have hoped that how well it works would be monotonic in the tuning batch size (so that 4 is better than 2 and 8 is better than 4), since that would be easy. The fact that you found 4 to be best probably means that we should switch to 4, but also that the result is fragile: I could easily imagine that the "optimal point" would be different for someone else with a different machine.

The other thing I'm interested in that I'm not sure I understood your stats on: how long does the tuner take (not the whole benchmark, just the tuner), for each specific version? 2, 4, 8.

Because based on your result, we should at least move to batch size 4, but the problem is if that noticeably magnifies how long the tuner takes, then that is a nontrivial cost to be aware of, since the amount of time it takes is already burdensome on some machines.

EZonGH commented 3 years ago

Time for tuner only (average of 10 runs):
katago version 1.8.2: 22 seconds
katagotuning version 2: 24 seconds
katagotuning version 4: 30 seconds
katagotuning version 8: 40 seconds
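Expressed relative to the 1.8.2 baseline, the extra tuner time for each experimental build works out as follows. This is a small Python sketch using only the averages listed above.

```python
# Average tuner-only times in seconds, as reported above.
tuner_seconds = {
    "katago 1.8.2": 22,
    "katagotuning2": 24,
    "katagotuning4": 30,
    "katagotuning8": 40,
}

baseline = tuner_seconds["katago 1.8.2"]
# Percentage increase in tuner time over the 1.8.2 release.
overhead_pct = {
    name: round(100 * (secs - baseline) / baseline)
    for name, secs in tuner_seconds.items()
}
print(overhead_pct)
# {'katago 1.8.2': 0, 'katagotuning2': 9, 'katagotuning4': 36, 'katagotuning8': 82}
```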

lightvector commented 3 years ago

Great! So from your result it looks like there is a little bit of a tuning speed cost for going to batch size 4, but it's not that large, and given the greater stability of the FP16 storage result on your hardware, it seems like that's likely to just be an improvement for most people. Perhaps more could be done, but I'll plan to make at least this small change for the upcoming release. Thanks for the testing!

lightvector / KataGo

v1.8.2 Contribute Slower Than v1.8.1? #470