lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

Possible memory leak issue with KataGo 1.12? #756

Open yauwing opened 1 year ago

yauwing commented 1 year ago

I tested 400 games at 1000 maxPlayouts and 7.5 komi between two identical copies of KataGo 1.12 running the TensorRT backend, using the LizzieYZY GUI on a 4-GPU system. I noticed in Task Manager that the memory usage of katago.exe grew gradually from 1.6GB to 2.4GB. If I re-run the test without quitting LizzieYZY, the memory usage of katago.exe keeps growing from 2.4GB. If I quit LizzieYZY and re-run the test, the memory usage of katago.exe starts from 1.6GB again.

Since there are two copies of katago.exe running simultaneously, the combined memory usage was growing by about 4MB per game ((2.4GB − 1.6GB) × 2 processes / 400 games).

BTW, the test I am doing is trying to see whether the initial win rate indicated on the empty board matches the actual end results. So far it is within 2% at 1000 maxPlayouts.
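(As a quick sanity check on that 2% figure: with only 400 games, the sampling noise on a measured win rate is itself about ±2.5%, so an agreement this close is within noise. The arithmetic, assuming a roughly even win rate:)

```python
import math

# Sampling noise on a win rate measured over N games: with p near 0.5,
# one standard error is sqrt(p * (1 - p) / N).
N = 400
se = math.sqrt(0.5 * 0.5 / N)
print(f"standard error over {N} games: +/- {se:.1%}")  # +/- 2.5%
```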

lightvector commented 1 year ago

Does it keep growing beyond 2.4GB? If so, how much? Is there a limit or does it grow unboundedly with more and more games?

What is nnCacheSizePowerOfTwo set to in your config? At the default setting, the cache would not be filled up by a single game at 1000 playouts per move, so with default settings memory would be expected to grow mostly over the first 5-20 games as more positions become cached, then more slowly over the next tens or hundreds of games due to memory fragmentation, since the freeing and reuse of memory will not be perfect. Even if there is no leak, we would still expect memory use to be much larger after many games than just after a restart; the difference is that the growth rate should slow down, gradually approaching some upper limit.
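(To make the "one game cannot fill the default cache" point concrete, here is a rough upper-bound calculation; the ~250 moves per game and the one-new-position-per-playout bound are illustrative assumptions, not figures from this thread:)

```python
# Rough upper bound on unique positions one game can add to the NN cache.
playouts_per_move = 1000   # test setting from this thread
moves_per_game = 250       # assumption: a typical full game
cache_entries = 2 ** 20    # default nnCacheSizePowerOfTwo = 20

# Each playout evaluates at most one new leaf position; tree reuse between
# moves and transpositions make the true unique count much smaller.
print(playouts_per_move * moves_per_game, "<=", cache_entries)
# 250000 <= 1048576
```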

lightvector commented 1 year ago

Also I can already tell you that the winrate indicated when the board is empty should not match except by coincidence. Maybe 1000 max playouts is the right number to make a pretty close match by coincidence. Maybe not. You can test it. :)

This is because the winrate output is trained using the noise, visit counts, and other settings used during self-play training, because that is what produces the data the winrate output is tuned to predict. But in actual matches the result will vary depending on your settings. If you use very weak settings (very low visits), the result should be closer to 50% than the winrate predicts, because there will be more blunders by both sides, making the outcome of the game more of a coin flip. If you had the compute power to use extremely high visits (100Ks, millions, billions), the result would probably deviate increasingly far from both 50% and the predicted winrate, moving closer to 0% or 100%, if you were using theoretically-unfair komis (i.e. noninteger komis that do not allow for draws); with a fair komi that allowed draws, it might deviate from 50% at first but then at some point return back closer to 50%.

yauwing commented 1 year ago

nnCacheSizePowerOfTwo is 20. It seems to keep growing beyond 2.4GB when I conduct another test after the 400 games, but I aborted it after a few dozen games for fear I would eventually run out of memory.

lightvector commented 1 year ago

Yes, that's the default, and 2.4GB doesn't seem that surprising. 2^20 entries is about 1 million entries, and each entry is about 1.5KB, so that's 1.5GB. KataGo also uses some memory for things besides the cache (which is why there is memory use on startup even though the cache is empty). So reaching 2.4GB is probably not so surprising?
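(Plugging other cache sizes into the same per-entry estimate; the ~1.5KB figure is from the comment above and is approximate:)

```python
# Approximate NN cache memory for a given nnCacheSizePowerOfTwo,
# using the ~1.5KB-per-entry estimate from this thread.
def cache_gb(power_of_two, entry_kb=1.5):
    return (2 ** power_of_two) * entry_kb / (1024 ** 2)  # KB -> GB

for p in (15, 18, 20):
    print(f"nnCacheSizePowerOfTwo = {p}: ~{cache_gb(p):.2f} GB")
# 15: ~0.05 GB   18: ~0.38 GB   20: ~1.50 GB
```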

I think you might need to run longer to tell for sure whether there is a leak, or whether this is just normal behavior. If the RAM on your computer is only something like 4GB so that you are worried about running out, you can decrease nnCacheSizePowerOfTwo to 18; the cache will then use much less memory, giving you more of a buffer to run games for longer and see how much longer the usage keeps increasing.
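(For reference, this is a one-line change in the .cfg file passed to katago on startup; default_gtp.cfg is the usual example name, but use whichever config you launch with:)

```
nnCacheSizePowerOfTwo = 18   # 2^18 entries, roughly 0.4GB at ~1.5KB per entry
```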

yauwing commented 1 year ago

Just tried 1000 games at maxPlayouts = 200 to save time; memory usage passed 3GB. It looks like memory usage increases more slowly at lower maxPlayouts.

BTW, the win rate became close to 50% as you predicted.

Cabu commented 9 months ago

[Two screenshots, 2023-11-26: katago memory usage after running for a day vs. just after a restart]

After running katago for a day and just after restarting it...

TTXS123OK commented 9 months ago

> nnCacheSizePowerOfTwo

Hello, I've been a user of KataGo for 4 years and I'm quite interested in this memory issue. I ran a simple Valgrind analysis over a kata-analyze session, and Valgrind reported no memory leak problems after KataGo quit. This suggests that no memory is left unreleased once the KataGo process terminates.
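(For anyone who wants to repeat this check, the invocation would look something like the following sketch; the model and config paths are the ones from my setup described below:)

```
valgrind --leak-check=full ./katago gtp -model models/10b.txt.gz -config configs/10b.cfg
```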

However, I've noticed an issue. I'm currently using KataGo 1.13.2, compiled with -DUSE_BACKEND=EIGEN -DUSE_AVX2=1, and running it with ./katago gtp -model models/10b.txt.gz -config configs/10b.cfg. In 10b.cfg, I've set nnCacheSizePowerOfTwo = 15, which, by your calculation above, should mean the cache occupies no more than 100MB. When I analyze with kata-analyze interval 100 maxmoves 3 and watch the process in htop, I observe that after some time the memory usage significantly exceeds the initial memory (990MB) plus the cache (100MB), reaching over 2000MB and continuing to grow. This might not be a result we would like to see.
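(To put numbers on that growth over time, a minimal monitoring sketch along these lines can log the resident memory of the running katago process; it assumes the psutil package is installed, and the one-minute interval and process-name match are arbitrary choices:)

```python
import time
import psutil

# Poll the resident memory (RSS) of a running katago process once a minute.
# Assumes exactly one katago process; adjust the match for your setup.
def find_katago():
    for p in psutil.process_iter(["name"]):
        if p.info["name"] and "katago" in p.info["name"].lower():
            return p
    raise RuntimeError("no katago process found")

proc = find_katago()
while proc.is_running():
    rss_mb = proc.memory_info().rss / (1024 ** 2)
    print(f"{time.strftime('%H:%M:%S')}  rss = {rss_mb:.0f} MB", flush=True)
    time.sleep(60)
```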

Another point worth noting is that after a clear_board command, KataGo does not release the previously requested memory back to the operating system. Instead, it continues to compute within the previously allocated memory and after a while starts requesting new memory, so the memory occupied by the KataGo process only ever increases for as long as the process is alive.

If possible, I would like to try to investigate the cause of this issue and find a solution. Thank you, lightvector, for your patient explanation and guidance!