lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

A question in the "match" module #916

Open · sbbdms opened this issue 6 months ago

sbbdms commented 6 months ago

Hi.

Recently I have been testing some custom modifications using KataGo's "match" module. There are 2 bots in my test, which use the same NN model but different parameter settings. These settings can be written in the config file in either of the two forms below:

(1)

    botName0 = a
    botName1 = b
    nnModelFile = nn.bin.gz

(2)

    botName0 = a
    botName1 = b
    nnModelFile0 = nn0.bin.gz
    nnModelFile1 = nn1.bin.gz

(nn.bin.gz, nn0.bin.gz, and nn1.bin.gz are actually the same NN model file.)

For quite a long time I had been using the first form in the config file. Later I accidentally found that the second form runs ~20% faster than the first form (with 3x 4090, 288 games in parallel, 1 thread per game).

Is it possible to modify the code so that KataGo automatically adjusts its model usage to the second form, even when the config file is written in the first form?

Thanks!

lightvector commented 6 months ago

What are you setting nnMaxBatchSize to?

sbbdms commented 6 months ago

It is the default value 32.

Sorry, since I always use the default nnMaxBatchSize value from gtp_example.cfg, I neglected to adjust this value for match.

I noticed that the recommended nnMaxBatchSize value for GTP is (numSearchThreads / numNNServerThreadsPerModel), so I guess the recommended value for my settings is 288/3 = 96?
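For reference, the relevant match config lines for this setup would look roughly like the sketch below. `numGameThreads` is an assumption here for the parameter controlling the number of parallel games; the other names are the ones discussed above.

    # 288 games in parallel, 1 search thread per game -> 288 total search threads
    numGameThreads = 288
    numSearchThreads = 1

    # one NN server thread per GPU (3x 4090)
    numNNServerThreadsPerModel = 3

    # recommended: total search threads / NN server threads = 288 / 3 = 96
    nnMaxBatchSize = 96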

sbbdms commented 6 months ago

I tried adjusting nnMaxBatchSize from 32 to 96 with the first form; however, it still seems to be ~10% slower than the second form with nnMaxBatchSize = 32.

lightvector commented 6 months ago

That's interesting!

It wouldn't be appropriate to make this a general recommendation, and definitely not appropriate to modify the code to do it automatically, if we can't understand how general this is and what causes it. I think we probably wouldn't modify the code to do it automatically regardless, because loading the model multiple times on the GPU takes extra memory, which might not be suitable for some users. But it would still be interesting to learn more about this.