lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

[Question] Is KataGo 40b saturated? #558

Open y-ich opened 2 years ago

y-ich commented 2 years ago

Hi.

Excuse me if this was already discussed on Discord. (It is hard to follow Discord's threads...)

Now the strongest confidently-rated network is kata1-b40c256-s9948109056-d2425397051, whose "s" number has dropped back below 10G. Is the 40b weight saturated?

lightvector commented 2 years ago

Thanks for the question. 40b is probably not saturated; it will probably continue to improve along the same overall trend for a little while more. One contributing issue is that 60b networks are taking up a lot of the computation for rating games, since they have higher ratings (so the pairing algorithm tends to prefer them), and 60b rating games are also much slower and more costly because the network is bigger.

Remember, it is not simply "strongest network" (because there is statistical noise), it is "strongest confidently-rated network". If all the recent networks are less confidently rated because 60b is stealing away their rating games, then it is of course harder for them to be "strongest confidently-rated" even if they are stronger.
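To illustrate the distinction, here is a minimal sketch of one way "confidently-rated" could be ranked: by a lower confidence bound on Elo rather than by the Elo estimate itself. The rule `elo - z * stderr`, the value of `z`, and all network names and numbers below are assumptions for illustration, not the actual rule or data used by katagotraining.org.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    name: str      # hypothetical network name
    elo: float     # estimated Elo
    stderr: float  # uncertainty of the estimate (larger when few rating games were played)


def strongest_confident(ratings, z=2.0):
    """Pick the network with the highest lower bound elo - z * stderr (assumed rule)."""
    return max(ratings, key=lambda r: r.elo - z * r.stderr)


# Made-up numbers: the newer network has a higher estimate but fewer rating games.
ratings = [
    Rating("older-net", 13500.0, 15.0),  # lower bound 13470
    Rating("newer-net", 13520.0, 40.0),  # lower bound 13440
]
best = strongest_confident(ratings)
print(best.name)
```

Under this assumed rule the older network stays on top until the newer one has played enough rating games to shrink its uncertainty, which matches the situation described above.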

In particular, kata1-b40c256-s10150379520-d2474366098 is a good candidate for a network that might be stronger. That is of course not guaranteed, and its rating is certainly less confident right now.

I'll see if I can adjust the pairing algorithm to put a bit more focus on the 40b networks again and less on the 60b for now.

Regardless, we will also be switching to 60b before too long. If you or anyone else would be able to do some time-parity tests on 60b vs 40b, that would be a great help. It's been a few months since anyone has done such tests at a serious scale, at least as far as posting results to here or Discord.

sbbdms commented 2 years ago

> Regardless, we will also be switching to 60b before too long. If you or anyone else would be able to do some time-parity tests on 60b vs 40b, that would be a great help. It's been a few months since anyone has done such tests at a serious scale, at least as far as posting results to here or Discord.

Recently I finished a test between b60s457 and b40s1007, which were the strongest confidently-rated 60b and 40b networks when I started the test.

(1) In the test, b60s457 plays with 8000 playouts/move and b40s1007 plays with 16000 playouts/move. In my observation, at time parity the 40b network can play a little more than 2 times the playouts of the 60b network, so this setting might be slightly unfair to the 40b network.

(2) The test uses several pairs of custom parameters, namely the final set of params from my custom test mentioned in https://github.com/lightvector/KataGo/issues/508. Whether the final set of params is stronger than the currently released params is quite hard to assess. The result may vary a lot between different networks. However, its overall strength should not be worse than the current one?

(3) 100 match games were played in total. The 60b network has a 53% winrate. Considering that the 40b network can play a few more playouts at time parity, I think the current strength of the 40b and 60b networks at time parity might be similar. However, the 60b network has greater potential than the 40b, so I think switching to 60b is worth considering.

I will organize my procedure, SGFs and the results of my test, maybe this weekend. Please point out any mistakes in my test.

I am also looking for more assessments of my final set of params (especially its overall strength with different strong networks). If its overall strength proves to be weaker in your further assessment, you can consider assessing the 2nd/3rd/4th set of params. I hope that at least one set of params from the test helps.
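As a rough check on how much a 53% result over 100 games can tell us, the sketch below converts a winrate to an Elo difference with the standard logistic formula and computes a normal-approximation 95% interval. It assumes exactly 53 wins and no draws or void games, which the post does not state.

```python
import math


def elo_from_winrate(p):
    """Elo difference implied by winrate p under the standard logistic Elo model."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def winrate_interval(wins, games, z=1.96):
    """Normal-approximation confidence interval for the true winrate."""
    p = wins / games
    half = z * math.sqrt(p * (1.0 - p) / games)
    return p - half, p + half


games = 100
wins = 53  # assumption: 53% of 100 games, no draws
low, high = winrate_interval(wins, games)

print(f"point estimate: {elo_from_winrate(wins / games):+.0f} Elo")
print(f"95% interval:   {elo_from_winrate(low):+.0f} to {elo_from_winrate(high):+.0f} Elo")
```

The interval spans both sides of zero, so a 100-game match of this kind is consistent with either network being somewhat stronger. That supports the "roughly similar" reading rather than a clear edge for 60b.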

Friday9i commented 2 years ago

Cool, so 60b and 40b are around the same strength at time parity with 8K/16K visits, excellent! If you want to do more tests, it would probably be useful to test at "selfplay rhythm", i.e. 1K/2K visits. 60b may be a bit weaker there, but from all the tests I did in the past, I expect the result to be quite comparable. @lightvector: from this result, if we assume 40b and 60b are close to parity in strength, wouldn't it be efficient to switch to 60b now with somewhat fewer visits than currently, e.g. 30% fewer? Selfplay would still be stronger (despite the 30% fewer visits) and the speed penalty would be quite reasonable.

lightvector commented 2 years ago

Thanks for the tests! It sounds like we're close enough to time parity that I agree with switching soon. I will get the release supporting TensorRT out, and then we can drop the learning rate for 40b, run for a little while more, and then switch to 60b after that. I'd be interested to see the more detailed numbers.

sbbdms commented 2 years ago

Hi!

I have uploaded the data from my latest test to https://github.com/lightvector/KataGo/issues/508. The data includes the match games between b60s457 and b40s1007 which I mentioned above. Please take a look. Thanks!

dionren commented 2 years ago

@sbbdms I think we should use more visits to evaluate 40b/60b, because with more visits, like 100K vs 200K, the situation may be different. With 8x3090, 15s/move gives about 1M visits. If you can do more tests, I'm willing to donate GPUs for you.

sbbdms commented 2 years ago

@dionren The data in https://github.com/lightvector/KataGo/issues/508 contains the config file of the match, and the BayesElo commands in the description file (to assess the Elo). If you have enough resources to run the match with extremely high playouts per move, you can use the config file with the modifications below:

maxPlayouts0 = 200000
maxPlayouts1 = 100000

You may also change the "CUDA GPU settings" or "OpenCL GPU settings" to enable multiple GPUs for the match, though I have no experience with that... Running 100 such games with 8x3090 might take around 3~4 days in total. However, I guess the result may not differ too much from mine. ;)

dionren commented 2 years ago

@sbbdms thanks, i'll take a look.

dionren commented 2 years ago

@sbbdms Where can I get readpgn? And how do I generate a pgn from the sgf files?

sbbdms commented 2 years ago

@dionren

"readpgn" is a command in BayesElo (https://www.remi-coulom.fr/Bayesian-Elo/#download). The command should be followed by a filename with the suffix ".pgn", e.g. "readpgn minimal.pgn".

I use a script from the Internet which renames the SGFS files to something like "1.sgfs", "2.sgfs", "3.sgfs"...

rename.zip

Then I wrote a program in C++ to extract the results from the SGFS files with the filenames specified above. You can compile it yourself:

sgfsExtractor.zip

After that, you can extract the results from the SGFS files into a file named "minimal.pgn" as follows:
(1) Copy the script, the compiled program above, and the SGFS files into the same directory. The SGFS files should have their initial filenames; if their filenames are already like "1.sgfs", "2.sgfs", the script might not work.
(2) Run the script to rename the SGFS files.
(3) Run the compiled program. The results of the SGFS files are then saved in a file named "minimal.pgn".
Then you can use the "readpgn minimal.pgn" command in BayesElo. The full command for using BayesElo is included in the first description file in https://github.com/lightvector/KataGo/issues/508.
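For anyone who prefers not to compile the C++ program, the extraction step can be sketched in Python. This is not the sgfsExtractor program from the zip above, only an illustration of the same idea. It assumes each ".sgfs" file holds one game per line, that player names contain no escaped "]" characters, and that a PGN record with only White, Black and Result tags is enough for BayesElo's "readpgn". The helper names are hypothetical.

```python
import re
from pathlib import Path

# One pattern per SGF property; the lookbehind avoids matching inside longer property names.
PROPS = {name: re.compile(r"(?<![A-Z])" + name + r"\[([^\]]*)\]") for name in ("PB", "PW", "RE")}


def sgf_to_pgn_record(sgf):
    """Return a minimal PGN record for one SGF game, or None if it has no usable result."""
    fields = {}
    for name, pattern in PROPS.items():
        match = pattern.search(sgf)
        if match is None:
            return None
        fields[name] = match.group(1)
    result = fields["RE"]
    if result.startswith("W+"):
        score = "1-0"       # PGN convention: White's score comes first
    elif result.startswith("B+"):
        score = "0-1"
    elif result in ("0", "Draw"):
        score = "1/2-1/2"
    else:
        return None         # empty or unrecognized result: skip the game
    return (f'[White "{fields["PW"]}"]\n[Black "{fields["PB"]}"]\n'
            f'[Result "{score}"]\n{score}\n')


def convert_directory(directory, out_name="minimal.pgn"):
    """Convert every game in every *.sgfs file in a directory into one PGN file."""
    records = []
    for path in sorted(Path(directory).glob("*.sgfs")):
        for line in path.read_text(encoding="utf-8").splitlines():
            record = sgf_to_pgn_record(line)
            if record is not None:
                records.append(record)
    Path(directory, out_name).write_text("\n".join(records), encoding="utf-8")
    return len(records)


sample = "(;GM[1]SZ[19]PB[b40]PW[b60]RE[W+R];B[dp];W[pp])"
print(sgf_to_pgn_record(sample))
```

Because this version globs for any "*.sgfs" file, the renaming step is not needed here. Calling `convert_directory(".")` in the folder with the match output would write "minimal.pgn" next to the SGFS files.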

dionren commented 2 years ago

@sbbdms ok

michito744 commented 2 years ago

@sbbdms This can happen. (Attached screenshots: 2021-10-18 (2), 2021-10-18 (4), 2021-10-18 (3))

I always use this phase to test if the generated network can be used for evaluation.

dionren commented 2 years ago

@michito744 Can you share the SGF for this?

michito744 commented 2 years ago

@dionren ok.

(;CA[UTF-8]KM[6.5]GM[1]SZ[19]GN[]PW[W]CP[]AP[Lizzie: 0.7.4]DT[2021-10-19]EV[]PB[B]RE[]PC[]TM[]CA[UTF-8];B[dp];W[pp];B[dc];W[pd];B[nc];W[ce];B[dh];W[fe];B[fc];W[cj];B[ee];W[dg];B[cg];W[ed];B[dd];W[cf];B[bh];W[eh];B[di];W[bg];B[ch];W[de];B[ef];W[fd];B[eg];W[ec];B[eb];W[fb];B[db];W[gc];B[cd];W[bd];B[bc];W[fh];B[df];W[lc];B[ne];W[pf];B[ng];W[ph];B[jc];W[kd];B[ie];W[jd];B[id];W[ic];B[hc];W[ib];B[hb];W[jb];B[ff];W[gf];B[ge];W[fg];B[dg];W[he];B[hd];W[gd];B[hg];W[gg];B[hh];W[gb];B[ia];W[ja];B[ha];W[fj];B[hf];W[ge];B[kb];W[kc];B[ka];W[jc];B[kf];W[lf];B[kg];W[md];B[nd];W[mf];B[nf];W[lh];B[lg];W[mg];B[mh];W[nh];B[mi];W[ke];B[me];W[le];B[kh];W[jf];B[if];W[mc];B[pc];W[qc];B[qb];W[pb];B[oc];W[rb];B[qd];W[rc];B[oh];W[ni];B[pg];W[oi];B[og];W[li];B[mj];W[lj];B[mk];W[ji];B[hj];W[pi];B[qg];W[qe];B[ri];W[qj];B[rj];W[ql];B[rk];W[pm];B[jg];W[qh];B[rh];W[rf];B[rg];W[ob];B[pj];W[oj];B[rm];W[rl];B[sl];W[qn];B[rn];W[qo];B[rp];W[ro])

456 has a super hard mode game that KataGo can't solve properly.

sbbdms commented 2 years ago

The three most recent 40b networks (s1031, s1033 and s1035) seem to show a significant Elo gain. Is that the result of dropping the learning rate?

Also, I would like to know whether my data and conclusion above have been read during these weeks...