lcwd2 opened this issue 5 years ago
If we played 400 more games per match, maybe 90 percent of promoted weights would not have been promoted; it would be very boring.
Even with a true winning rate of 55%, the probability that the sample winning rate exceeds 55% is only 0.5. Passing both tests is not easy.
Another thing is that those two upgrades in a short time usually came after the previous one lasted for a long time. Yes, the first one might have passed because of luck, but that might not be a bad thing. It may help jump out of a local minimum.
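As a quick sanity check of that 0.5 figure (assuming the gate is 220 wins out of 400 games, i.e. 55%, and independent games), a couple of lines of Python:

```python
# Probability that a truly-55% net clears the 55% pass mark, and that it
# clears two such gates in a row.
from scipy.stats import binom

p_pass = binom.sf(219, 400, 0.55)  # P(wins >= 220 of 400) ~ 0.52
print(p_pass, p_pass ** 2)         # one gate ~0.52, both gates ~0.27
```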
According to my test, quantization is cutting the weights' strength. Shall we use the original weights from the training output?
My match results for the first model in each of the two pairs of fast upgrades are:
lz243 vs lz242: 220 wins out of 427 games (51.5%)
lz246 vs lz245: 211 wins out of 427 games (49.4%)
Is it possible to schedule an official test so that we can get an idea of the extent of possible result variation?
Re-scheduling / repeating a few selected matches to measure match variance is something I've been suggesting for ages, but so far nobody else seemed interested. (Yet there have been various theories and comparisons involving match results - none of which make sense without such background.)
Ok I can't request tests with the same exact params as the prior match event, but I'll run them at visits=1602. Closest I can do. :)
243 vs 242 is at 47% now. It's normal, because bjiyxo's weights passed their tests and only a few passed the official match.
Thank you. It's a pity the reruns failed at around 300 games, so their nominal results are not to be taken literally, but this still explains a few things. It seems recent 55% results are based almost completely on luck (related to batching?).
Although I don't expect to see any striking differences (besides winrates) I'll also post analyses to #2322.
> It seems recent 55% results are based almost completely on luck (related to batching?).
Could be pure luck. A net with a true winrate of 50% will win 219 games or fewer out of 400 with probability 0.9745, which is roughly the rate at which it fails. If 20 such nets play against the current best, the probability that all of them fail is 0.9745^20 = 0.5965. That means that with 40% probability we will see at least one net pass by luck.
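The numbers above can be reproduced with scipy's binomial distribution (same 220-of-400 pass mark assumed):

```python
from scipy.stats import binom

p_fail = binom.cdf(219, 400, 0.5)  # P(a 50% net wins <= 219 of 400) ~ 0.9745
p_all_fail = p_fail ** 20          # 20 independent candidates all failing
print(p_fail, p_all_fail, 1 - p_all_fail)  # ~0.9745, ~0.5965, ~0.4035
```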
The calculation also assumes there are no nets >50% (otherwise you would have a higher chance to also see some more consistent results).
Originally I would have expected most promotions to be around 51% (with 4% from luck). Of course this is still possible, but the nominal sd here is 2.5%, and if two randomly chosen reruns both dropped 3 sd, that may (or may not :)) mean something. In any case, whether results are more consistent without batching is something that can actually be tested.
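For reference, the nominal sd figure follows from the binomial standard error; a minimal sketch, assuming independent games:

```python
import math

def winrate_sd(p: float, n: int) -> float:
    """Standard deviation of a measured winrate over n independent games."""
    return math.sqrt(p * (1 - p) / n)

print(winrate_sd(0.5, 400))  # 0.025, i.e. 2.5%; a 3-sd drop is ~7.5 points
```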
I don't think it will be much of a problem. Other AIs already use low gatekeeping to add variety to training data generation. Why don't we test the changes that way?
Minigo's data has shown that LZ makes a great Elo improvement from one net to the next, but shows little improvement delta past one net (see below). A way to improve generalised improvement is to match the test net against the current net and the previous 2 nets. This should improve the robustness of future "best nets". I'd personally suggest:
1. If >53% winrate over the current net, go to step 2.
2. If >53% winrate over the 2 previous best nets, then promote.
This promotion protocol is supported by the previous discussion of what the selection process could be. There has been much discussion above about the best promotion schedule, but what the evidence shows is that we need to focus not just on beating the previous net, but on making sure the next best net can beat the others too.
The reduced winrate requirement is justified by the increased statistical power from the larger number of match games (a code sketch of this gate follows below).
Graph of ELO delta over model numbers https://i.imgur.com/ueIXzm8.png
Minigo Model graphs: https://cloudygo.com/leela-zero/graphs
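A minimal sketch of the proposed two-step gate in Python; `match_winrate` is a hypothetical stand-in for actually scheduling the matches, and the 0.53 threshold and 400-game count are just the values suggested above:

```python
def match_winrate(net_a, net_b, games):
    """Hypothetical helper: play `games` games, return net_a's winrate."""
    raise NotImplementedError  # engine integration would go here

def should_promote(candidate, best_nets, threshold=0.53, games=400):
    """best_nets: most recent best first, e.g. [current, prev1, prev2]."""
    current, *previous = best_nets[:3]
    # Step 1: the candidate must beat the current best net.
    if match_winrate(candidate, current, games) <= threshold:
        return False
    # Step 2: it must also beat the two previous best nets.
    return all(match_winrate(candidate, prev, games) > threshold
               for prev in previous)
```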
Conclusion first: the No. 243 promotion didn't give us a fake, but a genuinely good net.
For this promotion, binning the games into equal intervals by game length, we have:
Game length (moves) | LZ 243 wins | LZ 242 wins | winrate_of_match_in_length (LZ 243 wins ÷ LZ 242 wins) | winrate_reverse_calc (cumulative LZ 243 winrate over games ≥ this length) |
---|---|---|---|---|
80 | 0 | 0 | 0 | 0 |
100 | 0 | 6 | 0.00% | 56.64% |
120 | 14 | 4 | 350.00% | 57.45% |
140 | 14 | 9 | 155.56% | 56.53% |
160 | 25 | 12 | 208.33% | 56.27% |
180 | 21 | 19 | 110.53% | 55.03% |
200 | 24 | 26 | 92.31% | 55.37% |
220 | 18 | 16 | 112.50% | 56.85% |
240 | 17 | 14 | 121.43% | 57.48% |
260 | 26 | 18 | 144.44% | 57.92% |
280 | 28 | 19 | 147.37% | 57.55% |
300 | 27 | 20 | 135.00% | 56.52% |
320 | 15 | 13 | 115.38% | 55.56% |
340 | 8 | 6 | 133.33% | 58.82% |
360 | 0 | 0 | 0.00% | 66.67% |
380 | 1 | 1 | 100.00% | |
400 | 1 | 0 | — | |
420 | 0 | 0 | — | |
No. 243 has excellent performance in short games, which might indicate its opponent has a ladder problem or isn't good at some life-and-death sequences. I wasn't sure how good 243 is when the game goes deeper, so I calculated cumulative winrates from the bottom up (400-300-200-100), and at every level the winrate is 55% or above! I see a good player, quick and steady. So this is No. 243 at first glance.
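For reference, here is a small Python sketch of the winrate_reverse_calc column under my reading of it (the cumulative winrate over all games at least as long as each bin); the win counts are taken from the table above, and the 100-move entry reproduces the table's 56.64%:

```python
# Sketch: cumulative winrate from the bottom of the length-binned table,
# i.e. for each bin, wins over all games at least that long.
def reverse_cumulative_winrate(wins_a, wins_b):
    out, a_tail, b_tail = [], 0, 0
    for a, b in zip(reversed(wins_a), reversed(wins_b)):
        a_tail, b_tail = a_tail + a, b_tail + b
        out.append(a_tail / (a_tail + b_tail) if a_tail + b_tail else None)
    return list(reversed(out))

# Per-bin win counts (bins 80, 100, ..., 420) from the LZ243-vs-LZ242 table:
lz243 = [0, 0, 14, 14, 25, 21, 24, 18, 17, 26, 28, 27, 15, 8, 0, 1, 1, 0]
lz242 = [0, 6, 4, 9, 12, 19, 26, 16, 14, 18, 19, 20, 13, 6, 0, 1, 0, 0]
print(reverse_cumulative_winrate(lz243, lz242)[1])  # ~0.5664, matching 56.64%
```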
For the re-match, binning the games by length in the same way, we have:
Game length (moves) | LZ 242 wins | LZ 243 wins | winrate_of_match_in_length (LZ 242 wins ÷ LZ 243 wins) | winrate_reverse_calc (cumulative LZ 242 winrate over games ≥ this length) |
---|---|---|---|---|
80 | 0 | 0 | 0.00% | 0 |
100 | 0 | 1 | 0.00% | 51.72% |
120 | 3 | 2 | 150.00% | 51.90% |
140 | 5 | 2 | 250.00% | 51.76% |
160 | 9 | 13 | 69.23% | 51.26% |
180 | 14 | 15 | 93.33% | 52.16% |
200 | 19 | 11 | 172.73% | 52.65% |
220 | 19 | 6 | 316.67% | 51.02% |
240 | 8 | 17 | 47.06% | 47.37% |
260 | 8 | 18 | 44.44% | 50.00% |
280 | 22 | 18 | 122.22% | 54.17% |
300 | 19 | 21 | 90.48% | 53.75% |
320 | 10 | 9 | 111.11% | 60.00% |
340 | 8 | 5 | 160.00% | 66.67% |
360 | 3 | 2 | 150.00% | 75.00% |
380 | 2 | 0 | — | |
400 | 1 | 0 | — | |
420 | 0 | 0 | — | |
When calculating winrates in reverse (cumulative) order, we can see No. 242 may not be as bad in the longer games as it looked before, but No. 243 played better in the shorter games.
Is No. 243 a lucky promotion? I have to say it has some good skills.
I agree with @jillybob on modifying the promotion schedule.
Beat them all!
I'm not sure how you interpret "winrate_reverse_calc", but if - as it appears from the first table - this is the cumulative winrate from the bottom up, then something seems wrong with the top of the 2nd table (the real result was 48%).
These two rerun results cannot be taken too seriously yet, but they seem to indicate promotions currently happen randomly, almost completely from luck. In that case neither of these suggestions seems promising - they may only make promotions rarer. Sounder promotions are achievable by other, simpler means as well: more games, or more visits (the latter increases the differences between nets, so it only helps if there ARE differences to select on).
But are there? IMO the problem is not that some random lucky nets get promoted - that is just the consequence of not having truly better nets among the candidates. Normally the latter should be the more common kind of promotion (good nets need less luck to pass). So this seems to be more of a problem with selfplay quality or training.
I have heard opinions that this is just the 40b plateau already, but my personal feeling is that LZ has not reached AGZ level yet, so there must be more potential in 40b. And these problems with slower progress seem to have started around the time batching was enabled. Remember the HUGE (!) difference in match statistics (forking speed) between around 215 and around 220? Something significantly changed there wrt matches, and presumably in selfplay as well. And IIRC that was around the time batching and LCB came into the picture.
To jump out of a local minimum, shall we force-promote the weight that seems best if no weight has passed after a large number (for example 200k) of self-play games?
I have done some small-scale experiments on a 9x9 board to check the relationship between the number of games and the win%. I only used games from the current best model for the experiment, instead of a fixed moving window including games from previous models as in AZ. The relationship seems to be nicely linear, at least up to the 55% mark. In other words, we cannot expect an upgrade until there are sufficient self-play games from the current best model. This seems logical. If this is really the case, I wonder whether older games really contribute to or stabilize an upgrade.
The only genuine upgrade between lz242 and lz247 is lz245, which was promoted with around 200K games. The other upgrades are either fake or remedial. This probably suggests that, if the upgrades are genuine, we can reduce the moving window to 200K games to focus more on learning from the current model. The AZ choice of a 500K window might have compensated for the uncertain 55% upgrades, for which we need older data for remedial upgrades after the fake ones. But a fake upgrade will create a bit of fake data that can smokescreen the training!
Nevertheless, I don't have sufficient resources to carry out large scale experiments.
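For illustration only, a moving window of the kind discussed above can be modelled as a bounded buffer; this is a sketch of the idea rather than LZ's actual training pipeline, with the 200K figure taken from the suggestion above:

```python
# Minimal sketch of a moving training window: self-play games are
# appended in chronological order and the oldest fall out automatically
# once the window is full.
from collections import deque

window = deque(maxlen=200_000)  # 200K-game window, as suggested above

def add_game(game):
    window.append(game)  # evicts the oldest game when the buffer is full
```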
CGOS: LZ vs ELFv2 series (ELFV2_MATCH.TXT)

Network | Rating | Wins / Games | Win %
---|---|---|---
LZ_241_466f_p400 | 3299 | 12 / 32 | 37.50
LZ_242_832f_p400 | 3297 | 72 / 203 | 35.47
LZ_247_lad_c5_p400 | 3289 | 16 / 49 | 32.65
LZ_245_3691_p400 | 3249 | 28 / 75 | 37.33
LZ_243_ece8_p400 | 3238 | 15 / 43 | 34.88
LZ_244_2bae_p400 | 3236 | 30 / 93 | 32.26
LZ_246_251e_p400 | 3222 | 10 / 16 | 62.50
LZ_240_0e17_p400 | 3197 | 13 / 17 | 76.47
LZ_247_901e_p400 | 3195 | 31 / 76 | 40.79
LZ_238_fe85_p400 | 3283 | 50 / 112 | 44.64
LZ_239_3bd9_p400 | 3279 | 7 / 17 | 41.18
LZ_237_657a_p400 | 3271 | 7 / 27 | 25.93
LZ_232_06e0_p400 | 3271 | 59 / 136 | 43.38
LZ_235_a4f5_p400 | 3263 | 10 / 29 | 34.48
LZ_233_16d9_p400 | 3246 | 48 / 107 | 44.86
LZ_234_ac9b_p400 | 3223 | 74 / 169 | 43.79
LZ_236_1d93_p400 | 3219 | 43 / 84 | 51.19
LZ_231_f178_p400 | 3219 | 37 / 78 | 47.44
LZ_230_a954_p400 | 3217 | 13 / 38 | 34.21
CGOS: LZ vs KataGo series (KATA_MATCH.TXT)

Network | Rating | Wins / Games | Win %
---|---|---|---
LZ_242_832f_p400 | 3294 | 57 / 170 | 33.53
LZ_247_lad_c5_p400 | 3288 | 6 / 42 | 14.29
LZ_245_3691_p400 | 3250 | 42 / 116 | 36.21
LZ_244_2bae_p400 | 3234 | 40 / 114 | 35.09
LZ_243_ece8_p400 | 3233 | 14 / 31 | 45.16
LZ_246_251e_p400 | 3221 | 6 / 19 | 31.58
LZ_247_901e_p400 | 3201 | 18 / 55 | 32.73
To jump out of a local minimum, shall we force-promote the weight that seems best (9b2bef8e, 54.37% vs lz247) if no weight has passed after a large number (for example 200k) of self-play games?
I didn't know where to put this but I'll put it here in case anyone finds it interesting:
I often skim the kifu of test matches for new passing networks that win as Black by early resignation (I see more variation in the opening moves):
This caught my eye: in moves 125-141 Black kills White, making a 6-point nakade. I don't think I've seen LZ do this before, but I don't go through every test match either.
(As a weak amateur, by move 122, if I were White I also wouldn't be too worried about my shape at the bottom and would probably end the ko.)
I have done 400-game matches between lz243/244, lz246/247, and lz242/247 to try to confirm a suspected upgrade problem.
It seems that from time to time we are getting bogus upgrades. The indicator is that when we have a pair of upgrades within a short time, e.g. from lz243 to lz244, or from lz246 to lz247, the first in the pair is bogus: it scores less than 50% wins on rematch, and as a result we get another upgrade very quickly. To avoid this situation, when we get a suspected upgrade it is better to play an additional 400-game match to confirm that the upgrade is genuine. Otherwise we will be wasting a lot of computational resources. This is perhaps also the reason that we only get 60% wins for models 10 upgrades apart. At this particular moment, it seems that lz247 is no better than lz242.
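To put a rough number on the value of a confirmation match (again assuming a 220-of-400, 55% pass mark and independent games): a truly-50% net slips through one gate about 2.5% of the time, but through two gates in a row only about 0.07% of the time.

```python
from scipy.stats import binom

p_lucky = binom.sf(219, 400, 0.5)  # one 55%/400 gate: ~0.0255
print(p_lucky, p_lucky ** 2)       # with a confirmation match: ~0.00065
```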