leela-zero / leela-zero

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.

bogus upgrades #2510

Open lcwd2 opened 5 years ago

lcwd2 commented 5 years ago

I have done 400-game matches between lz243/244, lz246/247, and lz242/247 to try to confirm a suspected upgrade problem.

It seems that from time to time we are getting bogus upgrades. The indicator is that when we get a pair of upgrades within a short time, e.g. from lz243 to lz244 and from lz246 to lz247, the first in the pair is bogus: it actually wins less than 50% of games, and as a result we get another upgrade very quickly. To avoid this situation, when we get a suspected upgrade, it would be better to play an additional 400-game match to confirm that the upgrade is genuine. Otherwise we will be wasting a lot of computational resources. This is perhaps also the reason that we only get 60% wins between models 10 upgrades apart. At this particular moment, it seems that lz247 is no better than lz242.
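As a rough illustration of the cost-benefit here (a minimal sketch of my own, not project code, assuming the gate is 220 or more wins in a 400-game match): a 50%-strength net slips through a single gate about once per 40 candidates, but almost never survives an added confirmation match.

```python
from math import comb

def pass_prob(p_true, games=400, min_wins=220):
    """Chance that a net with true winrate p_true reaches the gate (>= 55%, i.e. 220/400)."""
    return sum(comb(games, k) * p_true**k * (1 - p_true)**(games - k)
               for k in range(min_wins, games + 1))

single = pass_prob(0.50)   # ~0.026: one fluke pass per ~40 hopeless candidates
double = single ** 2       # ~0.0007 if an independent confirmation match is required
print(f"single gate: {single:.4f}, with confirmation: {double:.6f}")
```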

l1t1 commented 5 years ago

If we played 400 more games per promotion, maybe 90 percent of promoted weights would never appear; it would be very boring.

bubblesld commented 5 years ago

Even with a true winning rate of 55%, the probability that the sample winning rate exceeds 55% is only about 0.5. Passing both tests would not be easy.

Another thing is that those two upgrades in a short time usually came after the previous one had lasted for a long time. Yes, the first one might pass because of luck, but that might not be a bad thing: it may help jump out of a local minimum.
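A quick check of both claims, reusing the pass_prob sketch from earlier in the thread: a genuinely 55% net reaches the 220-of-400 gate only slightly more than half the time, so clearing two independent gates happens only about a quarter of the time.

```python
p_once = pass_prob(0.55)     # ~0.52: barely better than a coin flip
print(p_once, p_once ** 2)   # passing two independent 400-game gates: ~0.27
```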

l1t1 commented 5 years ago

According to my tests, quantization is cutting the weights' strength. Shall we use the original weights output by training?

lcwd2 commented 5 years ago

My match results for the first model in each of the two pairs of fast upgrades are:

lz243 vs lz242: 220 wins out of 427 games (51.5%)
lz246 vs lz245: 211 wins out of 427 games (49.4%)

Is it possible to schedule an official test so that we can get an idea of the extent of possible result variation?
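For a sense of the variation being asked about (my own sketch using the normal approximation, not an official test): the 95% intervals for both results above comfortably include 50%.

```python
from math import sqrt

for label, wins, games in [("lz243 vs lz242", 220, 427),
                           ("lz246 vs lz245", 211, 427)]:
    p = wins / games
    se = sqrt(p * (1 - p) / games)       # standard error of the observed winrate
    print(f"{label}: {p:.1%} +/- {1.96 * se:.1%}")
# Both 95% intervals include 50%, so neither result shows a clear edge.
```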

tapsika commented 5 years ago

Re-scheduling / repeating a few selected matches to measure match variance is something I've been suggesting for ages, but so far nobody else has seemed interested. (Yet there have been various theories and comparisons involving match results, none of which make sense without such background.)

roy7 commented 5 years ago

OK, I can't request tests with the exact same params as the prior match event, but I'll run them at visits=1602. Closest I can do. :)

l1t1 commented 5 years ago

243 vs 242 is at 47% now. It's normal, because bjiyxo's weights passed their tests and only a few passed the official match.

tapsika commented 5 years ago

Thank you. It's a pity the reruns failed at around 300 games, so their nominal results can't be taken literally, but this still explains a few things. It seems recent 55% results are based almost completely on luck (related to batching?).

Although I don't expect to see any striking differences (besides winrates) I'll also post analyses to #2322.

bubblesld commented 5 years ago

> It seems recent 55% results are based almost completely on luck (related to batching?).

Could be pure luck. A net with a true winrate of 50% will win 219 games or fewer out of 400 with probability 0.9745, which is roughly its rate of failing the gate. If 20 such nets play against the current best, the probability that all of them fail is 0.9745^20 = 0.5965. That means with roughly 40% probability we will see at least one net pass by luck.
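The arithmetic above checks out; a sketch reproducing it (same 220-of-400 gate as assumed earlier):

```python
from math import comb

# P(219 or fewer wins out of 400 | true winrate 0.5), then 20 independent failures
p_fail = sum(comb(400, k) for k in range(220)) * 0.5**400
print(p_fail, p_fail ** 20)   # ~0.9745 and ~0.5965: ~40% chance of one fluke pass
```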

tapsika commented 5 years ago

This also assumes there are no nets above 50% (otherwise you would have a higher chance of also seeing some more consistent results).

Originally I would have expected most promotions to be around 51% (with 4% coming from luck). Of course this is still possible, but the nominal sd here is 2.5%, and if two randomly chosen reruns both dropped 3 sd, that may (or may not :)) mean something. In any case, whether results are more consistent without batching is something that can actually be tested.
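For reference, how I read the quoted "nominal sd" (my own sketch): the standard deviation of an observed winrate over a 400-game match near 50%.

```python
from math import sqrt

sd = sqrt(0.5 * 0.5 / 400)                  # 0.025: 2.5 points per 400-game match
print(f"sd = {sd:.3f}, 3 sd = {3 * sd:.3f}")
# A net that passed near 55% and reruns near 47.5% has moved about 3 sd,
# which is hard to blame on sampling noise alone.
```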

22nsuk commented 5 years ago

I don't think it will be much of a problem. Other AIs already use a low gatekeeping bar to add variety to training-data generation. Why don't we test changes in that way?

jillybob commented 5 years ago

Minigo's cross-evaluations have shown that LZ gets a large Elo improvement from one net to the next, but little improvement delta past one net (see below). A way to improve generalised improvement is to match the test net against the current net and the previous 2 nets. This should improve the robustness of future "best nets". I'd personally suggest:

1. If >53% winrate over the current net, go to step 2.
2. If >53% winrate over the 2 previous best nets, then promote.

This promotion protocol is supported by the previous discussion of what the selection process could be, though there has been much discussion above about the best promotion schedule. What the evidence shows is that we need to focus not just on beating the previous net, but on making sure the next best net is able to beat others too.

The reduced winrate requirement is justified by the increased statistical power of the increased number of match games. A sketch of the proposed gate follows.
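A minimal sketch of the proposed two-stage gate (the function names and the match runner are hypothetical, not the project's actual scheduler):

```python
def should_promote(candidate, best_nets, play_match, games=400):
    """best_nets: the current best net first, then the two previous best nets.
    play_match(a, b, games) is assumed to return a's winrate against b."""
    # Stage 1: candidate must clear 53% against the current best net.
    if play_match(candidate, best_nets[0], games) <= 0.53:
        return False
    # Stage 2: candidate must also clear 53% against the two previous bests.
    return all(play_match(candidate, old, games) > 0.53
               for old in best_nets[1:3])
```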

Graph of ELO delta over model numbers https://i.imgur.com/ueIXzm8.png

Minigo Model graphs: https://cloudygo.com/leela-zero/graphs

nanzi commented 5 years ago

Conclusion first: the No. 243 promotion didn't give us a fake, but a genuinely good net.

For this promotion, sorting the games into equal-width buckets by match length, we have:

| Match length | LZ243 wins | LZ242 wins | Win ratio (243:242) | winrate_reverse_calc |
|---:|---:|---:|---:|---:|
| 80 | 0 | 0 | - | - |
| 100 | 0 | 6 | 0.00% | 56.64% |
| 120 | 14 | 4 | 350.00% | 57.45% |
| 140 | 14 | 9 | 155.56% | 56.53% |
| 160 | 25 | 12 | 208.33% | 56.27% |
| 180 | 21 | 19 | 110.53% | 55.03% |
| 200 | 24 | 26 | 92.31% | 55.37% |
| 220 | 18 | 16 | 112.50% | 56.85% |
| 240 | 17 | 14 | 121.43% | 57.48% |
| 260 | 26 | 18 | 144.44% | 57.92% |
| 280 | 28 | 19 | 147.37% | 57.55% |
| 300 | 27 | 20 | 135.00% | 56.52% |
| 320 | 15 | 13 | 115.38% | 55.56% |
| 340 | 8 | 6 | 133.33% | 58.82% |
| 360 | 0 | 0 | - | 66.67% |
| 380 | 1 | 1 | 100.00% | - |
| 400 | 1 | 0 | - | - |
| 420 | 0 | 0 | - | - |

No. 243 has excellent performance in short matches, which might indicate the opponent has a ladder problem or isn't good at some life-and-death sequences. I wasn't sure how good 243 is when the game goes deeper, so I calculated winrates from the bottom up (400-300-200-100), and at each level the winrate is 55% or above! I see a good player, quick and steady. So this is No. 243 at first glance.
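For what it's worth, here is how I reconstruct the winrate_reverse_calc column (my own reading, which is also questioned further down the thread): the cumulative winrate over all games at least as long as the row's bucket, accumulated from the bottom of the table upward.

```python
def reverse_cumulative_winrate(rows):
    """rows: (bucket_length, new_net_wins, old_net_wins), shortest bucket first."""
    out, w_new, w_old = [], 0, 0
    for length, a, b in reversed(rows):   # walk from the longest games upward
        w_new, w_old = w_new + a, w_old + b
        rate = w_new / (w_new + w_old) if w_new + w_old else None
        out.append((length, rate))
    return list(reversed(out))

# e.g. the 100-move bucket of the first table gives 239/(239+183) ~= 56.64%
```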

For the re-match, sorting the games into equal-width buckets by match length, we have:

| Match length | LZ242 wins | LZ243 wins | Win ratio (242:243) | winrate_reverse_calc |
|---:|---:|---:|---:|---:|
| 80 | 0 | 0 | - | - |
| 100 | 0 | 1 | 0.00% | 51.72% |
| 120 | 3 | 2 | 150.00% | 51.90% |
| 140 | 5 | 2 | 250.00% | 51.76% |
| 160 | 9 | 13 | 69.23% | 51.26% |
| 180 | 14 | 15 | 93.33% | 52.16% |
| 200 | 19 | 11 | 172.73% | 52.65% |
| 220 | 19 | 6 | 316.67% | 51.02% |
| 240 | 8 | 17 | 47.06% | 47.37% |
| 260 | 8 | 18 | 44.44% | 50.00% |
| 280 | 22 | 18 | 122.22% | 54.17% |
| 300 | 19 | 21 | 90.48% | 53.75% |
| 320 | 10 | 9 | 111.11% | 60.00% |
| 340 | 8 | 5 | 160.00% | 66.67% |
| 360 | 3 | 2 | 150.00% | 75.00% |
| 380 | 2 | 0 | - | - |
| 400 | 1 | 0 | - | - |
| 420 | 0 | 0 | - | - |

Calculating winrates in reverse order, we can see No. 242 may not be as bad in the second half of matches as it looked before, but No. 243 played better in the first half.

Is No. 243 a lucky promotion? I have to say it has some real skills.

nanzi commented 5 years ago

> Minigo's cross-evaluations have shown that LZ gets a large Elo improvement from one net to the next, but little improvement delta past one net. A way to improve generalised improvement is to match the test net against the current net and the previous 2 nets. [...]

I agree with @jillybob that we should modify the promotion schedule.

Beat them all!

tapsika commented 5 years ago

I'm not sure how you interpret "winrate_reverse_calc", but if, as it looks from the first table, this is the cumulative winrate from the bottom up, then something seems wrong with the top of the 2nd table (the real result was 48%).

These two rerun results cannot be taken too seriously yet, but they seem to indicate that promotions currently happen randomly, almost completely from luck. In that case neither of these suggestions seems promising; they may only make promotions rarer. Sounder promotions are achievable by other, simpler means as well: more games, or more visits (the latter increases the differences between nets, so it only helps if there ARE differences to select on).
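To put numbers on "more games" (a back-of-envelope sketch of my own): near 50% the winrate sd is about 0.5/sqrt(n), so the games needed for a true edge to stand about 2 sd clear of noise grow quadratically as the edge shrinks.

```python
def games_needed(edge, sds=2.0):
    # near 50% the winrate sd is about 0.5 / sqrt(n), so n = (sds * 0.5 / edge)**2
    return (sds * 0.5 / edge) ** 2

print(games_needed(0.05))   # ~400 games resolve a 5-point edge at ~2 sd
print(games_needed(0.02))   # ~2500 games are needed for a 2-point edge
```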

But are there? IMO the problem is not that some random lucky nets are promoted; that is just the consequence of not having truly better nets among the candidates. Normally the latter should be the more common kind of promotion (good nets need less luck to pass). So this seems to be more of a problem with selfplay quality or training.

I've heard opinions that this is just the 40b plateau already, but my personal feeling is that LZ has not reached AGZ level yet, so there must be more potential in 40b. And these problems with slower progress seem to have started around the time batching was enabled. Remember the HUGE (!) difference in match statistics (forking speed) between around 215 and around 220? Something significantly changed there wrt matches, presumably in selfplay as well. And IIRC that was around the time batching and LCB came into the picture.

l1t1 commented 5 years ago

To jump out of a local minimum, shall we force-promote the weight that seems best if no weight has passed after a large number (for example 200k) of self-play games?

lcwd2 commented 5 years ago

I have done some small-scale experiments on a 9x9 board to check the relationship between the number of games and the win%. I only used games from the current best model for the experiment, instead of a fixed moving window including games from previous models as in AZ. The relationship seems to be nicely linear, at least up to the 55% mark. In other words, we cannot expect an upgrade until there are sufficient self-play games on the current best model. This seems logical. If this is really the case, I wonder whether older games really contribute to or stabilize an upgrade.

The only genuine upgrade between lz242 and lz247 is lz245, which was promoted with around 200K games. The other upgrades are either fake or remedial. This probably suggests that if the upgrades are genuine, we can reduce the moving window to 200K games to focus more on learning from the current model. The AZ choice of a 500K window might have compensated for the uncertain 55% upgrades, where older data is needed for remedial upgrades after the fake ones. But a fake upgrade will create a bit of fake data that can smokescreen the training!
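A minimal sketch of the reduced-window idea (sizes illustrative; the real training pipeline is more involved than this):

```python
from collections import deque

WINDOW_GAMES = 200_000   # proposed here, vs an AZ-style ~500K-game window

training_window = deque(maxlen=WINDOW_GAMES)   # oldest games fall off automatically

def add_selfplay_game(game):
    """Append a finished self-play game; training batches sample from this window."""
    training_window.append(game)
```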

Nevertheless, I don't have sufficient resources to carry out large scale experiments.

l1t1 commented 5 years ago

CGOS: LZ vs ELFv2 series

---------- ELFV2_MATCH.TXT
Engine                  Rating   Wins/Games   Win%
LZ_241_466f_p400        3299     12 / 32      37.50
LZ_242_832f_p400        3297     72 / 203     35.47
LZ_247_lad_c5_p400      3289     16 / 49      32.65
LZ_245_3691_p400        3249     28 / 75      37.33
LZ_243_ece8_p400        3238     15 / 43      34.88
LZ_244_2bae_p400        3236     30 / 93      32.26
LZ_246_251e_p400        3222     10 / 16      62.50
LZ_240_0e17_p400        3197     13 / 17      76.47
LZ_247_901e_p400        3195     31 / 76      40.79

LZ_238_fe85_p400        3283     50 / 112     44.64
LZ_239_3bd9_p400        3279      7 / 17      41.18
LZ_237_657a_p400        3271      7 / 27      25.93
LZ_232_06e0_p400        3271     59 / 136     43.38
LZ_235_a4f5_p400        3263     10 / 29      34.48
LZ_233_16d9_p400        3246     48 / 107     44.86
LZ_234_ac9b_p400        3223     74 / 169     43.79
LZ_236_1d93_p400        3219     43 / 84      51.19
LZ_231_f178_p400        3219     37 / 78      47.44
LZ_230_a954_p400        3217     13 / 38      34.21

LZ vs KataGo series

---------- KATA_MATCH.TXT
Engine                  Rating   Wins/Games   Win%
LZ_242_832f_p400        3294     57 / 170     33.53
LZ_247_lad_c5_p400      3288      6 / 42      14.29
LZ_245_3691_p400        3250     42 / 116     36.21
LZ_244_2bae_p400        3234     40 / 114     35.09
LZ_243_ece8_p400        3233     14 / 31      45.16
LZ_246_251e_p400        3221      6 / 19      31.58
LZ_247_901e_p400        3201     18 / 55      32.73

l1t1 commented 5 years ago

To jump out of a local minimum, shall we force-promote the weight that seems best (9b2bef8e, 54.37% vs lz247) if no weight has passed after a large number (for example 200k) of self-play games?

dbosst commented 5 years ago

I didn't know where to put this but I'll put it here in case anyone finds it interesting:

I often skim the kifu of test matches for new passing networks, looking for games won as Black by early resignation (those show more variation in the opening moves):

http://zero.sjeng.org/viewmatch/d2357ef947ecad2b7d4a1a5d914a5e0f80bc32ba27d20b21e1b40341ec988fb1?viewer=wgo

This caught my eye: in moves 125-141, Black kills White, making a 6-point nakade. I don't think I've seen LZ do this before, but I don't go through every test match either.

(As a weak amateur: by move 122, if I were White, I also wouldn't be too worried about my shape at the bottom and would probably just end the ko.)