glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess ** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Training progress gen3 #100

Closed Error323 closed 6 years ago

Error323 commented 6 years ago

gen3

This time I properly interleaved the chunks and we're back to our trusty sudden drop in MSE loss. I made 4 passes through 99'080 chunks, resulting in 4'954'000 datapoints. Here are the results of two separate 50-game matches of gen3 vs gen2, and I also played it against gen1 (the seed net) for good measure.

Score of lc_gen3 vs lc_gen2: 48 - 1 - 1  [0.970] 50
Elo difference: 603.86 +/- nan

Score of lc_gen3 vs lc_gen2: 43 - 5 - 2  [0.880] 50
Elo difference: 346.12 +/- 172.18

Score of lc_gen3 vs lc_seed: 48 - 0 - 2  [0.980] 50
Elo difference: 676.08 +/- nan
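
For reference, these Elo differences are just the logistic transform of the raw match scores; a quick sketch in plain Python (not from the project code):

    import math

    def elo_diff(score):
        """Elo difference implied by a match score in (0, 1), logistic model."""
        return -400 * math.log10(1 / score - 1)

    print(elo_diff(0.970))  # ~603.9, the gen3 vs gen2 result above
    print(elo_diff(0.880))  # ~346.1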

The network is uploaded and machines are crunching towards the next generation.

jkiliani commented 6 years ago

@Error323 Could you please set http://162.217.248.187/networks so it allows manually downloading archived nets? I'd like to do a strength test for the new net, but the client still deletes old nets automatically so I don't have de1ddecb anymore at the moment...

Error323 commented 6 years ago

@jkiliani let's ask @glinscott. It's not my server, I don't know what the bandwidth limitations and costs are.

jkiliani commented 6 years ago

Have you decided yet what you are going to use for training window size? With gen3 games, added to gen2 and seed games plus the random games you generated, we're probably going to end up above 250,000 games, which is the window Leela Zero uses. At that point, it could be useful to start windowing out the older games.

Edit: No rush about the server, you already tested it enough. But @glinscott, the .pgn viewer suddenly doesn't open for me anymore on the computer, just on Android. Was anything changed with the game viewer?

Error323 commented 6 years ago

> Have you decided yet what you are going to use for training window size? With gen3 games, added to gen2 and seed games plus the random games you generated, we're probably going to end up above 250,000 games, which is the window Leela Zero uses. At that point, it could be useful to start windowing out the older games.

Through some experimentation :) I guess a binary search through the window size.

jkiliani commented 6 years ago

I'm currently seeing a lot of black wins in the self-play on my computer... how do the initial statistics look for 9b568ab2? Did the color imbalance even out or reverse?

Error323 commented 6 years ago

9b568ab2

jkiliani commented 6 years ago

So game length continuing to shorten, draws getting even more rare, color imbalance reducing. Looks great!

Zeta36 commented 6 years ago

Yes, this looks very promising!!

kiudee commented 6 years ago

I pitted this network at 100k playouts against Stockfish Level 1 on Lichess and it won for the first time: diagram-1

jkiliani commented 6 years ago

Cool, reminds me of when Leela Zero for the first time beat IdiotBot (30k) (https://www.reddit.com/r/cbaduk/comments/7f8nsu/lz_265k_vs_idiotbot/)

Do you agree that the value head already seems pretty well differentiated now toward winning material, but the policy head still has a lot of trouble finding the right moves to capture and protect pieces?

Error323 commented 6 years ago

Yeah, a part of me thinks we should've used the move encoding described in the AlphaZero paper. Our flat representation loses spatial information from the convnets. It'll still work though, and it reduces the chunk file size.

jjoshua2 commented 6 years ago

I thought we were using the same inputs as AZ. Why aren't we? I don't know what spatial information from convnets means.

glinscott commented 6 years ago

> @Error323 Could you please set http://162.217.248.187/networks so it allows manually downloading archived nets? I'd like to do a strength test for the new net, but the client still deletes old nets automatically so I don't have de1ddecb anymore at the moment...

@jkiliani done -- you can now download the networks from the networks page.

glinscott commented 6 years ago

> I thought we were using the same inputs as AZ. Why aren't we? I don't know what spatial information from convnets means.

@jjoshua2 the inputs the AZ team used are significantly larger. They both represent the same information; the AZ paper even mentions they tried the smaller representation, it just took slightly longer to converge. I figured for us non-TPU users, it might be better to start with this one :).
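
To make the size difference concrete, here is a rough sketch comparing the two move representations. The 8x8x73 plane encoding is the one described in the AlphaZero paper; the flat move count below is only an assumption (the exact number depends on how promotions are enumerated), not a figure taken from this codebase:

    # Rough comparison of policy-output sizes (sketch only).
    AZ_PLANES = 73                       # 56 queen-type moves + 8 knight moves + 9 underpromotions
    az_policy_size = 8 * 8 * AZ_PLANES   # 4672 outputs, keeps the per-square board geometry
    flat_policy_size = 1924              # assumed size of a flat from->to move list

    print("AZ plane encoding:", az_policy_size)
    print("Flat move list:   ", flat_policy_size)

The flat list is smaller (hence the smaller chunk files mentioned above), but it discards the spatial layout that the convolutional layers could otherwise exploit.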

jjoshua2 commented 6 years ago

So we might run out of GPU memory with the full inputs? And we're making a compute/memory tradeoff towards more compute?


glinscott commented 6 years ago

@jjoshua2 it should be both less compute and less memory. It's just slightly harder for the network to learn.

jkiliani commented 6 years ago

I'm trying to repeat @kiudee's experiment with cutechess_cli. After some experimentation, I used

./cutechess-cli -rounds 100 -tournament gauntlet -concurrency 2 -pgnout SF0.pgn \
 -engine name=lc_gen3 cmd=lczero arg="--threads=1" arg="--weights=$WDR/gen2-64x6.txt" arg="--playouts=800" arg="--noponder" arg="--noise" tc=inf \
 -engine name=sf0 cmd=stockfish_x86-64 option.Threads=1 option."Skill Level"=1 tc=40/20 \
 -each proto=uci

to set up Stockfish in a hopefully properly handicapped fashion. Does anyone here have experience with using a lowered-skill-level Stockfish with cutechess_cli? When I set tc=inf, Stockfish seemed stuck, even though I expected that it would move in fixed time by setting Skill Level...

I canceled the match at 6-0 in SF's favour since I'm not sure the handicap is actually working properly, in which case the match would be pointless...

kiudee commented 6 years ago

Just for reference, here are the levels and time controls lichess is using:

    AI level 1: skill 3/20, depth 1, 50ms
    AI level 2: skill 6/20, depth 2, 100ms
    AI level 3: skill 9/20, depth 3, 150ms
    AI level 4: skill 11/20, depth 4, 200ms
    AI level 5: skill 14/20, depth 6, 250ms
    AI level 6: skill 17/20, depth 8, 300ms
    AI level 7: skill 20/20, depth 10, 350ms
    AI level 8: skill 20/20, depth 12, 400ms

The time control is very short and in ms per move.

jkiliani commented 6 years ago

I think it's probably too early for matches against other engines, at least on 800 playouts. It was likely the 100k playouts you used which gave Leela Chess enough of a boost to win against a handicapped Stockfish. I'll just try this again in a few more network generations.

Error323 commented 6 years ago

P.S. I began training gen4 a few hours ago, using a window of 100K chunks, meaning the first rng chunks are no longer used now.
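
A minimal sketch of what such a sliding window could look like, assuming chunk files are named so that lexicographic order matches creation order (the paths are hypothetical):

    import glob

    WINDOW = 100000  # keep only the newest 100K chunks

    chunks = sorted(glob.glob("data/train_*.gz"))  # hypothetical chunk layout
    window = chunks[-WINDOW:]                      # the oldest (rng) chunks fall out of the window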

Error323 commented 6 years ago

:scream: :flushed: I need to step up my training game, our input rate is increasing. Tomorrow night I'm gonna revamp.

jkiliani commented 6 years ago

The game length statistic of decisive games for gen3 is starting to look bimodal; any ideas what this could represent?

Error323 commented 6 years ago

Could be the result of https://github.com/glinscott/leela-chess/commit/8b42445522fda80090588f91e58a579d355f7366

killerducky commented 6 years ago

@Error323 why does the MSE/Policy loss start over at a high value? Are you not starting from the previous best network weights?

Error323 commented 6 years ago

Indeed, it's also part of Friday night's code night. I want to start with a high lr but with previous weights.

The goal for now is to constantly take the last 100K chunks and use the previous weights. Start at lr 0.02 and decay it by a factor of 0.1 after N steps. Then upload and repeat.
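
A minimal sketch of that schedule in plain Python (not the actual training code; the step boundaries are placeholders):

    def learning_rate(step, base_lr=0.02, decay=0.1, boundaries=(40000, 80000)):
        """Step decay: start at base_lr and multiply by `decay` at each boundary."""
        lr = base_lr
        for b in boundaries:
            if step >= b:
                lr *= decay
        return lr

    # e.g. 0.02 until step 40K, then 0.002 until 80K, then 0.0002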

killerducky commented 6 years ago

Ok great. I like the idea of redoing the learning rate annealing every time. I guess the initial high LR will shake the old net out of its local maximum and then as the LR lowers it can settle into a new maximum.

Edit: More wild speculation: Maybe much later it could cause problems if shaking out of old local maximums causes it to forget knowledge it learned from games that have passed out of the training window. This could be an argument for not raising the LR. As usual it's hard to know without doing a "3 day" experiment. Assuming you have 5000 TPUs. ;-) In any case I don't think it will matter at all while the network is still so weak.

jkiliani commented 6 years ago

I think the method used by @Error323 will do just fine even later, if the reset learning rate at the start of each training run is gradually raised when the net approaches saturation. For the early training, this idea may well be the main reason for the steep progress curve so far.

Error323 commented 6 years ago

Ok, I uploaded a new neural network, but I was impatient, so its performance is not as great as the previous ones. Our game input rate is so high... I cracked under pressure :sweat_smile:

Score of lc_gen4 vs lc_gen3: 33 - 15 - 2  [0.680] 50
Elo difference: 130.94 +/- 104.97
Finished match

It's still a better net, but I should've trained for more steps (only 140K this time). Anyway, working on the online training version! It should bring a lot of relief :)

jkiliani commented 6 years ago

Not sure this is really down to the number of training steps... I think it's rather plausible that the training window now includes such a wide range of playing strength that progress is somewhat held back by the remaining rng games.

Anyway, this net is still solid progress, thank you for your efforts! Are we choosing the always-adopt approach by the way? Sooner or later, a trained net is going to regress against the previous one, should we simply use it anyway at this point?

Edit: Results of the match I ran last night:

Score of lc_gen4 vs lc_gen3: 38 - 9 - 3  [0.790] 50
Elo difference: 230.16 +/- 122.02
Finished match

It does look somewhat better for gen4 than your match, but the two results are easily within each other's error margins.

amj commented 6 years ago

@Error323 can you get your training to run fully-automatic? I.e., auto-start when there's enough new games, have the clients wait if there's too many games, etc?

(btw, first win vs level 1 -- very exciting!)

Error323 commented 6 years ago

> Anyway, this net is still solid progress, thank you for your efforts! Are we choosing the always-adopt approach by the way? Sooner or later, a trained net is going to regress against the previous one, should we simply use it anyway at this point?

@glinscott is working on the evaluation code for clients so that will be automated.

Hey @amj! It's highest on my list right now. It would be great to just be able to let it run.

amj commented 6 years ago

@Error323 Nice! I'm not super happy with my solution, it's kinda hacky. Probably the most important part of it is that I have it send me an SMS if the job falls over. Not sure where you're running it (locally? cloud?), but I strongly recommend some sort of monitoring of CPU usage or bandwidth or something that will trip if it dies. Especially if you're as impatient as I am!! :)

Error323 commented 6 years ago

What makes it crash then? Are you killing/restarting the training job every time?

I'm running the training at home using a single GPU (1080ti) atm. But I have two, so I can almost double the speed. I was hoping to come up with a way that would fully automate it. No job restarting. After each training session, ingest the latest 100K chunks or so, reset the learning rates and start again from the previous network.

The problem is that it requires testing and experimentation while clients are crunching, and I don't want to let their cycles go to waste :sweat_smile: I'll start out simple and semi-automatic, improving the pipeline autonomy as we go along.

amj commented 6 years ago

Anything that runs for long enough will have something happen, even if it's just host maintenance events. For me, the job itself is stable, but -- especially if a training run will take months -- you will eventually want to update the running code with some new thing that has bugs :)

grolich commented 6 years ago

@glinscott won't it actually take more compute time overall? It's less computation per eval; however, DeepMind said it takes longer to converge, not that it just takes more steps.

Is there a prohibitive memory cost to the other representation?

Actually, in a different talk David Silver gave about AlphaZero, he specifically mentioned that they measured in actual time rather than just training steps. And also that the direct representation took longer to converge (well, that appeared in the paper as well).

"A little more time" to converge might be a lot more time when we aren't aided by all their TPUs. Add to that the fact that we don't know how much "little more" time we're talking about, this may be weeks or even a lot more in our setup.

I think the idea that the current representation would be faster is based on an assumption that the process would take just a few more training steps... Which doesn't seem to be the case as far as I can see.

jkiliani commented 6 years ago

I think it's speculation that another representation would really be faster. Even DeepMind say:

> We also tried using a flat distribution over moves for chess and shogi; the final result was almost identical although training was slightly slower.

That doesn't sound so bad; if it strongly affected learning, they would have worded this differently. In any case, if at some later point we decided to change representation, this could be done as part of a bootstrap with an enforced client update. So far, what we are using seems to work very well, easily visible from the Elo progress compared to Leela Zero. And as long as it works, why change it?

Error323 commented 6 years ago

@amj that's true :). Well I'll be hawking the system anyway and when I stop hearing its fans going I know something is wrong :smile:

> And as long as it works, why change it?

We could experiment some with a lower number of output planes going into each head. Currently we're at 32, which results in huge FC layers. We could try 16 or maybe even 8.
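
Rough numbers for that FC layer, assuming the head conv reduces to the given number of planes on an 8x8 board, followed by a single fully connected layer to a flat policy of 1924 outputs (both the 1924 figure and the exact head layout are assumptions here, not taken from the code):

    POLICY_OUTPUTS = 1924  # assumed flat policy size
    for planes in (32, 16, 8):
        fc_inputs = planes * 8 * 8
        params = fc_inputs * POLICY_OUTPUTS
        print(planes, "planes ->", fc_inputs, "inputs -> ~%.1fM FC weights" % (params / 1e6))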

> So far, what we are using seems to work very well, easily visible from the Elo progress compared to Leela Zero.

This is of course relative performance against itself. We can only know for sure if we also train the other network architecture and play them against each other at the same level. I think the other output representation converges in fewer training steps (it still requires a lot of simulations, which are the bulk of the work), but we are also missing details about how many planes are going into the heads. This was not specified by DeepMind.

Back to gen4: I resumed training after uploading the current net; the rms on the value head lowered and the accuracy increased some more. Then I played it (step 240K) against gen3 again, and performance was worse (losing), as if it got stuck in some local optimum? Interesting... (screenshot from 2018-03-09 09-32-36)

grolich commented 6 years ago

@jkiliani I definitely agree that if what is currently done works, we should just keep doing it.

Just wondering if a "small" difference in time for them translates into weeks for us, given that it only took hours for them in total, and the comparison gcp made about the time to run on standard hardware at the start of the Leela Zero project (granted, there have been many optimizations since then).

Agreed though, as long as the current solution works, no reason to change it.

Just a potential issue to keep in mind for the future.

Error323 commented 6 years ago

Generation 5:

Score of lc_gen5 vs lc_gen4: 70 - 25 - 5  [0.725] 100
Elo difference: 168.40 +/- 75.60

Trained in 30K steps using the online version, bootstrapped from gen4. Learning rate boundaries:

Training itself is just incredibly slow (this took ~16 hrs); I'm working on optimization improvements now.

jkiliani commented 6 years ago

Very nice progress, but I'm curious about the training speed: Did the online version reduce the speed in some way? By the way, what is the oldest data still part of the window now? Still rng, or gen1 (seed)?

Error323 commented 6 years ago

Yes, online reduced the speed immensely, as it needs to parse every chunk (unzip, decode into TensorFlow format) before it can be processed by TensorFlow, and we're only using 1/8th of every chunk when opening it. So it does this every single time. The GPU is stalling like crazy while the CPUs cannot keep up. Since gen4 we're using a sliding window of 100K chunks, which is approximately 250K games. RNG has been out of the picture for 2 gens now.

When the binary version works in online mode we'll be able to reduce the ~16 hrs to ~2.5 hrs.

Edit: the nice thing about the online version is that we'll be able to apply the symmetries and make multiple passes through the chunks, utilizing our data very effectively, which means we might be able to reduce the sliding window further and climb quicker as a result.
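
A minimal sketch of the kind of input pipeline that keeps the GPU fed, using the TensorFlow 1.x tf.data API with parallel parsing and prefetching. This is not the project's code; parse_chunk is a placeholder for the real unzip/decode step:

    import tensorflow as tf

    def parse_chunk(filename):
        # Placeholder for the real unzip/decode into (planes, policy, value) tensors;
        # here it only reads the raw bytes so the sketch stays self-contained.
        return tf.read_file(filename)

    def make_dataset(chunk_files, batch_size=512):
        ds = tf.data.Dataset.from_tensor_slices(chunk_files)
        ds = ds.shuffle(len(chunk_files))               # reshuffle chunk order every pass
        ds = ds.map(parse_chunk, num_parallel_calls=8)  # decode on several CPU threads
        ds = ds.batch(batch_size)
        return ds.prefetch(4)                           # keep batches ready so the GPU is not stalled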

jkiliani commented 6 years ago

> Since gen4 we're using a sliding window of 100K chunks, which is approximately 250K games. RNG has been out of the picture for 2 gens now.

I can't quite follow here: We're at 210k games, out of which 180k are from the seed net or newer, and 16 hours ago it was probably around 20k less, i.e. only 160k games from trained networks. How can we use a window size of 250k games, but not have used rng games since gen4?

Error323 commented 6 years ago

Yeah I got the 250K games from using an average of 80 plies per game. That average is incorrect :sweat_smile: sorry. I just checked, the server now has 211'268 chunks.

Edit: To be clear, we did still drop the rng games 2 gens ago.

glinscott commented 6 years ago

Graph updated, the climb continues :). Congrats!

jkiliani commented 6 years ago

Statistics are still weak, but we now seem to have a slight reverse color bias: More black wins than white.

In any case, it looks like the network training is sufficient to rectify color imbalances all by itself, even without exploiting symmetries in training...

jkiliani commented 6 years ago

I made another match, gen5 against gen3:

Score of lc_gen5 vs lc_gen3: 86 - 11 - 3  [0.875] 100
Elo difference: 338.04 +/- 107.79

Consistent with, if slightly higher than, the rating difference from @Error323's match. I also made another attempt to match gen5 against the weakest setting I could come up with for Stockfish:

./cutechess-cli -rounds 100 -tournament gauntlet -concurrency 2 -pgnout SF0.pgn \
 -engine name=lc_gen5 cmd=lczero arg="--threads=1" arg="--weights=$WDR/gen5-64x6.txt" arg="--playouts=800" arg="--noponder" arg="--noise" tc=inf \
 -engine name=sf_lv1 cmd=stockfish_x86-64 option.Threads=1 option."Skill Level"=1 tc=40/1 \
 -each proto=uci

No luck yet, I cancelled that match at 12 straight losses for Leela Chess. Shouldn't be much longer though until there are at least occasional draws and wins, since @kiudee demonstrated that LCZero can win against a weak Stockfish with 100k playouts.

Error323 commented 6 years ago

@jkiliani could you try 100 games against gen4 instead of gen3? If that's consistently higher then I think my OpenCL errors are to blame. I should probably lower the concurrency at that point.

zz4032 commented 6 years ago

There is a Level 0 in Stockfish: `option name Skill Level type spin default 20 min 0 max 20`. It's about 1000 Elo, based on a tournament between different level versions I did once.

jkiliani commented 6 years ago

@Error323 I cannot rule out the effect you describe, but at this point it seems rather unlikely to me. I think the discrepancy is more likely found in the rating of gen4 vs gen3, since I did a reference test (https://github.com/glinscott/leela-chess/issues/100#issuecomment-371645511) that showed better performance for gen4, but still within the error margin of your test. Since each of our gen4 vs gen3 matches was only 50 games, the null hypothesis seems to be the most plausible explanation here. Out of curiosity, I tried the Fisher exact test (http://www.socscistatistics.com/tests/fisher/Default2.aspx) on our gen4 vs gen3 matches: it shows a significance level of only 0.24 that the samples are dependent, so we simply have too few games to conclude anything from this test.
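
For reproducibility, a small sketch of that Fisher test on the two gen4 vs gen3 results, assuming draws are simply dropped from the 2x2 table (scipy):

    from scipy.stats import fisher_exact

    table = [[33, 15],   # Error323's match: 33 wins, 15 losses
             [38,  9]]   # my match: 38 wins, 9 losses
    odds_ratio, p_value = fisher_exact(table)
    print(odds_ratio, p_value)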

There's not much point in trying to prove effects of OpenCL errors unless we're willing to invest into much better statistics than this, i.e. at least several hundred games. Personally I find it unlikely that OpenCL errors would affect strength while the nets are still so weak, but if you think it's likely, I'd recommend you compile a CPU only version of lczero, and match it against the OpenCL version using your supervised network for both sides. This should have the best chance of detecting such a bias if it's there. Comparing matches of different nets between a CPU only setup and an OpenCL setup may fail to prove the effect, since the OpenCL errors should affect the strength of both sides.

At some point we should probably use more robust statistics for the match graph, but for now I don't see the harm, since every net so far was clearly stronger than the last, and we're going to end up with an inflated Elo progression in either case.

Error323 commented 6 years ago

Ok, we're experiencing some setbacks with respect to the pipeline V2, see #104. I'll train a new net in manual mode for now to make sure people stay invested.