glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess ** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Is it possible to train the value output in a supervised (or even self-play) manner as AZ paper explains? #20

Closed Zeta36 closed 6 years ago

Zeta36 commented 6 years ago

I've been working for a while in this kind of AZ projects. I started even this project some months ago: https://github.com/Zeta36/chess-alpha-zero

And I have a doubt I'd like to ask you. It's about the training of the value output. If I backprop the value output always with an integer (-1, 0, or 1), the NN should quickly get stuck in a local minimum, ignoring the input and always returning the mean of these 3 values (in this case 0). I mean, as soon as the NN learns to always return near-0 values ignoring the input planes, there will be no more improvements, since it will have a high accuracy value (>25%) almost immediately after some steps.

In fact I did a toy experiment to confirm this. As I mentioned, the NN was unable to improve after reaching 33% accuracy (~0.65 loss in mean square). And this makes sense if the NN is always returning 0 (very near-zero values). Imagine we introduce a dataset of 150 games: ~50 are -1, ~50 are 0 and ~50 are 1. If the NN learns to always say near 0, we get an instant loss (mse) of 100/150 ~ 0.66 and an accuracy of ~33% (1/3).
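
A quick way to check that arithmetic (a minimal NumPy sketch of the collapsed, always-predict-0 value head on the balanced toy dataset above; the numbers are just the ones from my example):

import numpy as np

# Balanced toy dataset: 50 losses, 50 draws, 50 wins (z in {-1, 0, 1}).
z = np.array([-1] * 50 + [0] * 50 + [1] * 50, dtype=np.float32)

# A value head that has collapsed to always predicting the mean (0).
v = np.zeros_like(z)

mse = np.mean((v - z) ** 2)           # 100/150 ~ 0.667, the "instant loss"
accuracy = np.mean(np.round(v) == z)  # 50/150 ~ 0.333, only the draws match
print(mse, accuracy)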

How the hell did DeepMind train the value network with just 3 integer values to backpropagate?? I thought the tournament selection (the evaluation worker) was involved in helping to overcome this local minimum (stabilizing the training), but in their latest paper they say they removed the eval process (??)... so I don't really know what to think.

I don't know either whether self-play can help with this issue. In the end we are still back-propagating an integer from a domain of just 3 values.

Btw, you can see in our project at https://github.com/Zeta36/chess-alpha-zero that we got some "good" results (in a supervised way), but I suspect it was all thanks to the policy network guiding the MCTS exploration (with a value function always returning near-0 values).

What do you think about this?

Zeta36 commented 6 years ago

Accuracy of 93% with a dataset of 500k movements, and even so it loses 28 times against a random network? It sounds a little bit strange, isn't it? In https://github.com/Zeta36 the random network was 100% defeated much (much) more quickly.

Error323 commented 6 years ago

Yeah I agree, it's strange. I'll investigate. @Zeta36 I don't see anything wrong, I parsed the datasets and they seem fine. I guess it just has too many gaps in its knowledge that leave it very vulnerable. Or maybe there's a problem with the UCTSearch? I'm gonna increase the dataset.

glinscott commented 6 years ago

@Error323 I'm playing a match of your 93% network against the network I trained that had reached 33% accuracy above (uploaded to the lczero-weights repo as best_supervised_5_64.txt.gz), and currently the 33% accuracy network is winning handily (11-0-5). Also, I haven't tried the 33% network against the random network, but the previous iteration of this network (at about 20% training accuracy) won 95-1-4 against the random network.

I'm not sure what is going wrong though. Perhaps overfitting? But that seems unlikely if the validation split is working.

Error323 commented 6 years ago

Mja, it felt too good to be true. I think the dataset I generated contains 500K positions from full games. The sampling method that generated the binary proto data is bad. There are approx 70*80K = 5.6M positions in the set, of which I sampled only the first 500K, with all positions from those games.

We should split the *.gz chunk files into p, (1-p) and then sample pN positions for training and (1-p)N for testing in a truly random fashion.
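
Something like this, roughly (just a sketch of the idea, assuming one chunk file per game in a single directory; the paths and the 0.75 split are placeholders):

import glob
import random

chunks = glob.glob("data/*.gz")  # hypothetical layout: one chunk per game
random.shuffle(chunks)           # shuffle whole games so train/test never share a game
p = 0.75
split = int(p * len(chunks))
train_chunks = chunks[:split]
test_chunks = chunks[split:]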

Error323 commented 6 years ago

OK, I messed up :( The reason the accuracy kept going up is because the average accuracy was calculated incorrectly: https://github.com/glinscott/leela-chess/blob/master/training/tf/tfprocess.py#L196-L204

glinscott commented 6 years ago

@Error323 ah, good catch. But that's just a constant multiplier, it shouldn't cause a runaway type of effect.

Also, I trained another network up from scratch on the 150k games from gcp; it reached better accuracy than the previous network (they were both measured with the wrong code), but it plays much weaker than the previous best. So something interesting is going on here; overfitting is my suspicion.

Here were the results:

step 36000, policy loss=2.77798 mse=0.055978 reg=0.369072 (2820.14 pos/s)
step 36000, training accuracy=41.2207%, mse=0.277543

And then against the current best:

Score of lc_new vs lc_base: 10 - 57 - 33  [0.265] 100
ELO difference: -177

gcp commented 6 years ago

step 36000, policy loss=2.77798 mse=0.055978 reg=0.369072 (2820.14 pos/s)
step 36000, training accuracy=41.2207%, mse=0.277543

Note the training MSE=0.05 vs the test MSE=0.27. So this is a total overfit.

https://github.com/glinscott/leela-chess/blob/master/training/tf/tfprocess.py#L84

Try lowering the factor for the mse_loss by a factor of 10 there, or increase the regularizer there 10-fold.

Error323 commented 6 years ago

I think the runaway was caused by consecutive positions, which were captured perfectly by the input history planes...

Yeah the MSE is important to get right as it guides the MCTS.

kiudee commented 6 years ago

To expand on what @gcp said, for AlphaGo Zero the DeepMind team wrote:

Parameters were optimized by stochastic gradient descent with momentum and learning rate annealing, using the same loss as in equation (1), but weighting the MSE component by a factor of 0.01.

So it appears they determined the effect on the total loss was too high. In the AlphaZero paper they also wrote that they reused many of the parameters for chess.
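
For reference, that is equation (1) from the AlphaGo Zero paper with the value term down-weighted as quoted (z is the game outcome, v the value head output, \boldsymbol{\pi} the search probabilities, \mathbf{p} the policy head output, c the L2 regularization constant):

l = 0.01\,(z - v)^2 \;-\; \boldsymbol{\pi}^{\top} \log \mathbf{p} \;+\; c\,\lVert \theta \rVert^2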

glinscott commented 6 years ago

Thanks @gcp and @kiudee. I'm trying with mse at 0.1 weight now.

glinscott commented 6 years ago

Well, 0.1 weight still seems to be overfitting a bit. But it seems better now. Still, the newly trained networks are losing to my original one.

step 169900, policy loss=2.77919 mse=0.0678728 reg=0.33725 (3426.75 pos/s)
step 170000, policy loss=2.78106 mse=0.0682819 reg=0.337243 (2552.92 pos/s)
step 170000, training accuracy=27.3958%, mse=0.0758204

Match vs best:

Score of lc_new vs lc_base: 7 - 24 - 25  [0.348] 56

[image]

Zeta36 commented 6 years ago

If the trained network loses against the random one there has to be a profound problem with the model (or with the MCTS implementation). I don't think this issue is due to over-fitting.

Also I think the mse error is scaled down by a factor of 4 so the real one would be: 4*0.08 ~ 0.33, isn't it?

glinscott commented 6 years ago

@Zeta36 it's not losing against random, it defeats it nearly 100%. It's losing against the previous best network I did supervised training with (but on about half the games).

Zeta36 commented 6 years ago

I'm sorry, I misunderstood what you said.

Error323 commented 6 years ago

@glinscott which weights are you using as reference? Then I can try some tests as well and we can compare them properly. https://github.com/glinscott/lczero-weights/blob/master/best_supervised_5_64.txt.gz those?

kiudee commented 6 years ago

I started experimenting with the training of the network and noticed that it is very slow (~4000 steps per hour) on my machine (6 core i7 and GTX 1080). It appears the GPU is almost idle for most of the time, while the worker threads use 100% of my CPUs. @glinscott which machine(s) are you using for training and what wallclock time did you need for the run posted before?

To preempt questions about whether tensorflow is using the GPU:

2018-01-18 09:54:25.559595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8855
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 6.96GiB
2018-01-18 09:54:25.559626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Test parse passes
2018-01-18 09:54:26.139863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)

glinscott commented 6 years ago

@glinscott which weights are you using as reference? Then I can try some tests as well and we can compare them properly. https://github.com/glinscott/lczero-weights/blob/master/best_supervised_5_64.txt.gz those?

Yup, it's the ones I uploaded as best_supervised_5_64.txt.gz.

glinscott commented 6 years ago

@glinscott which machine(s) are you using for training and what wallclock time did you need for the run posted before?

@kiudee I had increased the default batch size by a lot to 2048, but I think this may have been a mistake. You might want to try reducing it to 512 or 256. I'm training on a Titan X currently, and it took about a day to train up to 170k steps at 512 batch size.

glinscott commented 6 years ago

I've added the --randomize option in now, and played training games last night with the current best network. I'm going to try training a new network based on those training games, and see how we do :).

Error323 commented 6 years ago

Ok, I'm creating a massive dataset from KingBase2017 of 1.5M games, with 500K black wins, 500K white wins and 500K draws. I used pgn-extract as suggested by @gcp, which is excellent. I'm beginning to think unbalanced datasets are indeed a problem for learning. We may need to remember that when generating games from self-play.

pgn-extract pgn/mega.pgn -Tr1-0 --nobadresults -D -7 --stopafter 500000 -# 5000 -s
pgn-extract pgn/mega.pgn -Tr0-1 --nobadresults -D -7 --stopafter 500000 -# 5000 -s
pgn-extract pgn/mega.pgn -Tr1/2-1/2 --nobadresults -D -7 --stopafter 500000 -# 5000 -s

Now generating the chunks.

gcp commented 6 years ago

I don't really buy that unbalanced sets are hard to learn from (I mean, they make it harder, for sure, but this should still be trivial stuff for neural nets). But having 1.5M games will improve the overfit problem, though your source data will be a bit weaker.

FWIW the https://sjeng.org/dl/sftrain_clean.pgn.xz dataset is updated to about ~225k games now.

Error323 commented 6 years ago

Well, I agree, given enough diverse data. But the current set is still quite small given the game complexity. The minimax game tree is sparsely covered by Stockfish vs Stockfish games, and a significant portion of them are draws. Also, in the 150K set about 10K are duplicates according to pgn-extract.

gcp commented 6 years ago

Also, in the 150K set about 10K are duplicates according to pgn-extract.

Ah, good to know, I'll switch the book to a uniform probability one.

Error323 commented 6 years ago

Maybe it would also be interesting to randomize the depth of the stockfish opponent? Though not using a uniform distribution.

gcp commented 6 years ago

Maybe it would also be interesting to randomize the depth of the stockfish opponent?

It's the latest Stockfish playing itself at a fast time control; the idea was to get a ton of the highest-quality games possible (in a reasonable amount of time). I'm assuming 10+0.1 with the latest Stockfish on fast hardware is better than 2500-rated GMs at this point. Using a uniform book doesn't really help the variation, so I made the book a bit bigger, but it may be necessary to strip dupes afterwards anyway.

Zeta36 commented 6 years ago

People, I don't want to bother you again with the same thing, but I'm still working on a way to train a supervised value head, and as I said it's impossible to prevent the model from sinking into the easy function of ignoring the input and always saying 0 (the mean, with 33% accuracy):

Epoch 1/1
  256/155602 [..............................] - ETA: 16732s - loss: 1.0572 - acc: 0.3359
  512/155602 [..............................] - ETA: 16128s - loss: 0.9843 - acc: 0.3555
  768/155602 [..............................] - ETA: 17647s - loss: 0.9229 - acc: 0.3646
 1024/155602 [..............................] - ETA: 17876s - loss: 0.8946 - acc: 0.3633
 1280/155602 [..............................] - ETA: 17758s - loss: 0.8846 - acc: 0.3539
 1536/155602 [..............................] - ETA: 17762s - loss: 0.8763 - acc: 0.3496
 1792/155602 [..............................] - ETA: 17598s - loss: 0.8758 - acc: 0.3410
 2048/155602 [..............................] - ETA: 17480s - loss: 0.8709 - acc: 0.3394
 ...
 9216/155602 [>.............................] - ETA: 16232s - loss: 0.8413 - acc: 0.3332
 9472/155602 [>.............................] - ETA: 16230s - loss: 0.8406 - acc: 0.3336

The only way I can avoid this result is by over-fitting the network (i.e., memorizing the game results). But as soon as the number of movements grows, the network is unable to learn anything (and, not being able to memorize any more, it goes to the statistical 1/3 solution and stays there).

This last test I show you is the same as https://github.com/Zeta36 but without the policy head. I mean, the model is the same (l2 regularizer, residual blocks, etc.) but I removed the policy head. The loss shown above is the mse (without scaling by 4 like you do here).

The model converges quickly after a few steps and gets stuck at 33% accuracy (because 1/3 of the time y=z=0, and the mean of -1, 0, 1 is also 0).

Maybe your failure in these tests here is because the value head cannot be trained in a supervised way on a domain of 3 integer values. I know that I removed the policy head and that could be the cause of this fast (and logical) convergence, but who knows.

Error323 commented 6 years ago

Hey @Zeta36,

Could you be more specific about the size of the dataset, the exact location of the model used, and what the inputs are? Is it this one? https://github.com/Zeta36/chess-alpha-zero/blob/master/src/chess_zero/agent/model_chess.py#L62 If so, where do those 18 planes come from? It seems like you don't store history.

@gcp I had the same idea indeed. We're gonna have to experiment; having the dataset is definitely good! My worry is that the game tree will contain gaps because Stockfish vs Stockfish uses a single evaluation function. It might be interesting then to select different engine opponents?

kiudee commented 6 years ago

I am currently training the network on the latest Stockfish data with mse weight 0.01. Before chunking the games, I filtered out all the dupes using pgn-extract. This is the current state of the training: [image: loss curves]

A quick match at 40k steps against the best_supervised_5_64 network resulted in

Score of lc_new vs lc_base: 0 - 34 - 10

So, it’s not yet able to win against it. I will post an update as soon as I have new results.

Error323 commented 6 years ago

So I sampled 1.5M unique, correct games from KingBase (1/3 white wins, 1/3 black wins, 1/3 draws), from which I generated 7.71M positions by randomly selecting every 16th or 17th position (see the sketch at the end of this comment). The training set (75%) and test set (25%) have guaranteed 0 games in common and amount to ~110GiB in raw tensorflow binary format. Final accuracy 40.33%. [image: training curves] Each dip in the graph represents a learning rate reduced by a factor of 0.1, and the MSE loss weight factor was 0.01. It's currently beating best_supervised_5_64 with:

Score of lc_new vs lc_base: 67 - 14 - 19  [0.765] 100

I'm observing that each minor improvement in the NN's accuracy gives huge perf gains against the baseline weights. @glinscott would you review PR #36 now, as it is verified correct? Weights can be found here.
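
Roughly, the per-game position sampling was of this shape (a simplified sketch, not the actual chunk-generation code):

import random

def sample_positions(game_positions):
    """Pick roughly every 16th or 17th position from one game."""
    picked = []
    i = random.randint(0, 15)          # random starting phase so games don't align
    while i < len(game_positions):
        picked.append(game_positions[i])
        i += random.choice((16, 17))   # random stride breaks up position correlation
    return picked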

Zeta36 commented 6 years ago

@Error323 is the mse loss in the plot scaled down by a factor of 4? If that's the case, then the real mse would be 0.17*4 ~ 0.68, wouldn't it? It's curious: if the NN (the value head) learns to always say near 0, we'd get an instant loss (mse) of 100/150 ~ 0.66 and an accuracy of ~33% (1/3).

Error323 commented 6 years ago

@Zeta36 mse is indeed scaled by a factor of 4. We do observe a higher accuracy though (it's quite challenging playing against the network now). I'm generating a pure black/white-win dataset which I'll train using the same method to see how that performs.
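
That factor is presumably just the rescaling of the targets: if the training code compares value and outcome after mapping them from [-1, 1] to [0, 1], then

\left(\tfrac{v+1}{2} - \tfrac{z+1}{2}\right)^2 = \tfrac{(v - z)^2}{4}

so the reported mse is a quarter of the mse on the original [-1, 1] scale.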

glinscott commented 6 years ago

I'm observing that each minor improvement in the NN's accuracy gives huge perf gains against the baseline weights. @glinscott would you review PR #36 now, as it is verified correct? Weights can be found here.

@Error323 congratulations! That's great progress.

kiudee commented 6 years ago

I have finished my training run on the Stockfish games over the weekend and here are the "final" training curves: [image: TensorBoard training curves]

I took the network with the highest accuracy (888k steps) and played it against best_supervised_5_64:

Rank Name      Elo    +    - games score oppo. draws 
   1 lc_new     77   30   29   100   74%   -77   35% 
   2 lc_base   -77   29   30   100   27%    77   35% 

My parameters were:

If you want to play around with that network yourself, here are the weights: best_stockfish_5_64.zip

Error323 commented 6 years ago

Tried a deeper network with 128 filters and 10 residual blocks. It only got a few percent better, but those few percent make a huge difference: [image: 128x10 training curves]

Also, I trained 2 at the same time and I swear my system was about to lift off into space. Results against baseline:

Score of lc_new vs lc_base: 79 - 8 - 13  [0.855] 100

I've got many more results; I'm trying to see what would be a good tradeoff when learning our first network tabula rasa. Where shall I post those results?

kiudee commented 6 years ago

I think it all comes down to the rate at which we are able to produce self-play games. What is your approximate ms/move for a 10x128 network?

Zeta36 commented 6 years ago

Has anybody checked how the model plays after that huge SL training? I mean, DeepMind always reported (in Go at least) that after supervised learning their model was a really good player.

What I want to say is that (in my opinion) what matters is not so much the loss results (nor games against a baseline) but the real ELO a huge SL run is able to reach with the model you are working on.

I think before going to self-play training we should check that the model learns to play chess in an SL manner. In this sense, if the model you are working on is really good (as the loss says), after training on more than 500k movements it should be able to not blunder and to play a really good-level game.

Can anybody show over here whether the trained model plays well against you (using any chess GUI app, such as Arena: http://www.playwitharena.com/)?

I comment on this because we also got really good loss results in an SL way in https://github.com/Zeta36/chess-alpha-zero, but then the model was never able to get rid of all the blunder movements.

Maybe you are happy with the loss results and yet the model is unable to play well even after a 500k training process. If that's the case, I don't think a million self-play games can do much better.

In summary: as DeepMind pointed out, a correct model should be able to reach a really impressive ELO just with an SL training process. I do not recommend any self-play training in the meantime.

Error323 commented 6 years ago

What is your approximate ms/move for a 10x128 network?

@kiudee It was equal to the 64x5 network, except that my GPU went to 96% utilization. My system can handle 7200 forward/backward passes per second per GPU.

In summary: as DeepMind pointed out, a correct model should be able to reach a really impressive ELO just with an SL training process. I do not recommend any self-play training in the meantime.

@Zeta36 I fully agree. I'm still kind of disappointed with the performance of only 43%. I think we should stick with 64x5 and determine its real ELO before continuing.

lp-- commented 6 years ago

@Zeta36 Without randomization, with a fixed number of playouts, it produces a single game when playing against itself. The level is extremely low. It seems it picked up the pattern of how to put pieces and pawns on the board, but it doesn't understand the value of pieces, how to mate, etc.

Zeta36 commented 6 years ago

It's very hard to understand how a model trained on 500k movements with such good convergence is not able to understand the value of the pieces. Something has to be wrong.

Error323 commented 6 years ago

@Zeta36 Have you tried playing the latest version yourself? It's quite fun.

gcp commented 6 years ago

More specifically, if you play a few moves and let it go up a piece (or down a piece), does the evaluation move away from 0? 2 pieces? etc?

What if you give it a large development advantage?

gcp commented 6 years ago

I updated my dataset; there are now about 320K de-duped games. It's hard to conclude from the above, but is this giving better or worse results than human games? I guess it's due to there still being more human data (1.5M games), which is also a bit more varied; that might help a lot?

Zeta36 commented 6 years ago

@Error323 I don't have the environment available right now. Could you please paste over here an animated gif with some games? You can use this online tool: http://www.apronus.com/chess/wbeditor.php

Once you replicate the movements of the game, just click "save" and then "animate diagram". Finally you can upload the gif file in a comment here.

Zeta36 commented 6 years ago

This, for example, was our best achievement (in an SL manner) in https://github.com/Zeta36/chess-alpha-zero

In this game @Akababa (black, ~2000 elo) played against the model (white):

[animated GIF of the game]

By the way, the weights of this model are uploaded in our repo.

jkiliani commented 6 years ago

You may also take a look at this: https://github.com/gcp/leela-zero/issues/696 for a possible strength boost. In general, it looks like good search parameter tuning may make a big difference.

lp-- commented 6 years ago

This is the game it plays with @kiudee's weights:
./lczero -t 1 -w leelaz-model-888000.txt --start train --noponder -p1600

1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Qc2 O-O 5. a3 Bxc3+ 6. Qxc3 b6 7. Bg5 Bb7
8. e3 d6 9. Ne2 Nbd7 10. Qd3 h6 11. Bh4 Ne4 12. Be7 Qxe7 13. f3 Ng5 14. h4
Nxf3+ 15. gxf3 c5 16. Rg1 Kh8 17. Rxg7 Kxg7 18. d5 Kh8 19. Qc3+ e5 20. Ng3
Qxh4 21. Bd3 Rg8 22. Kf2 Rxg3 23. Rg1 Rg5+ 24. Kf1 Rxg1+ 25. Kxg1 Rg8+ 26.
Kf1 Qh2 27. Ke1 Rg1+ 28. Bf1 Kg8 29. Qd3 Nf6 30. Qf5 Qg2 31. Qd3 Qh2 32.
Qf5 Qg2 33. Qd3 Qxb2 34. Qe2 Ng4 35. fxg4 Qxa3 36. Kf2 Qb3 37. Kxg1 Qb1 38.
Qd3 Qe1 39. Kg2 e4 40. Qxe4 Qd2+ 41. Kf3 Qd1+ 42. Kf2 Qd2+ 43. Kf3 Qc1 44.
Qe8+ Kg7 45. Be2 Bxd5+ 46. cxd5 c4 47. Kf4 c3 48. g5 Qg1 49. gxh6+ Kxh6 50.
Bb5 Kg6 51. Qg8+ Kh6 52. Qxg1 c2 53. Qg5+ Kh7 54. Bd3+ Kh8 55. Bxc2 a6 56.
Qe7 Kg7 57. Qxd6 a5 58. Qb8 b5 59. d6 Kf6 60. Qxb5 Ke6 61. Qc6 Kf6 62. Qb5
Ke6 63. Qc6 Kf6 64. Qc7 Kg7 65. d7 a4 66. Bxa4 Kf6 67. Qb6+ Kg7 68. Qc7 Kf6
69. Qc8 Kg7 70. Kf5 f6 71. d8=Q Kf7 72. Qcc7+

kiudee commented 6 years ago

@gcp Using my weights trained on Stockfish games, it seems to have no concept of material. This is what happens if you gift it a knight in the beginning. The eval does not change and still favors white. To make matters worse, through searching more deeply it starts to favor a worse move.

position startpos moves e2e4 e7e5 g1f3 b8c6 f3e5
go
NN eval=0.452850
Playouts: 4618, Win: 43.42%, PV: c6e5 d2d4 e5c6 b1c3 g8f6 f1d3 d7d6 e1g1 f8e7 d4d5 c6e5 d3e2 e8g8 f2f4 e5g6
Playouts: 9187, Win: 43.64%, PV: c6e5 d2d4 e5c6 b1c3 g8f6 f1d3 d7d6 e1g1 f8e7 d4d5 c6e5 d3e2 e8g8 f2f4 e5d7 g1h1
Playouts: 13780, Win: 43.79%, PV: f8c5 e5f3 g8f6 d2d3 d7d6 b1c3 a7a6 f1e2 e8g8 e1g1 f8e8
Playouts: 18371, Win: 43.86%, PV: f8c5 e5f3 g8f6 d2d3 d7d6 b1c3 a7a6 f1e2 b7b5 e1g1 e8g8 a2a3 f8e8

@Error323 What happens if you do a similar test on your weights trained on human games?

Zeta36 commented 6 years ago

I'm pretty sure the issue comes from the value head.

I've been thinking about training the value output in a supervised way not just with the game result z = [-1, 0, 1], but with a real score (calculated by Stockfish). In Python it's very easy to create this dataset; it just involves using the "python-chess" library to connect to Stockfish and get the evaluation of any board state (in any game inside the PGN files):

import chess.pgn
import chess.uci

handler = chess.uci.InfoHandler()
engine = chess.uci.popen_engine("C:\\xxxxxxx\\stockfish-8-win\\Windows\\stockfish_8_x64.exe")
engine.info_handlers.append(handler)
engine.position(node.board())            # node: a game node iterated from a PGN (chess.pgn)
engine.go(movetime=1000)                 # let Stockfish search for 1 second
z = handler.info["score"][1].cp / 100.0  # centipawn score -> pawns

Unfortunately I don't have GPU right now to train.

I guess you could do this same thing in C++ easily and check whether the model is able to predict the real value score of a board state. If the value head is not able to converge with this truly supervised value dataset, then we would have a problem with the model used (and DeepMind would probably have some secret not mentioned in their last paper).

gcp commented 6 years ago

I've been thinking about training the value output in a supervised way not just with the game result z = [-1, 0, 1], but with a real score (calculated by Stockfish).

People who tried this in Go always got considerably worse results. Even AZ doesn't predict the UCT search outcome value, but the eventual game value, so it seems to apply to chess as well.

Zeta36 commented 6 years ago

But the model's value head is precisely the one in charge of saying how good a board state is, so I don't know why it should not converge using a real score calculated by Stockfish (??). Do you have some theoretical explanation for what you said (apart from the practical experiments you mention)?

The eventual game result has a complete correlation with the current board state value, doesn't it?

Anyway, I think it's an easy test you can do in order to figure out whether the model is able to learn at least the piece values.