glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Policy and value heads are from AlphaGo Zero, not Alpha Zero #47

Closed gcp closed 6 years ago

gcp commented 6 years ago

https://github.com/glinscott/leela-chess/blob/09eb87f76ce85a9a6f9ac697f3abec921e93df0a/training/tf/tfprocess.py#L366

The structure of these heads matches Leela Zero and the AlphaGo Zero paper, not the Alpha Zero paper.

The policy head convolves the last residual output (say 64 x 8 x 8) with a 1 x 1 convolution into 2 x 8 x 8 outputs, and then converts that with an FC layer into 1924 discrete outputs.

Given that 2 x 8 x 8 only has 128 possible elements that can fire, this seems like a catastrophic loss of information. I think it can actually only represent one from-square and one to-square, so only the best move will be correct (and accuracy will look good, but not loss, and it can't reasonably represent MC probabilities over many moves).

In the AGZ paper they say: "We represent the policy π(a|s) by a 8 × 8 × 73 stack of planes encoding a probability distribution over 4,672 possible moves." Which is quite different.

They also say: "We also tried using a flat distribution over moves for chess and shogi; the final result was almost identical although training was slightly slower."

But note that for the above-mentioned reason it is almost certainly very suboptimal to construct the flat output from only 2 x 8 x 8 inputs. This works fine for Go because moves only have a to-square, but chess also has from-squares. 64 x 8 x 8 may be reasonable, if we forget about underpromotion (we probably can).
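
For illustration, a minimal Keras-style sketch of a policy head along those lines: a wider 1 x 1 convolution (say 64 filters) feeding the flat 1924-move FC layer. This is a hypothetical sketch, not the actual tfprocess.py code, and the names are made up:

    import tensorflow as tf

    def policy_head(tower, filters=64, num_moves=1924):
        # 1x1 convolution to `filters` planes (instead of 2), then BN + ReLU,
        # then a single fully connected layer producing all 1924 move logits.
        x = tf.keras.layers.Conv2D(filters, 1, use_bias=False,
                                   data_format='channels_first')(tower)
        x = tf.keras.layers.BatchNormalization(axis=1)(x)
        x = tf.keras.layers.Activation('relu')(x)
        x = tf.keras.layers.Flatten()(x)
        return tf.keras.layers.Dense(num_moves)(x)  # move logits

    tower = tf.keras.Input(shape=(64, 8, 8))  # e.g. a 64-filter residual tower
    logits = policy_head(tower)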

The value head has a similar problem: it convolves to a single 8 x 8 output, and then uses an FC layer to transform those 64 outputs into... 256 outputs. This does not really work either.

The value head isn't precisely described in the AZ paper, and a single 1 x 8 x 8 is probably good enough, but the 256 units in the FC layer make no sense then. The problems the value head has right now might have a lot to do with the fact that the input to the policy head is broken, so the residual stack must try to compensate for this.

gcp commented 6 years ago

Thinking about it, the value head coming from a single 1 x 8 x 8 means it can only represent 64 evaluations. This would already be too little for Go, where ahead or behind can be represented in stones. But for chess, where we often talk about centipawns, it's even worse.
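
For comparison, a similar hypothetical Keras-style sketch (again, not the repo's actual code) of a value head that convolves to more than one plane, say 32 filters, so the hidden FC layer sees 32 * 8 * 8 = 2048 inputs instead of 64:

    import tensorflow as tf

    def value_head(tower, filters=32, hidden=256):
        # 1x1 convolution to `filters` planes instead of 1, so the hidden FC
        # layer sees filters*8*8 = 2048 inputs rather than just 64.
        x = tf.keras.layers.Conv2D(filters, 1, use_bias=False,
                                   data_format='channels_first')(tower)
        x = tf.keras.layers.BatchNormalization(axis=1)(x)
        x = tf.keras.layers.Activation('relu')(x)
        x = tf.keras.layers.Flatten()(x)
        x = tf.keras.layers.Dense(hidden, activation='relu')(x)
        return tf.keras.layers.Dense(1, activation='tanh')(x)  # eval in [-1, 1]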

Error323 commented 6 years ago

OMG How did I miss this :scream:. Kind of amazing it does what it does right now...

kiudee commented 6 years ago

I am also wondering now which kind of value head they used in AlphaZero, since it is not written in the paper. I will try a few architectures.

Error323 commented 6 years ago

@gcp According to the AlphaGo Zero paper, the network also applies a batchnorm and ReLU in both heads. Why did you skip this?

The output of the residual tower is passed into two separate ‘heads’ for computing the policy and value. The policy head applies the following modules: (1) A convolution of 2 filters of kernel size 1 × 1 with stride 1 (2) Batch normalization (3) A rectifier nonlinearity (4) A fully connected linear layer that outputs a vector of size 19² + 1 = 362, corresponding to logit probabilities for all intersections and the pass move. The value head applies the following modules: (1) A convolution of 1 filter of kernel size 1 × 1 with stride 1 (2) Batch normalization (3) A rectifier nonlinearity (4) A fully connected linear layer to a hidden layer of size 256 (5) A rectifier nonlinearity (6) A fully connected linear layer to a scalar (7) A tanh nonlinearity outputting a scalar in the range [−1, 1]

kiudee commented 6 years ago

We definitely need some kind of nonlinearity in the fully connected layer; otherwise we will only ever be able to learn a linear function for each head. I also see no reason not to apply batch normalization, but that is not as crucial.


Error323 commented 6 years ago

No, never mind, I didn't read correctly. He does apply both BN and the ReLU; it's encoded in the conv_block function. Sorry.

Error323 commented 6 years ago

Ok, given the above I'm trying out two different networks:

1. NN 64x5

Policy Head

  1. A convolution of 32 filters of kernel size 1 × 1 with stride 1
  2. Batch normalization
  3. A rectifier nonlinearity
  4. A fully connected linear layer that outputs a vector of size 1924

Value Head

  1. A convolution of 32 filters of kernel size 1 × 1 with stride 1
  2. Batch normalization
  3. A rectifier nonlinearity
  4. A fully connected linear layer to a hidden layer of size 256
  5. A rectifier nonlinearity
  6. A fully connected linear layer to a scalar
  7. A tanh nonlinearity outputting a scalar in the range [−1, 1]

2. NN 128x5

Policy Head

  1. A convolution of 73 filters of kernel size 1 × 1 with stride 1
  2. Batch normalization
  3. A rectifier nonlinearity
  4. A fully connected linear layer that outputs a vector of size 1924

Value Head

  1. A convolution of 32 filters of kernel size 1 × 1 with stride 1
  2. Batch normalization
  3. A rectifier nonlinearity
  4. A fully connected linear layer to a hidden layer of size 256
  5. A rectifier nonlinearity
  6. A fully connected linear layer to a scalar
  7. A tanh nonlinearity outputting a scalar in the range [−1, 1]

glinscott commented 6 years ago

Wow, great call @gcp. Glad you caught this before we kicked off the distributed learning process.

Error323 commented 6 years ago

Using the above networks I'm reaching an accuracy of 45% and 52% respectively. [training graph]

There is a strange periodicity that I'm not sure about. I'm using a shuffle buffer of 2^18 and 16 prefetches. [loss graph showing the periodicity]

Thoughts?

Zeta36 commented 6 years ago

I don't want to bother you again with the same thing, but your MSE plot is again near 0.15 which, scaled by 4, gives ~0.6 (you should stop scaling the MSE loss, by the way, at least while you are studying the real convergence of the value head).

That loss of around ~0.6 (given that we are working with three integer outcomes z = -1, 0, 1) means the NN learns nothing but a statistical mean response.

I repeated this a lot of times in the other post:

In fact I did a toy experiment to confirm this. As I mentioned, the NN was unable to improve after reaching 33% accuracy (~0.65 mean squared loss). And this makes sense if the NN always returns values very near the mean. Imagine we introduce a dataset of 150 games: ~50 are -1, ~50 are 0 and ~50 are 1. If the NN learns, for example, to always answer near 0, we get an MSE loss of 100/150 ≈ 0.66.
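
For reference, a tiny numpy check of that baseline (a toy illustration, not code from the repo):

    import numpy as np

    # 150 outcomes, 50 each of -1, 0 and +1, and a "network" that always
    # predicts 0 (the mean of the dataset).
    z = np.array([-1] * 50 + [0] * 50 + [1] * 50, dtype=float)
    pred = np.zeros_like(z)
    mse = np.mean((pred - z) ** 2)
    print(mse)        # 100/150 ~ 0.667
    print(mse / 4.0)  # ~0.167 when scaled to the [0, 1] output range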

kiudee commented 6 years ago

I also started training a network (on the stockfish data) which is at mse=0.09 (on the test set) after 10k steps. I will see if I observe the same periodicity.

Error323 commented 6 years ago

@Zeta36 I know, I didn't want to alter the result outputs without having a merge into master. Otherwise comparing results is just more confusing.

Error323 commented 6 years ago

@kiudee Did you also use a value head weight of 0.01?

kiudee commented 6 years ago

@Error323 Yes, but ignore my results for now. I think there was a problem with the chunks I generated. I will report back as soon as the chunks are generated (probably tomorrow).

kiudee commented 6 years ago

@Error323 Somewhere, there must be a bug. I re-chunked the Stockfish data, converted it to train/test and let it learn for a night, and got the following (too good to be true and likely massively overfit) result: [tensorboard screenshot] And yes, the network basically plays "random".

I am using the following config:

name: 'kb1-64x5'                       # ideally no spaces
gpu: 0                                 # gpu id to process on

dataset: 
  num_samples: 352000                  # nof samples to obtain
  train_ratio: 0.75                    # trainingset ratio
  skip: 16                             # skip every n +/- 1 pos
  input: './data/'                     # supports glob
  path: './data_out/'                  # output dir

training:
  batch_size: 512
  learning_rate: 0.1
  decay_rate: 0.1
  decay_step: 100000
  policy_loss_weight: 1.0
  value_loss_weight: 0.01
  path: '/tmp/testnet'

model:
  filters: 64
  residual_blocks: 5

Error323 commented 6 years ago

Hmmm, how did you generate the chunks and how many games are there? Given n chunks and skip size s, you should generate (n*15000)/s samples. This makes sure you're sampling across the entire set of chunks.
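
As a quick sanity check of that formula, plugging in the chunk count kiudee reports later in the thread (3354 chunks, skip 16):

    n_chunks, skip = 3354, 16
    print(n_chunks * 15000 // skip)  # 3144375 samples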

Over here things start to look promising @Zeta36: [training graph] The new network (red) was forked from the orange one, with the loss weight on the value head set to 1. The learning rate is set to 0.001 until the 600K'th step, after which it will decay to 0.0001. We are now dropping well below the statistical mean response on the MSE.

jkiliani commented 6 years ago

@gcp just implemented a validation split for the TensorFlow training code in Leela Zero (https://github.com/gcp/leela-zero/pull/747), and it fixed a lot of problems for the reinforcement learning pipeline there. You might check whether any of this is relevant for you as well...

gcp commented 6 years ago

They already have this fix in their training code (the skip in the configuration), I think, as well as the validation splits. The training data generation was changed quite a bit to deal with the 10x increase in inputs and the even faster network evaluation (8x8 vs 19x19) you get in chess.

kiudee commented 6 years ago

@Error323 I have 352000 games which result in 3354 chunks. So, from your equation I should set the number of samples to 3144375?

Zeta36 commented 6 years ago

Yes, it looks promising, @Error323. Can you check the playing strength of that network, looking for some kind of learned chess strategy? If you could compare the (strategic) playing strength of the orange NN against the new red one, it'd be great.

If the red network really learned something beyond the mean, I think we should already see it (it will probably be easier to see this improvement after the opening phase is over; the policy head is really strong and decisive in the first dozen moves).

kiudee commented 6 years ago

I found the bug that was causing the random play:

    innerproduct<32*width*height, NUM_VALUE_CHANNELS>(value_data, ip1_val_w, ip1_val_b, winrate_data);

I forgot to adjust the sizes of the inner products, which failed silently. I think we should replace these magic constants with global constants.

Error323 commented 6 years ago

I'm struggling with the same, hah, crap. I fully agree on turning the magic constants into globals. Any more places where you're seeing problems? I'm using 73 output planes in both heads. I don't have segfaults anymore, but I'm wondering whether I fixed everything correctly.

https://gist.github.com/Error323/46a05ab5548eaeac95916ea428dd9dec @gcp @glinscott did I miss anything?

glinscott commented 6 years ago

@Error323 one way to validate is to run it under asan. I do that with the debug build, and -fsanitize=address passed to both compiler and linker.

Your changes look correct to me though.

kiudee commented 6 years ago

@Error323 Looks good. After adjusting those places leela-chess started to play sane chess for me.

Error323 commented 6 years ago

Ok, so I'm still running an evaluation tournament of "orange" vs "red" from the graph above, orange having higher accuracy and higher MSE, red having lower accuracy but also lower MSE. So far the score is in favor of Orange at 23 - 43 - 25. Games can be found here: https://paste.ubuntu.com/26466921/

I know it's only 800 playouts, but I still think this is inferior play given the network and the dataset. Both still just give away material or fail to capture important material that is free for the taking. So I don't think we're ready for self play yet. The networks need to be better.

Currently I'm trying out a 64x5 network and a 128x5 with 0.1 loss weight on the value head. The value head is 8x8x8 -> 64 -> 1 and the policy head is 32x8x8 -> 1924. Let me know if you have different ideas/approaches.

glinscott commented 6 years ago

@Error323 interesting! What happens if you take a normal midgame position, run the network on it, and then e.g. remove the queen and re-run? Does the win percentage change?

Error323 commented 6 years ago

@glinscott Using the midgame from @kiudee it does seem to be well aware of the queen.

position fen 8/4R2p/6pk/5p2/8/4P1KP/1r3PP1/8 w - - 0 48
go
eval=0.669482, 801 visits, 11247 nodes, 801 playouts, 468 n/s
bestmove e7f7
position fen Q7/4R2p/6pk/5p2/8/4P1KP/1r3PP1/8 w - - 0 48                                                                               
go
eval=0.990093, 801 visits, 16811 nodes, 801 playouts, 479 n/s
bestmove a8a7

And the same initial FEN string with 16K playouts:

Playouts: 15654, Win: 75.49%, PV: e7c7 b2b1 g3f3 b1f1 g2g4 f5g4 h3g4 h6g5 f3g2 g5g4 g2f1 g4f3 c7h7 g6g5 h7h6 g5g4
eval=0.669482, 16001 visits, 239960 nodes, 16001 playouts, 443 n/s
bestmove e7c7

Error323 commented 6 years ago

I think at this point there are two valid approaches.

  1. Make sure we obtain a well-functioning network through supervised learning. This is partially uncharted territory, as DeepMind has not released specifics on their network, nor attempted supervised learning for chess. From the results thus far it seems like a delicate balance between the hyperparameters (loss weights, learning rate, decay function). Also, how do we define good? When are we confident enough?
  2. Go for the reinforcement learning self-play approach on a small network, i.e. 64x5. As we don't have a dedicated datacenter with a steady rate, we should brew our own learning-rate-decay function that observes the gradients of the losses and alters the learning rate accordingly (a rough sketch follows below). We could look into further optimizations for faster game simulations, e.g. TensorRT for NVIDIA cards; this might produce a significant boost in performance with the INT8 operators. Did you look into this yet @gcp?

Either way, with both approaches it might be good to build the learning-rate-decay function. @glinscott shall I give that a try?
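
A rough sketch of what such a loss-watching decay rule could look like (hypothetical Python, similar in spirit to Keras' ReduceLROnPlateau; none of these names exist in this repo):

    class PlateauDecay:
        """Drop the learning rate when the (smoothed) total loss stops improving."""

        def __init__(self, lr=0.1, factor=0.1, patience=10,
                     min_delta=1e-3, min_lr=1e-5):
            self.lr, self.factor, self.patience = lr, factor, patience
            self.min_delta, self.min_lr = min_delta, min_lr
            self.best = float('inf')
            self.bad_steps = 0

        def update(self, loss):
            # Call once per evaluation step with the current smoothed loss.
            if loss < self.best - self.min_delta:
                self.best, self.bad_steps = loss, 0
            elif self.bad_steps + 1 >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_steps = 0
            else:
                self.bad_steps += 1
            return self.lr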

Error323 commented 6 years ago

Sorry, I keep rambling, I can't stand the results >:(. I went back to the AlphaGo Zero paper and it seems that for supervised learning they trained a 256x20 network for 3 days with the following learning rate schedule:

| Steps | Learning rate |
| --- | --- |
| 0-200 | 0.1 |
| 200-400 | 0.01 |
| 400-600 | 0.001 |
| 600-700 | 0.0001 |
| 700-800 | 0.00001 |
| >800 | - (I don't know what this means, they stopped?) |

However, I forgot to incorporate the fact that they use minibatches of 2048, while I'm using 512. So I should probably multiply the number of steps by 4...
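
For reference, if we simply copy the fixed AGZ schedule, assuming the steps in the table are in thousands and scaling the boundaries by 4 for the smaller batch, a TF 1.x sketch (not the current tfprocess.py code) could look like this:

    import tensorflow as tf

    global_step = tf.train.get_or_create_global_step()
    # AGZ drops at 200K/400K/600K/700K steps; scaled by 4 for batch size 512.
    boundaries = [800000, 1600000, 2400000, 2800000]
    values = [0.1, 0.01, 0.001, 0.0001, 0.00001]
    learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)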

Zeta36 commented 6 years ago

0.185 MSE scaled down by 4 or real one?

Error323 commented 6 years ago


Edit: No, wait. Go doesn't have drawn games, i.e. only outcomes in {-1, 1}. So the statistical mean response scaled to [0, 1] is 0.5. So given the graph in their paper, it is probably the scaled-down version in [0, 1].

Zeta36 commented 6 years ago

@Error323, I think the only way to stabilize the value output (avoiding its convergence to a statistical mean output) is probably to use the evaluation process (and yes, I know that in AZ they removed this step with little explanation).

We may need the evaluator worker to discriminate between equally strong networks (the policy head learns very well and converges soon, with not much room left for improvement), basing the selection of the network on its value-head responses. In each evaluation round, the evolutionary-style tournament selection would pick the network that has learned something (in the value head) beyond the statistics of the data in the z domain.

In the last few days I've tried (in a local Keras implementation) lots of combinations, configurations and models, and I was unable to make a value head learn even a minimal chess strategy. All the apparent improvement in play was always due to the policy learning, and the value-head output always tended to settle on a statistical approximation of the mean of the dataset.

@Error323, do you think it'd be too complex for you to develop an evaluation process in a pipeline like this:

  1. Read n SL moves from PGN files.
  2. Train the best model (in the beginning it would be random) with this data.
  3. Evaluate the best model against this new-generation candidate. If the new generation wins > 55%, replace the best model, etc. (a rough sketch of such a loop follows below).

i.e., could you try to run the full AZ0 pipeline, but using SL data instead of self-play data, in order to check quickly how the value head evolves?

If I'm correct, after some changes of the best model the network should start showing real (and easy to see) chess evaluation results. Also, the (value) loss on an independent test dataset should improve with every change of the best model.
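
A rough sketch of that gating loop; every callable here (load_pgn_batch, train_on, play_match, clone) is a hypothetical placeholder, not existing code in this repo:

    def sl_gating_loop(best_net, load_pgn_batch, train_on, play_match, clone,
                       generations=50, gate=0.55):
        # Steps 1-3 above: train a candidate on SL data and only promote it
        # if it scores more than `gate` against the current best network.
        for _ in range(generations):
            data = load_pgn_batch()                      # 1. read n SL moves from PGNs
            candidate = train_on(clone(best_net), data)  # 2. train a candidate
            score = play_match(candidate, best_net)      # 3. fraction of points scored
            if score > gate:
                best_net = candidate
        return best_net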

Error323 commented 6 years ago

@Error323, I think the only way to stabilize the value output (avoiding its convergence to a statistical mean output) is probably to use the evaluation process (and yes, I know that in AZ they removed this step with little explanation).

I have to disagree, because AZ did indeed achieve this. Secondly, in the AlphaGo Zero paper they claim that their dual-head residual network improved the value MSE with supervised learning. I quote:

Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective regularizes the network to a common representation that supports multiple use cases.

And thirdly, I observe some correlation between the accuracy and value MSE. When the MSE goes up, the accuracy goes down.

I think I was being too impatient with my training runs. I'm currently training a 64x6 network with 32 planes going into each head, using the exact same hyperparameters as AGZ: a batch size of 2048, dropping the learning rate by a factor of 0.1 every 200K steps. Even with a significantly smaller network, the value MSE is already below the statistical mean response. [training graph] Furthermore, this is using the KingBase dataset of 1.5M games, but I changed the test set to a randomly sampled 5% of the entire dataset (a disjoint set of games, though). Therefore the accuracy bounces around a lot at the start.

Semi-offtopic @Zeta36: you never answered my question regarding the 18 input planes in your network instead of the 120 here. Also, from looking at your model, it seems you're not taking the topic of this issue into account. These two points would render all your current results pretty useless IMHO.

kiudee commented 6 years ago

@Error323 Do you have some weights you can upload for the current training run?

Zeta36 commented 6 years ago

@Error323, I know everything DeepMind says, but you know, I have to see it to believe it ;). The AZ paper is short and strange. We'll have to wait for an extended version to be published.

About your new results I ask you the same as always: can you see any improvement in play that is clearly related to the model learning chess strategy? If you let the model play beyond the opening moves (where the policy head starts to fail in its predictions), do you see any minimal sign that it has learned to evaluate the board state? Can you share some example of this?

Error323 commented 6 years ago

@Error323 Do you have some weights you can upload for the current training run?

https://github.com/glinscott/lczero-weights but note that this is far from done; it has been trained for over 22 hrs now. Btw, the weights in txt format are now ~50 MiB.

@Zeta36 I'll get back to you on that when the model is done :) which will take another 40 hrs approximately. You can play with the intermediate result yourself at the above location.

edit: @kiudee it might make sense to compile your own OpenBLAS if you have not done so already. Our heads are computationally quite intensive compared to leela-zero. OpenBLAS tuned to your specific hardware really makes a difference with the SIMD instructions.

Error323 commented 6 years ago

Ok, this is exciting. Given the uploaded network above, I used the FEN from @kiudee, 8/4R2p/6pk/5p2/8/4P1KP/1r3PP1/8 w - - 0 48, and let leela-chess think for 400K playouts. Its PV at the end was:

Playouts: 399664, Win: 72.27%, PV: f2f4 b2b1 g3f2 g6g5 f4g5 h6g5 e7h7 g5f6 h7c7 f6e5 c7c4 e5d5 c4d4 d5e5 d4c4 e5d5

Next I let stockfish (latest gen) play white with 1 min thinking time per move, and it corresponded exactly up to the 8th ply. [board diagram]

glinscott commented 6 years ago

@Error323 this is fantastic! I noticed that the speed went way down when I tested out a 64 channel network, although it should be less of a hit to move to 32. I'm thinking we probably have to move that part off to the GPU portion now, which shouldn't be too difficult.

Error323 commented 6 years ago

The network (white) at 496K steps, 8K playouts, against gnuchess tc=30/1: [game diagram] It's very aggressive as you can see. But I'm pleased :-) I think we can really start transitioning to self play.

Zeta36 commented 6 years ago

@Error323, you've done really good work. Now I feel like a fool :(.

And yes, it's surely time to start with self-training. You can count on me.

Btw, are you thinking about starting the self-training from scratch, or should we start from your SL model and improve it through self-play games (as DeepMind did with the first version of AlphaGo)?

Error323 commented 6 years ago

@Zeta36 No man, your comments and criticism helped a lot! Don't be silly, thank you!

Btw, are you thinking about starting the self-training from scratch, or should we start from your SL model and improve it through self-play games (as DeepMind did with the first version of AlphaGo)?

I'd really like to start from scratch; I think this "simple" network should already do quite well. But it'll be tricky (requires some thinking/experimentation) to determine when to drop the learning rate.

lp-- commented 6 years ago

@Error323 Which weights are these? Those you posted above?

Error323 commented 6 years ago

@lp-- no, same model but at a later training step. Currently the training process is at 540K steps - 1d 19h of training. I'll post the final results when it's done (~800K steps), and then I would suggest making this our reference network to validate self-learning against.

glinscott commented 6 years ago

@Error323 that is fantastic :). That's a really nice game, shows good knowledge of important king defenders especially at the end there. A huge step forward in strength! Congratulations!

I'm getting closer and closer to having the client and server ready for the self-learning run, I have the DB schema mostly set up, and starting to write some tests for it. It's a golang + postgresql backend, pretty standard stuff, so shouldn't take too much longer.

Error323 commented 6 years ago

I'm getting closer and closer to having the client and server ready for the self-learning run, I have the DB schema mostly set up, and starting to write some tests for it. It's a golang + postgresql backend, pretty standard stuff, so shouldn't take too much longer.

Nice! I think we should also start to focus more on the C++ side again now. The nodes per second need to go up. I'm experimenting with caffe and TensorRT in #52 and I definitely think the heads should go to the GPU as you suggested in #51. Before we launch the self-learning we should be as fast as possible, to utilize all those precious cycles as best we can :) I was wondering what kind of GPUs people here have? Also, we need Windows testers.

gcp commented 6 years ago

We could look into further optimizations for faster game simulations, e.g. TensorRT for NVIDIA cards; this might produce a significant boost in performance with the INT8 operators. Did you look into this yet @gcp?

The problem of TensorRT is that it requires optimization to the specific weights of the network. It's OK once you have a trained network, but not if you're still training it or have variable weights. (I guess the question is if you could pre-transform the networks on the server machine - that might work)

INT8 support is exclusive to the GTX 1080 (Ti or not, I'm not sure?), AFAIK.

Anything that depends on cuDNN also requires the end user to make an account on NVIDIA's site and download themselves due to licensing restrictions. I don't know how TensorRT's license looks.

It's also NVIDIA-only. I try to stay as far away as possible from vendor lock in.

But some people have made a version of Leela Zero that does the network eval by a TCP server and have that calculated by Theano using cuDNN, for example. If you don't care for end-user setup complexities more things are possible.

gcp commented 6 years ago

@Error323 this is fantastic! I noticed that the speed went way down when I tested out a 64 channel network, although it should be less of a hit to move to 32. I'm thinking we probably have to move that part off to the GPU portion now, which shouldn't be too difficult.

There are implementations of this in Leela Zero's pull requests. (They were not merged because for Go there was no gain)

https://github.com/gcp/leela-zero/issues/185

Error323 commented 6 years ago

The problem of TensorRT is that it requires optimization to the specific weights of the network. It's OK once you have a trained network, but not if you're still training it or have variable weights. (I guess the question is if you could pre-transform the networks on the server machine - that might work)

Indeed, that was the thinking: pre-transform once per new network and deploy across all NV-based workers.

INT8 support is exclusive to the GTX 1080 (Ti or not, I'm not sure?), AFAIK.

I believe it's also available on the latest Titans and their Volta architecture. Maybe the 1080 too? Not sure about that though.

Anything that depends on cuDNN also requires the end user to make an account on NVIDIA's site and download themselves due to licensing restrictions. I don't know how TensorRT's license looks.

This fact reaaaaaaaaaaaally sucks :(

It's also NVIDIA-only. I try to stay as far away as possible from vendor lock in.

I understand indeed, this is good. I just want to squeeze every drop of performance out of my nv cards.

But some people have made a version of Leela Zero that does the network eval by a TCP server and have that calculated by Theano using cuDNN, for example. If you don't care for end-user setup complexities more things are possible.

This sounds very inefficient?

Error323 commented 6 years ago

After training for 2 days and 20 hours, the network is done. Below you can see the tensorboard graphs: [tensorboard graphs] In order to see if training for so long helped, I did some quick experiments with various networks against gnuchess tc=30/1:

| Steps | Playouts | Rounds | lczero vs gnuchess |
| --- | --- | --- | --- |
| 796K | 40 | 10 | 0 - 9 - 1 |
| 296K | 4000 | 10 | 3 - 6 - 1 |
| 796K | 4000 | 10 | 8 - 1 - 1 |

kbb-net.zip contains the weights, config yaml file and results. With your permission @glinscott I'd like to suggest we hereby close #20.