glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Is it possible to train the value output in a supervised (or even self-play) manner as AZ paper explains? #20

Closed Zeta36 closed 6 years ago

Zeta36 commented 6 years ago

I've been working for a while on this kind of AZ project. I even started this project myself some months ago: https://github.com/Zeta36/chess-alpha-zero

And I have a doubt I'd like to ask you about. It's about the training of the value output. If I always backprop the value output with an integer (-1, 0, or 1), the NN should quickly get stuck in a local minimum, ignoring the input and always returning the mean of these 3 values (in this case 0). I mean, as soon as the NN learns to always return near-zero values while ignoring the input planes, there will be no more improvement, since it will have a high accuracy (>25%) almost immediately after a few steps.

In fact I did a toy experiment to confirm this. As I mentioned, the NN was unable to improve after reaching 33% accuracy (~0.65 mean squared error loss). And this makes sense if the NN is always returning 0 (or very near-zero values). Imagine we feed in a dataset of 150 games: ~50 are -1, ~50 are 0 and ~50 are 1. If the NN learns to always output near 0, we get an instant (MSE) loss of 100/150 ≈ 0.66 and an accuracy of ~33% (1/3).
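
That back-of-the-envelope calculation can be checked directly; a minimal sketch (assuming outcomes on the [-1, 1] scale and a constant prediction of 0):

```python
import numpy as np

# 150 game outcomes: ~50 losses (-1), ~50 draws (0), ~50 wins (+1).
z = np.array([-1] * 50 + [0] * 50 + [1] * 50, dtype=float)

# A value head that ignores the input and always predicts the mean (0).
v = np.zeros_like(z)

mse = np.mean((v - z) ** 2)            # (50*1 + 50*0 + 50*1) / 150 = 100/150
accuracy = np.mean(np.round(v) == z)   # only the 50 draws count as "correct"

print(f"mse={mse:.3f}  accuracy={accuracy:.1%}")   # mse=0.667  accuracy=33.3%
```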

How the hell did DeepMind manage to train the value network with just 3 integer values to backpropagate?? I thought the tournament selection (the evaluation worker) was involved in helping to overcome this local minimum (stabilizing the training), but in their latest paper they say they removed the eval process (??)... so I don't really know what to think.

I don't know whether self-play can help with this issue either. In the end we are still backpropagating an integer from a domain of just 3 values.

Btw, you can see in our project at https://github.com/Zeta36/chess-alpha-zero that we got some "good" results (in a supervised way), but I suspect it was all thanks to the policy network guiding the MCTS exploration (with a value function always returning near-zero values).

What do you think about this?

glinscott commented 6 years ago

Hi @Zeta36! Yes, getting the network away from draws does seem to be a challenge. The first step is to validate that with supervised training we can get to a reasonable strength, and that the search is not making any huge blunders. Then potentially do some self-play with that network and see if it can learn to improve itself.

Zeta36 commented 6 years ago

DeepMind's paper does not say they get rid of the draws. Moreover, they say they count as a draw any game lasting more than n moves (n >= the average number of moves in a chess game), so the bias toward the NN getting stuck in a local minimum, always saying 0, would be even greater.

I don't know, it's all very strange.

Error323 commented 6 years ago

I do think the dual head helps; both the probability head and the value head are backpropagated. What happens when you remove the draw games?

Zeta36 commented 6 years ago

But the dual head ends the value part with an independent FC layer that can easily be backpropped to weights near zero, so the head would always return 0 while barely affecting the rest of the shared network. I'm not so sure the dual head can solve the problem.
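
For reference, the AGZ-style value head is roughly the following (a sketch in Keras notation with assumed layer sizes, not this repo's actual code); the last two Dense layers are the "independent FC" part in question:

```python
import tensorflow as tf

def value_head(tower, fc_size=256):
    """Rough AGZ-style value head on top of the shared residual tower (sketch)."""
    x = tf.keras.layers.Conv2D(1, kernel_size=1, use_bias=False)(tower)  # 1x1 conv
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(fc_size, activation="relu")(x)
    # Final scalar in [-1, 1]; driving these weights toward zero is the
    # hypothesized "always predict 0" collapse.
    return tf.keras.layers.Dense(1, activation="tanh")(x)
```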

glinscott commented 6 years ago

The network does appear to (slowly) be getting better from the supervised learning. This happened after I dropped the learning rate from 0.05 to 0.005.

Latest results:

step 576000, policy loss=2.67612 mse=0.0616492 reg=0.400642 (3618.78 pos/s)
step 576000, training accuracy=28.0371%, mse=0.0602521
Error323 commented 6 years ago

@glinscott can you examine the FC layer of the value head in the latest weights.txt? Show some statistics? I agree with @Zeta36 that it would indeed be problematic if the weights converge to 0.

I'll see if I can write out tf.example bytes and learn from that. Good idea.

Zeta36 commented 6 years ago

@glinscott you have to focus on the MSE loss (the value head part). If I'm correct, the model (the head part) should converge fast (to around 0.66, with nearly 33% accuracy) and get stuck returning always (very near) zero values.

glinscott commented 6 years ago

@Zeta36 when playing games, the network is definitely returning values that are not close to zero. E.g. in the following position, we can see the network is learning that white is winning. Unfortunately, the policy has it playing Ne5, which is a repetition...

1r1n4/1p1P1b2/2p1p1nk/PPP3rp/3PPP2/3N3P/1R2N1B1/3Q1RK1 w - - 7 45
eval: 0.892773
Ne5 0.549349
Rb4 0.069966
fxg5+ 0.046496
Qa4 0.042237
Qc2 0.037499
Nb4 0.031261
Rb3 0.025240
Zeta36 commented 6 years ago

@glinscott do you have a summary of your current value head loss (and accuracy)?

glinscott commented 6 years ago

Here are the latest training steps:

step 580100, policy loss=2.64198 mse=0.0594456 reg=0.395409 (2585.08 pos/s)
step 580200, policy loss=2.64212 mse=0.0593968 reg=0.395284 (2826.09 pos/s)
step 580300, policy loss=2.64205 mse=0.0591733 reg=0.395159 (3744.2 pos/s)
step 580400, policy loss=2.64157 mse=0.0593887 reg=0.395034 (3165.17 pos/s)
step 580500, policy loss=2.64044 mse=0.0594478 reg=0.394909 (3733.18 pos/s)
step 580600, policy loss=2.63911 mse=0.0594092 reg=0.394784 (3285.71 pos/s)
step 580700, policy loss=2.63862 mse=0.0591817 reg=0.39466 (3275.92 pos/s)
step 580800, policy loss=2.63818 mse=0.0592512 reg=0.394536 (3754.25 pos/s)
step 580900, policy loss=2.6374 mse=0.0594759 reg=0.394412 (3215.41 pos/s)
step 581000, policy loss=2.63621 mse=0.0593435 reg=0.394289 (3251.14 pos/s)
step 581100, policy loss=2.63555 mse=0.0594918 reg=0.394166 (3780.63 pos/s)
step 581200, policy loss=2.6349 mse=0.0597201 reg=0.394043 (3297.47 pos/s)
step 581300, policy loss=2.63339 mse=0.0596737 reg=0.393921 (3817.32 pos/s)
step 581400, policy loss=2.6337 mse=0.0595412 reg=0.393798 (3266.03 pos/s)
step 581500, policy loss=2.6338 mse=0.0591958 reg=0.393676 (3276.31 pos/s)
step 581600, policy loss=2.63308 mse=0.0587861 reg=0.393554 (3790.01 pos/s)
step 581700, policy loss=2.63232 mse=0.0587501 reg=0.393433 (3224.15 pos/s)
step 581800, policy loss=2.6325 mse=0.0587086 reg=0.393312 (3715.49 pos/s)
step 581900, policy loss=2.6318 mse=0.0589177 reg=0.393191 (3213 pos/s)
step 582000, policy loss=2.6317 mse=0.0587718 reg=0.39307 (3260.37 pos/s)
step 582000, training accuracy=29.0234%, mse=0.0632076

The MSE loss is divided by 4 to match Google's results though.
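
Presumably the factor of 4 comes from rescaling outcomes from [-1, 1] to [0, 1] (as in the AGZ plots), which shrinks the squared error by exactly 4; a quick sketch of that identity:

```python
import numpy as np

z = np.array([-1.0, 0.0, 1.0, 1.0])    # game results on the [-1, 1] scale
v = np.array([-0.2, 0.1, 0.6, 0.9])    # value-head predictions on the same scale

mse_full = np.mean((v - z) ** 2)                          # [-1, 1] scale
mse_rescaled = np.mean(((v + 1) / 2 - (z + 1) / 2) ** 2)  # [0, 1] scale

assert np.isclose(mse_rescaled, mse_full / 4)
```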

Zeta36 commented 6 years ago

Hmmmm... well, maybe I'm wrong, but I don't see why I'm wrong. It should get stuck; it's simple statistics. You are using supervised data, aren't you? Do you know if the PGN you are using is biased in some direction (with few or no drawn games, or something like that)? Also, how many positions are you using for the optimization? If you are using few (or biased) positions, the NN could find a way to escape the convergence to 0.

glinscott commented 6 years ago

The pgn is from @gcp, who used SF self-play games (https://sjeng.org/dl/sftrain_clean.pgn.xz). Stats indicate it's a normal sample of high-level chess games.

   6264 [Result "0-1"]
  15382 [Result "1-0"]
  53322 [Result "1/2-1/2"]
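
With those counts the MSE-optimal constant prediction is no longer 0, so the bias does change the picture; a rough sketch (outcomes on the [-1, 1] scale):

```python
import numpy as np

counts = {-1: 6264, 0: 53322, 1: 15382}    # 0-1, 1/2-1/2, 1-0 totals above
z = np.concatenate([np.full(n, r, dtype=float) for r, n in counts.items()])

best_const = z.mean()                      # MSE-optimal constant prediction
print(f"best constant = {best_const:.3f}")                            # ~0.12, not 0
print(f"mse(best constant) = {np.mean((best_const - z) ** 2):.3f}")   # ~0.27
print(f"mse(always 0) = {np.mean((0.0 - z) ** 2):.3f}")               # ~0.29
```
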
Zeta36 commented 6 years ago

Mmmm, perhaps it's a normal sample of high-level chess games, but there is a statistical bias in those results: too many white wins versus black wins. That could let the model escape the fast convergence to zero.

I don't want to bother you too much, but it would be great to check with 33% white wins, 33% draws and 33% black wins. Maybe even training with the same number of 0-1 and 1-0 results would be enough to rule out my hypothesis.

And this could be important, because with a random initial self-play model there will probably be roughly equal numbers of white and black wins in the beginning (which, if I'm correct, could cause a fast and bad convergence of the value head).
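
A sketch of how such a balanced subset could be drawn from a PGN (using python-chess here just for illustration; not part of the repo's tooling):

```python
import random
import chess.pgn

def balanced_games(pgn_path, seed=0):
    """Subsample a PGN so that 1-0, 0-1 and 1/2-1/2 appear in equal numbers."""
    by_result = {"1-0": [], "0-1": [], "1/2-1/2": []}
    with open(pgn_path) as handle:
        while (game := chess.pgn.read_game(handle)) is not None:
            result = game.headers.get("Result")
            if result in by_result:
                by_result[result].append(game)

    n = min(len(games) for games in by_result.values())   # size of smallest class
    rng = random.Random(seed)
    picked = [g for games in by_result.values() for g in rng.sample(games, n)]
    rng.shuffle(picked)
    return picked
```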

glinscott commented 6 years ago

@Zeta36 agreed, I do want to implement that. But the network at least appears to be learning well with these games so far:

[image: training progress chart]

And the latest results of the test match against the random network: Score of lc_new vs lc_base: 24 - 1 - 8 [0.848] 33

Akababa commented 6 years ago

Nice results! Did you have a chance to play against the model yourself to see if it may be overfitting to "high-level" games? I believe I had a problem early on where all my training data had little material variance so the model had no chance to learn material imbalances (and the idea behind introducing variance/dirichlet noise to the self-play generator could partly be to "remind" the network what bad positions look like, improving training stability).

Another idea to eliminate the white/black bias that @Zeta36 brought up is to flip the board (an isomorphic transform from the space of "black-to-move" positions into the space of "white-to-move" positions). I think this also adds extra regularization at no cost.
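
A sketch of that flip with python-chess (assuming the stored outcome z is from white's point of view; illustrative only):

```python
import chess

def color_flipped(board: chess.Board, z: float):
    """Mirror the position vertically, swap piece colors and side to move,
    and negate the white-relative outcome. This doubles the data and removes
    any white/black result bias."""
    return board.mirror(), -z

# Example: a position where white went on to win (z = +1) becomes the
# color-swapped position where the (new) white side went on to lose (z = -1).
board = chess.Board()
board.push_san("e4")
flipped, z_flipped = color_flipped(board, +1.0)
print(flipped.fen(), z_flipped)
```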

glinscott commented 6 years ago

It's definitely still making a ton of mistakes. Here is an example game with the new network as white, random weights as black:

1. d4 e6 2. c4 d6 3. Nc3 {0.52s} Bd7 {0.51s} 4. e4 {0.54s} Qe7 5. Nf3 {0.50s}
Nh6 6. Be2 f6 7. O-O f5 8. exf5 g5 9. fxg6 Bb5 10. cxb5 Qf6 11. Re1 Qxg6 12. Bd3
Ke7 {0.71s} 13. Bf4 {0.50s} Kf6 14. a4 Be7 {0.50s} 15. a5 Rf8 16. h3 Qh5
17. Bc4 {0.53s} Qc5 18. dxc5 {0.52s} Kf5 19. Qd2 {0.52s} Kg6 20. Bxe6 {0.52s}
Kh5 21. cxd6 {0.50s} Bf6 22. Nd5 {0.51s} Bc3 {0.50s} 23. Qxc3 Rh8 24. Qe5+ Nf5
25. Bf7# {White mates} 1-0

You can see that after 12. Bd3 Ke7 the queen is hanging, but the network doesn't see it. I suspect setting the network to predict just the move SF played is hurting here. When training in self-play, it learns to predict the probabilities of all the moves visited by UCT, which seems much more robust. Still this seems like a solid baseline. Good proof the code is mostly working too :).
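
For contrast, a sketch of what a visit-count target looks like next to a one-hot SF-move target (move names and counts here are made up for illustration):

```python
import numpy as np

# Hypothetical root visit counts from an 800-simulation search.
visits = {"Ne5": 430, "Rb4": 160, "fxg5+": 120, "Qa4": 90}

def visit_count_target(visits, temperature=1.0):
    """pi(a) proportional to N(a)^(1/T): a soft target over every explored move."""
    n = np.array(list(visits.values()), dtype=float) ** (1.0 / temperature)
    return dict(zip(visits, n / n.sum()))

soft_target = visit_count_target(visits)                  # {'Ne5': 0.54, 'Rb4': 0.20, ...}
one_hot_target = {m: float(m == "Ne5") for m in visits}   # supervised SF-move target
```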

I've uploaded these weights to https://github.com/glinscott/lczero-weights/blob/master/best_supervised_5_64.txt.gz if others are interested.

The match score against the random network was:

Score of lc_new vs lc_base: 95 - 1 - 4  [0.970] 100
ELO difference: 604

(amusingly it lost the very last game)

Zeta36 commented 6 years ago

Curiously, it seems to play more or less like @Akababa's best results. Has your model already converged, @glinscott? Maybe we are facing some kind of limitation in the model we are using (following AZ0).

glinscott commented 6 years ago

@Zeta36 I don't think it's converged yet, but I've had to drop the learning rate twice so far. So it's probably getting close: [image: training progress chart]

Zeta36 commented 6 years ago

I see. And don't you think it's a little strange that with such good convergence (in both the policy and value heads) the model still makes so many obvious errors? @Akababa tried lots of times (also with very good convergence and with very rich, big PGN files), but the model was never able to get rid of all the blunders (nor to show any long-range strategy). Maybe there is something profound about the model itself that makes it unable to learn well (something DeepMind didn't explain in their latest paper, or something like that).

Akababa commented 6 years ago

I actually think it's very reasonable for the model to blunder when trained on one-hot Stockfish moves in very balanced positions. I'd say this is a limitation of supervised training in this fashion, and as @glinscott pointed out, training on MCTS visit counts would be more robust because it's a good local policy improvement operator. But IIRC leelazero reached a decent level by this same method, so it could simply be a lack of MCTS playouts, or chess requiring more layers.

Also, without a validation split it's impossible to measure overfitting.

gcp commented 6 years ago

With only 80k games there is very likely to be a vast overfit in the value layer. You can control for this by lowering the MSE weighting in the total loss formula. Or by having more games (I'm still generating a ton), which is far better.

From this discussion I don't see why you think the network will converge to always returning 0.5 (or 0 in -1...1 range) though. It will reach that point quickly, but it will also be able to see that when one side has much more material (which is a trivial function of the inputs), the losses drop heavily when predicting a win for that side. And so on.
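
A sketch of the "trivial function of the inputs" point: a weighted popcount of binary piece planes already gives material balance, so even one linear neuron over those counts correlates strongly with the result (the plane layout below is assumed, not the repo's exact encoding):

```python
import numpy as np

# Assumed layout: planes[piece, color, rank, file], binary occupancy,
# piece order P, N, B, R, Q (kings ignored for material).
PIECE_VALUES = np.array([1, 3, 3, 5, 9], dtype=float)

def material_balance(planes: np.ndarray) -> float:
    """White material minus black material from binary occupancy planes."""
    counts = planes.sum(axis=(2, 3))                 # piece counts per (piece, color)
    return float(PIECE_VALUES @ (counts[:, 0] - counts[:, 1]))

planes = np.zeros((5, 2, 8, 8))
planes[4, 0, 0, 3] = 1      # a lone white queen
planes[3, 1, 7, 0] = 1      # a lone black rook
print(material_balance(planes))   # 9 - 5 = 4.0
```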

It is harder/slower to train on imbalanced categories, but it certainly isn't impossible.

gcp commented 6 years ago

But iirc leelazero reached a decent level by this same method so it could simply be lack of MCTS playouts or chess requires more layers.

It also had a ton more data! About 2.5M games times 8 rotations (not possible to use rotations in chess).

Zeta36 commented 6 years ago

@gcp, but once the model converges quickly (and deeply) into ignoring the input and always saying 0, the weights of the last FC layer will be so small that any backpropagated gradient would be almost negligible, wouldn't it?

"Go" does not have this (theoretical) problem since it has no zero result game (no draws).

About the number of games: @Akababa tried with a huge dataset of really big PGN files here: https://github.com/Zeta36/chess-alpha-zero, and even though he got good convergence of the MSE and policy losses, the model could only play more or less "good" games; he was not able to remove all the blunders, nor to get a model able to play strategically over the long range.

What do you think?

gcp commented 6 years ago

but once the model converges quickly (and deeply) into ignoring the input and always saying 0, the weights of the last FC layer will be so small that any backpropagated gradient would be almost negligible, wouldn't it

But why would it converge that way? Not all games are drawn. The mispredictions on the won or lost games will still cause big gradients. The predictions of 0 on the drawn games will cause no gradient. Predicting 0.1 on a game that was drawn will produce a tiny gradient compared to mispredicting a win as a draw. Still plenty of room to make the distinctions, as there is strong pressure (and actual gradient direction) on the network to correctly predict the 40% of games that are not drawn.
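
The gradient arithmetic behind that point, as a quick sketch: for a per-sample squared error (v - z)^2 the gradient with respect to v is 2(v - z), so:

```python
def mse_grad(v, z):
    """d/dv of the per-sample squared error (v - z)**2."""
    return 2.0 * (v - z)

print(mse_grad(0.0, 1.0))   # -2.0 : calling a won game a draw -> strong push
print(mse_grad(0.1, 0.0))   #  0.2 : slightly off on a drawn game -> tiny push
```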

"Go" does not have this (theoretical) problem since it has no zero result game (no draws).

Predicting 0 is still an easy way to get a quick reduction in MSE loss compared to always predicting some other constant value, or predicting randomly. So I'm not sure this argument even works!

About the number of games: @Akababa tried with a huge dataset of really big PGN files in here

This page talks about "1000" games or "3000" games. You want 2 orders of magnitude more, or you will get MSE overfitting, as I already pointed out.

Zeta36 commented 6 years ago

@Akababa tried with a lot more than 3000 games (even though the readme only mentions 1000 games).

About the quick convergence:

Imagine you have a worker playing self-play games (or a parser of PGN files). You get a chunk of 15,000 positions, for example. On average you will have roughly 5000 positions with z=-1, 5000 with z=0, and 5000 with z=1.

Then you run the optimization worker and it reads the 15,000 positions to backprop. The loss function is MSE, so there is a clear and easy (fast) way to reach a local minimum where the model ignores the input (the board) and always outputs 0 (the mean), simply by driving the weights of the last FC layer to values very near 0.

In this case, the optimization will quickly reach a high accuracy of 1/3 (33%) and a low MSE loss of 0.66. After very few steps the model would get stuck with this function, and no further improvement could take it out of this fast (and deep) convergence (because the weights of the last FC layer will be so small that any backpropagated gradient would be almost negligible).

I'm just talking theoretically, assuming the dataset has no bias but an almost perfect proportion of wins, losses and draws (-1, 0, 1), and assuming we use a dataset big enough that it cannot be overfitted.

gcp commented 6 years ago

because the weights of the last FC layer will be so small that any backpropagated gradient would be almost negligible.

I just don't see why this would happen for the reasons already stated. And unless there were bugs, does the practical result from @glinscott not show that it does not?

With 40% of the games producing a strong gradient towards anything that remotely correlates with the population count of many input planes[1], and all draw games producing an ever tinier gradient towards "always 0", how could you get stuck deeply in a local minimum? It just sounds so weird.

[1] This is why my dataset has no resignations, FWIW.

Zeta36 commented 6 years ago

@glinscott's results are based on a very biased dataset (with double the number of z=1 results compared to z=-1) and on a small, easy-to-overfit number of positions.

It would be great to check this with a much bigger number of positions and with an unbiased dataset (with nearly equal numbers of z=-1, 0 and 1).

Akababa commented 6 years ago

@Zeta36 What about using the flip-policy to regularize out the bias? It's equivalent to 2x data augmentation (which actually is possible in chess, @gcp) and possibly better. Although I'm not sure it's a huge concern to begin with, as the model should be complex enough to recognize imbalanced positions, so as long as you show it some black wins it should kill the bias in the long run. Training might be marginally faster with a maximum-entropy training set though.

gcp commented 6 years ago

https://sjeng.org/dl/sftrain_clean.pgn.xz now has about 155k games.

Error323 commented 6 years ago

Ok @glinscott I have interesting results here. I changed the tensorflow training code to use a separate training and validation set. I sampled 500K positions from @gcp's 80K games, converted them to raw tensorflow protobuf format and split them into 75% train and 25% test. I'm getting to an accuracy of 70% now using the default network (64 filters, 5 residual blocks). I changed the learning rate to 0.005 at ~100K steps. [image: training results chart] Also, I think augmenting the data by board flipping as @Akababa suggested is a good idea. We're already flipping the board to the current color. I'll push my code changes after some cleaning. Finally, I'll run some games against random.

Sample output:

step 137400, policy loss=1.88179 mse=0.0328728 reg=0.593448 (6585.84 pos/s)
step 137500, policy loss=1.88024 mse=0.0328149 reg=0.593368 (6593.21 pos/s)
step 137600, policy loss=1.88207 mse=0.0329415 reg=0.593287 (6589.3 pos/s) 
step 137700, policy loss=1.88264 mse=0.0329211 reg=0.593206 (6608.15 pos/s)
step 137800, policy loss=1.87931 mse=0.0328492 reg=0.593126 (6591.49 pos/s)
step 137900, policy loss=1.87931 mse=0.0328397 reg=0.593046 (6598.33 pos/s)
step 138000, policy loss=1.88183 mse=0.0329394 reg=0.592966 (6594.76 pos/s)
step 138000, training accuracy=68.9551%, mse=0.105712
Model saved in file: /home/fhuizing/Workspace/leela-chess/training/tf/leelaz-model-138000
Leela weights saved to /home/fhuizing/Workspace/leela-chess/training/tf/leelaz-model-138000.txt 
step 138100, policy loss=1.88048 mse=0.0328839 reg=0.592886 (5102.31 pos/s)
step 138200, policy loss=1.87727 mse=0.0328062 reg=0.592806 (6536.92 pos/s)
Zeta36 commented 6 years ago

Someone on reddit (https://www.reddit.com/r/MachineLearning/comments/7qdwb5/is_it_possible_to_train_the_value_output_in_a/) also commented on the problem of the value head tending to the mean (zero):

"I encountered a problem like this when I was developing my own chess implementation of AGZ (GitHub here:https://github.com/trebledawson/chess, but I'm not incredibly happy with how it turned out). My conclusion was that the size of my neural network was too small, and was thus underfitting the value output model. I attempted to increase the number of parameters until it was as large as my computer could store, but I still experienced the v-output tending toward the mean (i.e. 0)."

gcp commented 6 years ago

I changed the tensorflow training code to use a separate training and validation set.

I couldn't immediately tell from the code, but are you splitting on entire games? If you randomly partition positions, you'll need to make sure positions from training games aren't in the validation set.

(Else the network will just remember where the pieces ended up and what the game result was, and you won't detect that overfit. It'll easily do that at intermediate sizes.)
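
A sketch of splitting at game granularity rather than position granularity (helper and field names are illustrative, not the repo's parser):

```python
import random
from collections import defaultdict

def split_by_game(positions, train_frac=0.75, seed=0):
    """positions: iterable of (game_id, planes, policy_target, z) tuples.
    Keeps all positions of a game on the same side of the split, so the
    validation set never contains positions from training games."""
    games = defaultdict(list)
    for game_id, *sample in positions:
        games[game_id].append(tuple(sample))

    game_ids = sorted(games)
    random.Random(seed).shuffle(game_ids)
    cut = int(len(game_ids) * train_frac)

    train = [s for gid in game_ids[:cut] for s in games[gid]]
    test = [s for gid in game_ids[cut:] for s in games[gid]]
    return train, test
```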

Error323 commented 6 years ago

I sampled 500K positions from the ChunkParser in parse.py, see https://github.com/Error323/leela-chess/blob/master/training/tf/leela_to_proto.py, and used the first 75% for training and the last 25% for testing. There might be a bit of overlap, a game from each worker?

gcp commented 6 years ago

I guess it depends on the chunk sizes? The games aren't fed in in order: https://github.com/Error323/leela-chess/blob/f3646e28c6a544b519da8e8f9f7206f552354da1/training/tf/parse.py#L105

glinscott commented 6 years ago

Ok @glinscott I have interesting results here. I changed the tensorflow training code to use a separate training and validation set. Sampled 500K positions from @gcp 's 80K games, converted them to raw tensorflow protobuf format and split them into 75% train and 25% test. Getting to an accuracy of 70% now using the default network (64 filters, 5 residual blocks). I changed the learning rate to 0.005 at ~100K steps.

@Error323 wow, that is awesome! I guess I really limited myself by starting with a smaller subset of the games and keeping the weights from that run; the accuracy has still only made it up to 33%.

I added you on https://github.com/glinscott/lczero-weights, if you want to upload the weights there. Will be very interesting to see how it does against the version I trained. My hope is it destroys my weights :).

glinscott commented 6 years ago

I guess it depends on the chunk sizes?

@gcp the chunk sizes are 15000 positions per chunk for supervised learning currently: https://github.com/glinscott/leela-chess/blob/master/src/main.cpp#L308

Error323 commented 6 years ago

Let's see if I understand this correctly. A chunk during supervised training contains 15K samples; I spawned 11 workers, and each worker processes a full chunk before moving to the next random one. Each next(gen) call in leela_to_proto.py samples a position from one of the 11 chunks. So on average there would be 11*(70/2) positions of game overlap, given an average of 70 ply per game, which corresponds to 0.0031%. What do you think @gcp?

jkiliani commented 6 years ago

Would it be a reasonable assumption that leela-chess with a good supervised network should be able to compete with Giraffe (2400 Elo)?

gcp commented 6 years ago

I guess another way of saying this is that each chunk contains ~215 games, so assuming they're drawn from equally (not actually sure that is guaranteed?), we expect the first 161 to end up in training, and the rest in test, and maybe a game per chunk split between the two?

Doesn't sound like it would be much of an issue then.

Note that the overfit & validation stuff is mostly there to make judging things during supervised learning easier (which is why it was never completed with proper splits etc).

glinscott commented 6 years ago

Btw, the network that trained overnight (and only made it to 33% accuracy) is currently winning handily against the previous best version: 16-3-9. That's a great sign that the network is being used well by the search.

One thing we might need to do is introduce a minimalist opening book for the testing matches though. Google was using many threads to feed the 4 TPUs, while we are using only a few threads with one or two GPUs. That doesn't add much non-determinism to the games when played without noise. It's enough to avoid duplicate games, but they take a while to diverge from each other (e.g. 10-11 moves into the game). The cool thing is it has learned the Gruenfeld and Queen's Gambit Declined from raw SF games. That speaks well of SF, I guess :).

Edit: the new network won 49-13-38, so it's definitely improving!

gcp commented 6 years ago

@glinscott That shouldn't be needed. The current code is missing a call, which is why it does not diverge well: https://github.com/glinscott/leela-chess/blob/9b1e319eda079b66bf9b742ed5005877ade1e541/src/UCTSearch.cpp#L133

Which should be calling randomize_first_proportionally: https://github.com/gcp/leela-zero/blob/07b47a2e74febb0d6e2b10597690c19404cdc48d/src/UCTSearch.cpp#L149

Note that in AZ this wasn't move-number dependent; they left it always on for self-play and always off for testing matches. I think you want something like a UCI option to toggle this.

If you introduce an opening book you risk not finding out if 1.d4 is better than 1.e4 or the French better than the Caro-Kann, etc :-)

glinscott commented 6 years ago

@gcp Oh, interesting! I thought they only used the Dirichlet noise in AZ. That's my mistake, I will re-add that in. That probably hurt the training game generation a fair bit as well...

gcp commented 6 years ago

For reference, relevant section from the paper: "During training, each MCTS used 800 simulations. .... Moves are selected in proportion to the root visit count."

Earlier, there is "The search returns a vector π representing a probability distribution over moves, either proportionally or greedily with respect to the visit counts at the root state"

They don't say what the "greedily" mode is for, but logically it's what they use for evaluation games against other opponents. Note that with no Dirichlet noise and no proportional choice, the program is almost deterministic, which means you can't play testing games between two networks, as they will repeat the same game. And guess what, one of the differences between AlphaGo Zero and AlphaZero is that the latter no longer plays testing games to promote a network...

jkiliani commented 6 years ago

You could probably raise playing strength quite a bit over the training setting (temperature=1) without making the program deterministic, with a temperature setting around 0.25 to 0.5. This would effectively suppress moves with very small visit counts from selection, while still giving moves with visit counts close to the PV a reasonable chance of being picked.
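
A sketch of that selection rule: sample proportionally to N(a)^(1/T), where T=1 matches the self-play setting, T around 0.25-0.5 suppresses rarely-visited moves, and T→0 approaches the greedy choice used for evaluation games:

```python
import numpy as np

def pick_move(visits, temperature=1.0, rng=None):
    """Pick a move with probability proportional to N(a)^(1/T)."""
    rng = rng or np.random.default_rng()
    moves = list(visits)
    n = np.array([visits[m] for m in moves], dtype=float)
    if temperature <= 0:                      # greedy / evaluation mode
        return moves[int(n.argmax())]
    p = n ** (1.0 / temperature)
    return rng.choice(moves, p=p / p.sum())

root_visits = {"Ne5": 430, "Rb4": 160, "fxg5+": 120, "Qa4": 90}
print(pick_move(root_visits, temperature=0.5))
```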

Error323 commented 6 years ago

Final result from training: 93% accuracy. I lowered the learning rate 3 times: 0.05, 0.005, 0.0005, 0.00005. ~~Now I wanted to run some games with cutechess-cli, but the current master crashes for me @glinscott :( not sure why yet, was tinkering over ssh~~ I'm a dummy, it works. I'll upload the weights.

gcp commented 6 years ago

Accuracy of 93% sounds pretty crazy. It's actually predicting ~15-ply Stockfish searches with 93% accuracy? How damn strong is this already?

Error323 commented 6 years ago

I don't know o_O I can't run lczero atm. But yeah, it really does sound crazy; I almost think I messed up somehow :sweat: @glinscott could you run a tournament please? I don't have time for the next 8 hrs.

@gcp are the games from your dataset diverse enough I wonder?

gcp commented 6 years ago

@gcp are the games from your dataset diverse enough I wonder?

What does "diverse" mean? It's using a book to force diverse openings but you won't get totally crazy 5 pieces hanging positions of course, as it's Stockfish vs Stockfish. Using weaker players would weaken the policy part of the network.

Error323 commented 6 years ago

Yeah, I meant diversity of its openings. OK, then it's good. I guess it'll be decent against minimax-based engines, but contain huge gaps in its game-tree knowledge?

Error323 commented 6 years ago

Against random:

Score of supervised vs random: 66 - 6 - 28 [0.800] 100
Elo difference: 240.82 +/- 65.32
Finished match