maxpumperla / deep_learning_and_the_game_of_go

Code and other material for the book "Deep Learning and the Game of Go"
https://www.manning.com/books/deep-learning-and-the-game-of-go

Chapter 7 training reproducibility #108

Closed danphan closed 1 year ago

danphan commented 1 year ago

Hi @macfergus and @maxpumperla,

Thanks for writing such a lovely book! I have some (hopefully) quick questions about the discussion in Ch. 7, where you walk us through training a CNN to play go. Specifically, page 168 presents the results of training a CNN (using the one-plane encoder) on a sample of 100 games, which I reproduce below:

Epoch 1/5
12288/12288 [==============================] - 14053s 1s/step - loss: 3.5514 - acc: 0.2834 - val_loss: 2.5023 - val_acc: 0.6669
Epoch 2/5
12288/12288 [==============================] - 15808s 1s/step - loss: 0.3028 - acc: 0.9174 - val_loss: 2.2127 - val_acc: 0.8294
Epoch 3/5
12288/12288 [==============================] - 14410s 1s/step - loss: 0.0840 - acc: 0.9791 - val_loss: 2.2512 - val_acc: 0.8413
Epoch 4/5
12288/12288 [==============================] - 14620s 1s/step - loss: 0.1113 - acc: 0.9832 - val_loss: 2.2832 - val_acc: 0.8415
Epoch 5/5
12288/12288 [==============================] - 18688s 2s/step - loss: 0.1647 - acc: 0.9816 - val_loss: 2.2928 - val_acc: 0.8461

My first question is: why are there so many steps per epoch? If we are using 100 games, and each game lasts around 100 to 200 moves, then there should be on the order of 10,000 moves/instances in the training set. In the code, the batch size is 128, so we expect roughly 10,000/128 ~ 100 steps per epoch, right? When I try to reproduce these results, TensorFlow shows me 77 steps per epoch rather than 12,288, which is in line with my expectations. In the book's output, by contrast, the number of steps seems to be on the order of the number of moves in the training set (which I can only see happening if the steps_per_epoch argument isn't set explicitly in model.fit()).
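To make the counting concrete, here is a minimal sketch (toy data and a placeholder model, not the book's encoder or network) showing that with array inputs Keras uses ceil(num_samples / batch_size) steps per epoch unless steps_per_epoch is passed explicitly:

```python
import numpy as np
from tensorflow import keras

# Toy stand-in for roughly 10,000 encoded 19x19 board positions (one plane).
num_samples, batch_size = 9856, 128
x = np.random.rand(num_samples, 19, 19, 1)
y = keras.utils.to_categorical(
    np.random.randint(0, 19 * 19, size=num_samples), num_classes=19 * 19)

# Tiny placeholder model: predict one of the 361 board points per position.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(19, 19, 1)),
    keras.layers.Dense(19 * 19, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Keras infers ceil(9856 / 128) = 77 steps per epoch from the array size;
# only an explicit steps_per_epoch argument would change that count.
model.fit(x, y, batch_size=batch_size, epochs=1)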

My second question is related to this epoch question. On page 167, it says

Note that if you run this code yourself, you should be aware of the time it may take to complete this experiment. If you run this on a CPU, training an epoch might take a few hours.

However, when training my network, I find that one epoch takes less than a minute. I assume this is related to the discrepancy mentioned above (77 vs. 12,288 steps per epoch).

Lastly, I find that my accuracy is substantially lower than what is reported in the book. For reference, here is the output of my code, where I've trained (what should be) the same model:

Epoch 1/5
77/77 [==============================] - 58s 741ms/step - loss: 5.8893 - accuracy: 0.0036 - val_loss: 5.8886 - val_accuracy: 0.0042
Epoch 2/5
77/77 [==============================] - 51s 660ms/step - loss: 5.8884 - accuracy: 0.0036 - val_loss: 5.8878 - val_accuracy: 0.0041
Epoch 3/5
77/77 [==============================] - 52s 676ms/step - loss: 5.8876 - accuracy: 0.0042 - val_loss: 5.8869 - val_accuracy: 0.0040
Epoch 4/5
77/77 [==============================] - 55s 710ms/step - loss: 5.8867 - accuracy: 0.0045 - val_loss: 5.8860 - val_accuracy: 0.0049
Epoch 5/5
77/77 [==============================] - 58s 755ms/step - loss: 5.8858 - accuracy: 0.0052 - val_loss: 5.8850 - val_accuracy: 0.0053

The main difference is that I've modified the code to work with TensorFlow 2; however, the changes were minimal, and I don't believe they are the cause.
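For context, the adjustments were along these lines (a sketch of the kind of edits I mean, not the exact diff against the repo):

```python
# Old-style standalone Keras imports, e.g.:
# from keras.models import Sequential
# from keras.layers.core import Dense

# TF2 equivalents, importing Keras through tensorflow instead:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# model.fit_generator() is deprecated in TF2; model.fit() accepts generators
# directly, so calls like
#   model.fit_generator(generator, ...)
# become
#   model.fit(generator, ...)
```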

Thanks for the help! Dan

macfergus commented 1 year ago

Hi @danphan, glad you are enjoying the book. Unfortunately, there is a mistake in the listings in chapter 7: the sample output doesn't match the settings in the code listing. This page has a full explanation: https://kferg.dev/posts/2021/deep-learning-and-the-game-of-go-training-results-from-chapter-7/

Let me know if that helps.

danphan commented 1 year ago

Thanks for your response, Kevin!

I'm glad to hear that I didn't massively screw things up somehow. After bumping up num_games and training for more epochs, the loss and accuracy improved dramatically. For anyone else reading this, the most important factor for me was switching the optimizer from sgd to adam. While adagrad is better than plain sgd here, I found both to be extremely slow compared to adam.