maxpumperla / deep_learning_and_the_game_of_go

Code and other material for the book "Deep Learning and the Game of Go"
https://www.manning.com/books/deep-learning-and-the-game-of-go

Example train_generator.py does not achieve the accuracy it should #33

Closed Tuxius closed 5 years ago

Tuxius commented 5 years ago

Great fun reading this book :-)

Running the example train_generator.py of chapter 7.3 directly from the repository does not nearly achieve the accuracy of 98% mentioned in the book:

```
/mnt/sambashare/wip3/code# python3 train_generator.py
Using TensorFlow backend.
Reading cached index page
KGS-2019_03-19-1478-.tar.gz 1478
KGS-2019_02-19-1412-.tar.gz 1412
...
KGS-2003-19-7582-.tar.gz 7582
KGS-2002-19-3646-.tar.gz 3646
KGS-2001-19-2298-.tar.gz 2298
2019-05-05 18:59:54.645058: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-05 18:59:55.228716: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-05 18:59:55.229184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62 pciBusID: 0000:01:00.0 totalMemory: 7.76GiB freeMemory: 7.66GiB
2019-05-05 18:59:55.229201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-05 18:59:55.444275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-05 18:59:55.444323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-05-05 18:59:55.444330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-05-05 18:59:55.444587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7373 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Epoch 1/5
96/96 [==============================] - 3s 29ms/step - loss: 5.8890 - acc: 0.0032 - val_loss: 5.8879 - val_acc: 0.0028
Epoch 2/5
96/96 [==============================] - 2s 18ms/step - loss: 5.8877 - acc: 0.0028 - val_loss: 5.8867 - val_acc: 0.0029
Epoch 3/5
96/96 [==============================] - 1s 15ms/step - loss: 5.8864 - acc: 0.0032 - val_loss: 5.8853 - val_acc: 0.0036
Epoch 4/5
96/96 [==============================] - 2s 18ms/step - loss: 5.8850 - acc: 0.0036 - val_loss: 5.8837 - val_acc: 0.0033
Epoch 5/5
96/96 [==============================] - 2s 17ms/step - loss: 5.8833 - acc: 0.0031 - val_loss: 5.8817 - val_acc: 0.0033
```

I can't see why - any ideas are highly welcome!

Best

macfergus commented 5 years ago

Hi @Tuxius, I can get a 90%+ accuracy on the train_generator example, but I had to run it for many more epochs. Just increase the epochs variable on line 41. At around 180 epochs I get over 90% accuracy. If I also switch the optimizer from sgd to adadelta (see page 171), I can hit that same accuracy target a lot faster.
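The benefit of the optimizer swap can be seen even on a toy problem. The sketch below (pure Python, not the book's code; function names and hyperparameters are my own choices) minimizes f(x) = x² with plain fixed-rate SGD and with Adadelta, whose per-parameter step sizes adapt from running averages of squared gradients and squared updates, so no learning rate has to be hand-tuned.

```python
import math

def sgd(x0, lr=0.01, steps=200):
    # Plain SGD on f(x) = x^2 with a fixed learning rate.
    x = x0
    for _ in range(steps):
        g = 2 * x  # gradient of x^2
        x -= lr * g
    return x

def adadelta(x0, rho=0.95, eps=1e-6, steps=200):
    # Adadelta on f(x) = x^2: step sizes are derived from running
    # averages of squared gradients (eg2) and squared updates (edx2).
    x = x0
    eg2 = 0.0
    edx2 = 0.0
    for _ in range(steps):
        g = 2 * x
        eg2 = rho * eg2 + (1 - rho) * g * g
        dx = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * g
        edx2 = rho * edx2 + (1 - rho) * dx * dx
        x += dx
    return x

if __name__ == "__main__":
    print(f"loss after SGD:      {sgd(1.0) ** 2:.6f}")
    print(f"loss after Adadelta: {adadelta(1.0) ** 2:.6f}")
```

In the book's Keras setup the swap is just a matter of passing a different optimizer name to `model.compile`; which one converges faster on the real Go data is an empirical question, as the epoch counts above show.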

However, it is important to keep in mind that it's only capable of hitting that super-high accuracy because it's essentially memorizing a tiny dataset. With a larger training set, the observed accuracy will be a lot lower, but it should generalize to new games better, and therefore it will be more useful for game play.

Tuxius commented 5 years ago

Hi macfergus,

thanks, I tried that. It helped to some extent, but I still can't match your result or the book's. With epochs = 200 I only got 37% val_acc:

```
...
Epoch 1/200
    1/76936 [..............................] - ETA: 29:23:50 - loss: 5.8886 - acc: 0.0000e+00
    5/76936 [..............................] - ETA: 6:07:39 - loss: 5.8878 - acc: 0.0031
...
76936/76936 [==============================] - 1284s 17ms/step - loss: 1.9210 - acc: 0.5041 - val_loss: 2.7838 - val_acc: 0.3716
```

I can't see why.

macfergus commented 5 years ago

What was your value for num_games? With num_games = 100 I get this output:

```
...
Epoch 181/1000
88/88 [==============================] - 1s 11ms/step - loss: 0.3383 - acc: 0.9156 - val_loss: 0.3640 - val_acc: 0.9150
...
```

Note the "88/88" indicating there are only 88 positions in the training set -- which means it's easy for the network to memorize, but too small to really be useful. So that accuracy only demonstrates that everything is wired up correctly.

In your example I see 76936 positions in your training set, so it's a harder training set to memorize, but also a more realistic problem. 37% validation accuracy is pretty reasonable real-life accuracy. (I think it's probably possible to get close to 50%, but not much more than that, using the same general techniques covered in the book)
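The memorization effect is easy to reproduce in miniature. The toy below (illustrative only; the "positions" are just random keys, not the book's encoded boards) uses a lookup table as a "model": it scores perfectly on the 88 positions it has memorized, but near chance (roughly 1/361) on fresh positions.

```python
import random

random.seed(0)

NUM_MOVES = 361  # 19x19 board: one class per intersection

def make_positions(n):
    # Hypothetical stand-in for encoded Go positions: each "position"
    # is a random key, each label a random move index.
    return {f"pos-{random.random()}": random.randrange(NUM_MOVES)
            for _ in range(n)}

train = make_positions(88)    # tiny set, like the num_games = 100 run above
fresh = make_positions(1000)  # unseen positions

# A "model" that purely memorizes: a lookup table of training positions.
lookup = dict(train)

def accuracy(data):
    hits = sum(1 for pos, move in data.items()
               if lookup.get(pos, 0) == move)  # unknown position -> guess move 0
    return hits / len(data)

print(accuracy(train))  # memorized set: 1.0
print(accuracy(fresh))  # unseen set: near chance level
```

A real network interpolates rather than looking positions up exactly, which is why it needs ~180 epochs rather than one pass, but the failure mode on a tiny dataset is the same.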

Tuxius commented 5 years ago

Thanks, yes, I can replicate that result if I only take 88 positions:

```
Epoch 180/180
88/88 [==============================] - 1s 15ms/step - loss: 0.3280 - acc: 0.9124 - val_loss: 0.3290 - val_acc: 0.9149
```

You say that is because the network memorized the 88 positions. But why is then the val_acc better than 90%? Shouldn't the validation be independent of the learning sample set?
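One way a validation set can fail to be independent is if training and validation positions are drawn from the same games: consecutive positions in one game are nearly identical, so memorizing the training positions also "solves" the validation ones. A common safeguard, sketched below (helper name and fractions are my own, not the book's sampler), is to split at the game level before extracting positions, so no game contributes to both sets.

```python
import random

def split_by_game(game_ids, val_fraction=0.2, seed=42):
    # Assign whole games to either training or validation so that
    # positions from one game never appear in both sets.
    rng = random.Random(seed)
    shuffled = list(game_ids)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

games = [f"game-{i}" for i in range(100)]
train_games, val_games = split_by_game(games)

# No game is shared between the two sets.
assert not set(train_games) & set(val_games)
print(len(train_games), len(val_games))  # 80 20
```

Whether the generator in the repository splits by game or by position would be worth checking; with only 88 positions, even a game-level split leaves very similar positions on both sides.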