maxpumperla / deep_learning_and_the_game_of_go

Code and other material for the book "Deep Learning and the Game of Go"
https://www.manning.com/books/deep-learning-and-the-game-of-go

CH7: Small Conv Network Training Error - Conv2DCustomBackpropInputOp only supports NHWC. #80

Open · natekester opened this issue 3 years ago

natekester commented 3 years ago

Attempting to run the small convolutional network on macOS Big Sur.

Not sure what my issue is exactly - could be versions used. Any ideas what I can do to make it work?

tensorflow==2.4.0, Python 3.8.2

```
...
Epoch 1/5
Traceback (most recent call last):
  File "training_small.py", line 37, in <module>
    model.fit_generator(generator=generator.generate(batch_size, num_classes),
        epochs=epochs,
        steps_per_epoch=generator.get_num_samples() / batch_size,
        validation_data=test_generator.generate(batch_size, num_classes),
        validation_steps=test_generator.get_num_samples() / batch_size,
        callbacks=[ModelCheckpoint('../checkpoints/small_model_epoch_{epoch}.h5')])
  File "/Library/Python/3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1847, in fit_generator
    return self.fit(
  File "/Library/Python/3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/Library/Python/3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/Library/Python/3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/Library/Python/3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "/Library/Python/3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/Library/Python/3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
    outputs = execute.execute(
  File "/Library/Python/3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
	 [[node gradient_tape/sequential/conv2d_3/Conv2D/Conv2DBackpropInput (defined at training_small.py:37) ]] [Op:__inference_train_function_781]

Function call stack:
train_function
```

macfergus commented 3 years ago

Hi Nate, I actually just ran into this same problem myself recently. This is an issue with the channels_first/channels_last options for indexing tensors (also known as NCHW/NHWC); see appendix A for a brief discussion. Unfortunately, TensorFlow has dropped some support for channels_first indexing, I believe since TF 2.0.

The two options are:

  1. Downgrade TensorFlow and Keras to 1.8.x and 2.2.x, respectively -- those are the versions we used while writing the book; or
  2. Search and replace channels_first with channels_last in the code (pretty much anywhere you create a Conv2D layer; a sketch follows below)

Either one ought to fix it -- let us know if that works for you!
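
For reference, a minimal sketch of option 2, assuming a one-plane encoder; the layer sizes are illustrative and this is not the book's exact small-network stack. The key change is `data_format='channels_last'` together with an input shape of (rows, cols, planes) rather than (planes, rows, cols).

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten

# Illustrative layer sizes only -- not the book's exact small network.
board_rows, board_cols, num_planes = 19, 19, 1  # one-plane encoder assumed

model = Sequential([
    Conv2D(48, (3, 3), activation='relu', padding='same',
           data_format='channels_last',
           input_shape=(board_rows, board_cols, num_planes)),
    Conv2D(48, (3, 3), activation='relu', padding='same',
           data_format='channels_last'),
    Flatten(),
    Dense(board_rows * board_cols, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])
```

If the encoded feature arrays still carry the plane axis first, they also need to be transposed to channels-last, e.g. with np.moveaxis(features, 1, -1) for a batch shaped (num_samples, planes, 19, 19).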

natekester commented 3 years ago

Hi Kevin, That did it! Super fast response. I appreciate it. Great content btw.

natekester commented 3 years ago

Hi Kevin,

In attempting to replicate the model training, I ran the section 7.3 code (with the change to channels_last) with the small network layers, and I keep getting results similar to the following:

Epoch 1/5
2672/2672 [==============================] - 94s 35ms/step - loss: 5.8858 - accuracy: 0.0041 - val_loss: 5.8737 - val_accuracy: 0.0041
Epoch 2/5
2672/2672 [==============================] - 96s 36ms/step - loss: 5.8662 - accuracy: 0.0039 - val_loss: 5.8437 - val_accuracy: 0.0041
Epoch 3/5
2672/2672 [==============================] - 113s 42ms/step - loss: 5.8413 - accuracy: 0.0039 - val_loss: 5.8327 - val_accuracy: 0.0043
Epoch 4/5
2672/2672 [==============================] - 103s 38ms/step - loss: 5.8230 - accuracy: 0.0042 - val_loss: 5.7696 - val_accuracy: 0.0039
Epoch 5/5
2672/2672 [==============================] - 103s 39ms/step - loss: 5.7610 - accuracy: 0.0046 - val_loss: 5.7244 - val_accuracy: 0.0057

I noticed that the step counts are very different, i.e. each epoch shows 2672/2672 instead of 12288/12288. Is that a random factor of the 100 games (num_games) it selects?

How would I go about getting the accuracy seen in the book?
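
For context on the step count: per the fit_generator call shown in the traceback above, steps_per_epoch = get_num_samples() / batch_size, so the number shown per epoch just tracks how many positions were extracted from the sampled games. A back-of-the-envelope sketch, assuming the chapter's batch_size of 128:

```python
batch_size = 128                              # assumed value from chapter 7
steps_per_epoch = 2672                        # from the log above
num_positions = steps_per_epoch * batch_size  # ~342,000 positions
# 12288 steps would correspond to roughly 1.57 million positions, so the gap
# mostly reflects how many positions came out of the particular games sampled.
```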

Nkonovalenko commented 3 years ago

Hi Kevin, I would like to bump this issue. I've changed channels_first to channels_last, but with num_games=100 and epochs=5 I only get an accuracy of 0.004. Do you have any recommendations for which files to look through? I'm guessing this is due to a typo on my part, but my processor, parallel_processor, and small files are all unchanged.

macfergus commented 3 years ago

Hello @Nkonovalenko, please see this writeup here: https://kferg.dev/posts/2021/deep-learning-and-the-game-of-go-training-results-from-chapter-7/

Hopefully that gets you unblocked!

constant5 commented 3 years ago

I am trying to get this to run on Colab with a TPU, but unfortunately the generator in the code base is not compatible with distribution across the TPU cluster. I solved this by just loading the dataset with generator=False. My problem is that the network is quickly overfitting. I guess increasing the number of games should help with this?
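
For context, a hedged sketch of the non-generator path, assuming the parallel processor's load_go_data accepts a use_generator flag and returns the consolidated arrays (check the repo for the exact signature):

```python
from dlgo.data.parallel_processor import GoDataProcessor

num_games = 1000  # illustrative value

# Hedged sketch: parameter names and return values may differ from the
# repo's current code. The idea is to materialize the full feature/label
# arrays in memory instead of streaming them from the generator.
processor = GoDataProcessor()
X_train, y_train = processor.load_go_data('train', num_games, use_generator=False)
X_test, y_test = processor.load_go_data('test', num_games, use_generator=False)
```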

Nkonovalenko commented 3 years ago

> Hello @Nkonovalenko, please see this writeup here: https://kferg.dev/posts/2021/deep-learning-and-the-game-of-go-training-results-from-chapter-7/
>
> Hopefully that gets you unblocked!

Thank you so much, it did!

macfergus commented 3 years ago

@constant5 The generator version creates large temporary files on disk, so I suspect that's why it won't work with Colab (just guessing, though).

As for the overfitting, more games is a good idea. I'd say around 10,000 games is the minimum to train a network that is useful for actual game play, and more is better. I'm not sure what the memory constraints are in Colab, but you may have to modify the code to chunk it up yourself.
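
As an illustration of the chunking idea, a hypothetical training loop over pre-saved .npy chunks; the file names and chunk layout are made up for the example, and only one chunk has to fit in memory at a time:

```python
import numpy as np

def train_in_chunks(model, num_chunks, epochs=5, batch_size=128):
    """Hypothetical helper: one pass per chunk, repeated for each epoch."""
    for epoch in range(epochs):
        for i in range(num_chunks):
            # Each chunk pair is assumed to have been saved ahead of time,
            # e.g. by splitting the consolidated feature/label arrays.
            X = np.load(f'data/chunks/features_{i}.npy')
            y = np.load(f'data/chunks/labels_{i}.npy')
            model.fit(X, y, batch_size=batch_size, epochs=1, verbose=1)
```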

constant5 commented 3 years ago

This may not have been the most efficient way to do it, but after I wrote the consolidated NumPy files to disk, I rewrote the data to TFRecords:

```python
import numpy as np
import tensorflow as tf

X_train = np.load('data/train_features.npy', mmap_mode='c')
y_train = np.load('data/train_labels.npy', mmap_mode='c')

X_test = np.load('data/test_features.npy', mmap_mode='c')
y_test = np.load('data/test_labels.npy', mmap_mode='c')

def int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def create_example(board, move):
    # create_example was not shown in the original comment; this is a
    # plausible reconstruction using the feature keys the parser below expects.
    feature = {
        "go_board": int64_feature(board),
        "move": int64_feature(move),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

def tf_record_save(X, y, type='train', num_samples=9920):
    """Writes a NumPy array to TFRecord files of ~100MB each."""
    num_tfrecords = len(X) // num_samples

    if num_tfrecords == 0:  # for arrays smaller than the default num_samples
        num_samples = len(X)
        num_tfrecords = 1

    for tfrec_num in range(num_tfrecords):
        features = X[(tfrec_num * num_samples):((tfrec_num + 1) * num_samples)]
        labels = y[(tfrec_num * num_samples):((tfrec_num + 1) * num_samples)]
        fname = f"data/tf_records/{type}_file_{tfrec_num}-{num_tfrecords}.tfrec"
        with tf.io.TFRecordWriter(fname) as writer:
            print('Writing ', fname, '...')
            # Use distinct loop variables so the outer X and y are not clobbered
            # between TFRecord files.
            for board, move in zip(features, labels):
                board = np.array(board).flatten().astype(int)
                move = np.array(move).flatten().astype(int)
                example = create_example(board, move)
                writer.write(example.SerializeToString())

tf_record_save(X_train, y_train, type='train')
tf_record_save(X_test, y_test, type='test')
```

Then I created a tf.data input function that reads the TFRecords:

```python
import tensorflow as tf

def data_input_fn(filenames, batch_size=1024):

    def _parse_tfrecord_fn(example):
        feature_description = {
            "go_board": tf.io.FixedLenFeature((19 * 19,), tf.int64),
            "move": tf.io.FixedLenFeature((19 * 19,), tf.int64),
        }
        example = tf.io.parse_single_example(example, feature_description)
        return example

    def _prepare_sample(features):
        # Reshape the flattened board back to channels-last image format and
        # keep the move label as a flat one-hot vector over the 361 points.
        X = tf.reshape(features["go_board"], (19, 19, 1))
        y = tf.reshape(features["move"], (19 * 19,))
        return X, y

    def get_dataset(filenames, batch_size):
        AUTOTUNE = tf.data.experimental.AUTOTUNE
        dataset = (
            tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE)
            .map(_parse_tfrecord_fn, num_parallel_calls=AUTOTUNE)
            .map(_prepare_sample, num_parallel_calls=AUTOTUNE)
            .shuffle(batch_size * 10)
            .batch(batch_size)
            .prefetch(AUTOTUNE)
        )
        return dataset.cache()

    return get_dataset(filenames, batch_size)

# train_list / test_list were not defined in the original comment; glob the
# files written by tf_record_save above.
train_list = tf.io.gfile.glob('data/tf_records/train_file_*.tfrec')
test_list = tf.io.gfile.glob('data/tf_records/test_file_*.tfrec')

train_data = data_input_fn(train_list)
test_data = data_input_fn(test_list)
```
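
The resulting datasets can then be handed straight to Keras; a minimal, hypothetical call (the model and epoch count are placeholders, not part of the original comment):

```python
# Hypothetical usage -- assumes `model` is the compiled network from chapter 7.
model.fit(train_data, validation_data=test_data, epochs=5)
```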

This works well for Colab and GPU training, but not for TPU, because the TPU does not support local file sharding.

Aquietzero commented 8 months ago

> Hello @Nkonovalenko, please see this writeup here: https://kferg.dev/posts/2021/deep-learning-and-the-game-of-go-training-results-from-chapter-7/
>
> Hopefully that gets you unblocked!

Hi, I read the writeup and set num_games = 1000 and epochs = 50, but still can't get the expected accuracy. The first several epochs show a slow improvement, but after about 20 epochs the loss increases until the end. It's hard to figure out what is causing the problem. Some of the training logs are below.

Epoch 1/50
1480/1480 [==============================] - 98s 66ms/step - loss: 5.8798 - accuracy: 0.0032 - val_loss: 5.8580 - val_accuracy: 0.0041
Epoch 2/50
1480/1480 [==============================] - 98s 66ms/step - loss: 5.8061 - accuracy: 0.0042 - val_loss: 5.7306 - val_accuracy: 0.0057
Epoch 3/50
1480/1480 [==============================] - 99s 67ms/step - loss: 5.6749 - accuracy: 0.0067 - val_loss: 5.6105 - val_accuracy: 0.0081
Epoch 4/50
1480/1480 [==============================] - 98s 66ms/step - loss: 5.5971 - accuracy: 0.0081 - val_loss: 5.5554 - val_accuracy: 0.0098
Epoch 5/50
1480/1480 [==============================] - 98s 66ms/step - loss: 5.5572 - accuracy: 0.0097 - val_loss: 5.5232 - val_accuracy: 0.0107
...
...
Epoch 46/50
1480/1480 [==============================] - 101s 68ms/step - loss: 19.7151 - accuracy: 0.0419 - val_loss: 17.9388 - val_accuracy: 0.0496
Epoch 47/50
1480/1480 [==============================] - 99s 67ms/step - loss: 21.2528 - accuracy: 0.0432 - val_loss: 16.9957 - val_accuracy: 0.0521
Epoch 48/50
1480/1480 [==============================] - 102s 69ms/step - loss: 21.7980 - accuracy: 0.0436 - val_loss: 13.3183 - val_accuracy: 0.0501
Epoch 49/50
1480/1480 [==============================] - 104s 70ms/step - loss: 21.2861 - accuracy: 0.0451 - val_loss: 15.7350 - val_accuracy: 0.0518
Epoch 50/50
1480/1480 [==============================] - 99s 67ms/step - loss: 24.7583 - accuracy: 0.0452 - val_loss: 13.1013 - val_accuracy: 0.0525

Though reproducing the result is not a blocker for reading further in the book, I'd still like to get a similar result as a checkpoint. Any hints on what to check?

@Nkonovalenko you mentioned that you got it working. Did you reproduce the result after changing only num_games and epochs?