
Keras' Theano backend runs over 10X slower with batch normalization #2

Open gowthamkpr opened 4 years ago

gowthamkpr commented 4 years ago

This turned out to be a longer write-up than I anticipated. The main points are:

1. Keras' Theano backend runs over 10X slower with batch normalization.
2. This issue does not exist with a Tensorflow backend.
3. Issue #1309 seems to say the problem is fixed, though in my experience it persists.
4. I don't know whether my problem stems from:
   a. Theano's implementation of batch normalization,
   b. Keras' use of Theano's batch normalization procedures, or
   c. my use of Keras' use of Theano's batch normalization procedures.
   (A minimal sketch for isolating this follows the version listing below.)

I'm currently using fairly recent versions of Keras, Theano and cuDNN:

```
Using Theano backend.
Using cuDNN version 7104 on context None
Mapped name None to device cuda: GeForce GTX 1080 with Max-Q Design (0000:01:00.0)
```

```
keras.__version__   '2.2.4'
theano.__version__  u'1.0.3'
```
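To start separating possibilities (a)-(c), one quick check is whether a single BatchNormalization layer already shows the gap, independent of the ResNet structure. Below is a minimal micro-benchmark sketch (not part of the report itself; the layer sizes, sample count and batch size are arbitrary illustration choices):

```python
# Minimal isolation sketch: time one epoch of a single-conv model with and
# without BatchNormalization on random data, under the current backend.
import time

import numpy as np
from keras.layers import Activation, BatchNormalization, Conv2D, Dense, Flatten, Input
from keras.models import Model


def tiny_model(batch_norm):
    inputs = Input(shape=(32, 32, 3))
    x = Conv2D(16, 3, padding='same')(inputs)
    if batch_norm:
        x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Flatten()(x)
    outputs = Dense(10, activation='softmax')(x)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model


# Random data in place of CIFAR10; sizes are arbitrary.
x = np.random.rand(4096, 32, 32, 3).astype('float32')
y = np.eye(10, dtype='float32')[np.random.randint(0, 10, size=4096)]

for bn in (False, True):
    model = tiny_model(bn)
    model.fit(x, y, batch_size=32, epochs=1, verbose=0)   # warm-up / compile
    start = time.time()
    model.fit(x, y, batch_size=32, epochs=1, verbose=0)
    print('batch_norm=%s: %.1fs per epoch' % (bn, time.time() - start))
```

If the gap already appears here, the full ResNet script below is not needed to reproduce it.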

When I run the following modified/simplified version of keras/examples/cifar10_resnet.py, I get a significant slowdown when batch normalization is used. The code is:

"""Adapted from cifar10_resnet.py"""

from future import print_function import argparse import keras from keras.layers import Dense, Conv2D, BatchNormalization, Activation from keras.layers import AveragePooling2D, Input, Flatten from keras.optimizers import Adam from keras.callbacks import ModelCheckpoint, LearningRateScheduler from keras.callbacks import ReduceLROnPlateau from keras.preprocessing.image import ImageDataGenerator from keras.regularizers import l2 from keras import backend as K from keras.models import Model from keras.datasets import cifar10 import numpy as np import os import pdb

def get_data(): """ Loads CIFAR10 Data and converts to numpy arrays for net"""

Load the CIFAR10 data.

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

Normalize data.

x_train = x_train.astype('float32') / 255 x_test = x_test.astype('float32') / 255

Convert class vectors to binary class matrices.

num_classes = 10 y_train = keras.utils.to_categorical(y_train, num_classes) y_test = keras.utils.to_categorical(y_test, num_classes)

return (x_train, x_test, y_train, y_test) def resnet_layer(inputs, batch_norm): """2D Convolution-Batch Normalization-Activation stack builder

Arguments

inputs (tensor): input tensor from input image or previous layer
batch_norm (bool): whether to include batch normalization

Returns

x (tensor): tensor as input to the next layer

""" conv = Conv2D(16, kernel_size=3, strides=1, padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(1e-4))

x = inputs x = conv(x) if batch_norm: x = BatchNormalization()(x) x = Activation('relu')(x)

return x def resnet_v1(input_shape, batch_norm): """ResNet Version 1 Model builder [a]

Arguments

input_shape (tensor): shape of input image tensor
batch_norm (bool): whether to include batch normalization

Returns

model (Model): Keras model instance

"""

Start model definition.

inputs = Input(shape=input_shape) x = resnet_layer(inputs, batch_norm)

Instantiate the stack of residual units

for res_block in range(3): y = resnet_layer(x, batch_norm) x = keras.layers.add([x, y]) x = Activation('relu')(x)

Add classifier on top.

v1 does not use BN after last shortcut connection-ReLU

x = AveragePooling2D(pool_size=8)(x) y = Flatten()(x) outputs = Dense(10, activation='softmax', kernel_initializer='he_normal')(y)

Instantiate model.

model = Model(inputs=inputs, outputs=outputs) return model if name == 'main': ap = argparse.ArgumentParser() ap.add_argument("--bn", action='store_true', help="batch_normalization flag")

args = ap.parse_args() batch_norm_flag = args.bn

Get CIFAR10 data

x_train, x_test, y_train, y_test = get_data()

Build net

input_shape = x_train.shape[1:] model = resnet_v1(input_shape, batch_norm_flag) model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-3), metrics=['accuracy'])

model.summary()

Run training, without data augmentation.

model.fit(x_train, y_train, batch_size=32, epochs=1, validation_data=(x_test, y_test), shuffle=True)

Score trained model.

scores = model.evaluate(x_test, y_test, verbose=1) print('Test loss:', scores[0]) print('Test accuracy:', scores[1]) The different (with respect to speed) results I get are as follows:

WITH Batchnorm:

```
$ python cifar10_resnet_batchnorm_test.py --bn
Using Theano backend.
Using cuDNN version 7104 on context None
Mapped name None to device cuda: GeForce GTX 1080 with Max-Q Design (0000:01:00.0)
Train on 50000 samples, validate on 10000 samples
Epoch 1/1
50000/50000 [==============================] - 82s 2ms/step - loss: 1.7749 - acc: 0.3629 - val_loss: 1.5035 - val_acc: 0.4697
10000/10000 [==============================] - 4s 383us/step
Test loss: 1.5035479030609131
Test accuracy: 0.4697
```

WITHOUT Batchnorm:

```
$ python cifar10_resnet_batchnorm_test.py
Using Theano backend.
Using cuDNN version 7104 on context None
Mapped name None to device cuda: GeForce GTX 1080 with Max-Q Design (0000:01:00.0)
Train on 50000 samples, validate on 10000 samples
Epoch 1/1
50000/50000 [==============================] - 7s 132us/step - loss: 1.7122 - acc: 0.3863 - val_loss: 1.4815 - val_acc: 0.4750
10000/10000 [==============================] - 0s 27us/step
Test loss: 1.481479389190674
Test accuracy: 0.475
```

That is a slowdown of more than a factor of 10 (82s vs. 7s per epoch).
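One experiment that might narrow this down, which I have not run yet (so take the flag names as an assumption based on my reading of the Theano configuration docs), would be to disable cuDNN entirely and retime the same script. If the gap persists without cuDNN, the cuDNN batch-norm kernels are probably not the culprit:

```python
# Hypothetical follow-up experiment: disable cuDNN via THEANO_FLAGS and rerun
# the same training. The flags must be set before Theano is first imported
# (i.e. before `import keras` with the Theano backend).
import os
os.environ['THEANO_FLAGS'] = 'device=cuda,floatX=float32,dnn.enabled=False'

import keras  # must come after the flags so Theano picks them up
# ... then build and fit the model exactly as in cifar10_resnet_batchnorm_test.py
```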

If I use Tensorflow, the timings are:

WITH Batch Norm:

```
50000/50000 [==============================] - 10s 204us/step - loss: 1.5723 - acc: 0.4416 - val_loss: 1.3853 - val_acc: 0.5129
10000/10000 [==============================] - 1s 57us/step
Test loss: 1.3852726093292236
Test accuracy: 0.5129
```

WITHOUT Batch Norm:

```
50000/50000 [==============================] - 10s 194us/step - loss: 1.7507 - acc: 0.3728 - val_loss: 1.5280 - val_acc: 0.4503
10000/10000 [==============================] - 1s 53us/step
Test loss: 1.527961152267456
Test accuracy: 0.4503
```

So, the issue seems to be the use of batch norm with Theano. It's also puzzling that Theano does slightly worse (-0.5%) with batch norm, while Tensorflow does noticeably better with batch norm (+6%), but the two backends converge with more training epochs.

I note that issue #1309 seems to be about this same problem and seems to regard it as solved, as of Feb. 14, 2017. Yet, I'm still having this problem. Is it something I'm doing, an issue with Keras' interfacing with Theano, or an issue with Theano's implementation of batch normalization?
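Another way to get more information, which I have not tried yet, would be to inspect the compiled Theano graph and check whether a cuDNN batch-norm op is actually present. The sketch below leans on Keras 2.2.x internals (`_make_train_function` and the Theano backend's `Function.function` attribute), so treat those names as assumptions:

```python
# Sketch of a graph inspection: build the model with batch norm, force the
# training function to be compiled, and print the Theano graph. If no
# GpuDnnBatchNorm* node shows up, Theano fell back to a non-cuDNN batch-norm
# implementation, which would point away from cuDNN itself.
# Assumes resnet_v1 from the script above is in scope (e.g. same file).
import theano

model = resnet_v1(input_shape=(32, 32, 3), batch_norm=True)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model._make_train_function()                      # private Keras API
theano.printing.debugprint(model.train_function.function)
```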

(Note: This is a distillation of an issue I raised in #12173, where I noted that Tensorflow does not experience this slowdown. But I closed that issue and opened this one, since I am now able to state more precisely what the issue is. I hope this is the correct protocol for redefining an issue after further study.)

rajesht2418 commented 4 years ago

s
