Setting `padding='same'` in MaxPooling2D/AveragePooling2D produces inconsistent results across backends, especially for different `strides` values. #13841
System information
Have I written custom code (as opposed to using example directory):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 & Linux Ubuntu 18.04
Tensorflow backend (yes / no): yes
Tensorflow version: 1.15.0
Cntk version: 2.7
Theano version: 1.0.4
Keras version: 2.3.1
Python version: 3.6.9
CUDA/cuDNN version: -
GPU model and memory: -
Describe the current behavior
With `padding='same'` in MaxPooling2D and AveragePooling2D, the outputs differ noticeably between backends for various combinations of `pool_size` and `strides`, given the same layer input and weights. For example, with `pool_size=2` and `strides=1` in AveragePooling2D, Theano produces very different output from CNTK and Tensorflow.
To reduce the impact of randomness, we repeated the test several times under the same configuration and got similar results. Using AveragePooling2D as an example, the tables below show the output inconsistencies from one of these trials. We use the sum of absolute differences (SoAD) as the metric (see the SoAD implementation in the code snippet below).
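For two backend outputs $A$ and $B$ of identical shape, the metric can be written as follows. This is only a restatement of the `acc_abs_diff` function in the reproduction code, with $i$ ranging over all tensor elements:

$$
\mathrm{SoAD}(A, B) = \sum_{i} \lvert A_i - B_i \rvert
$$

A value of 0 means the two backends produced identical outputs.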
1) `pool_size=1` on AveragePooling2D

| strides | Tensorflow-CNTK | Tensorflow-Theano | CNTK-Theano |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 3486.9348074873396 | 0 | 3486.9348074873396 |
| 5 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 |
| 7 | 1352.1914413716454 | 0 | 1352.1914413716454 |
| 8 | 894.2556553477407 | 0 | 894.2556553477407 |
| 9 | 872.0831933177874 | 0 | 872.0831933177874 |
| 10 | 0 | 0 | 0 |
2) `pool_size=2` on AveragePooling2D

| strides | Tensorflow-CNTK | Tensorflow-Theano | CNTK-Theano |
|---|---|---|---|
| 1 | 0 | 23810.60574585 | 23810.60574585 |
| 2 | 0 | 6086.89803055 | 6086.89803055 |
| 3 | 0 | 2845.84598675 | 2845.84598675 |
| 4 | 1406.43844484 | 1562.48328647 | 1818.60488969 |
| 5 | 0 | 1177.9917318 | 1177.9917318 |
| 6 | 0 | 881.51344889 | 881.51344889 |
| 7 | 575.09660981 | 594.96606259 | 747.960241 |
| 8 | 426.97600164 | 413.47323987 | 490.51357441 |
| 9 | 408.0700912 | 353.56068743 | 433.0762342 |
| 10 | 0 | 404.73061594 | 404.73061594 |
3) `pool_size=3` on AveragePooling2D

| strides | Tensorflow-CNTK | Tensorflow-Theano | CNTK-Theano |
|---|---|---|---|
| 1 | 0 | 4.28967972254668e-12 | 4.28967972254668e-12 |
| 2 | 3341.4175014497478 | 3341.417501449748 | 1.0436096431476471e-12 |
| 3 | 1591.7682246308545 | 1591.7682246308545 | 4.551914400963142e-13 |
| 4 | 0 | 846.7093040097947 | 846.7093040097947 |
| 5 | 659.33280348679 | 659.3328034867901 | 1.8252066524837574e-13 |
| 6 | 475.1066414369499 | 475.1066414369499 | 1.021405182655144e-13 |
| 7 | 0 | 350.878200849514 | 350.878200849514 |
| 8 | 277.9826966410767 | 211.56357127315164 | 304.70752801483616 |
| 9 | 214.17734053203276 | 201.9044476454168 | 278.40342817756334 |
| 10 | 219.61688923217574 | 219.61688923217577 | 5.062616992290714e-14 |
From the results above, with `pool_size=1` CNTK behaves differently from Tensorflow and Theano in some scenarios. With `pool_size=2` or `pool_size=3`, more diverse inconsistencies appear across all three backends.
We ran the same test on MaxPooling2D and got similar results. The following table shows only the case `pool_size=1`.
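4) `pool_size=1` on MaxPooling2D

| strides | Tensorflow-CNTK | CNTK-Theano | Tensorflow-Theano |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 3409 | 3409 | 0 |
| 5 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 |
| 7 | 1199 | 1199 | 0 |
| 8 | 862 | 862 | 0 |
| 9 | 841 | 841 | 0 |
| 10 | 0 | 0 | 0 |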
Key insights
Note that these issues occur only with `padding='same'`. With `padding='valid'`, all backends generate the same outputs. We therefore think the three backends implement `padding='same'` differently, especially in how it interacts with the `strides` option.
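One way to see how `strides` can interact with `padding='same'` is to work through the padding arithmetic for a single spatial dimension. The sketch below uses the output-size and padding rule that TensorFlow documents for SAME padding; the helper names `tf_same_padding` and `unused_trailing` are illustrative and not part of any backend. It also counts how many trailing input positions no pooling window reaches under that rule. For the 32-wide input used in the tests and `pool_size=1`, that count is 2 or more exactly at strides 4, 7, 8 and 9, which are the strides where the Tensorflow-CNTK columns in the `pool_size=1` tables are non-zero. This is only an observation about the numbers, not a confirmed description of CNTK or Theano internals.

```python
import math


def tf_same_padding(in_size, pool_size, stride):
    """Output size and (before, after) padding under TensorFlow's SAME rule."""
    out_size = math.ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + pool_size - in_size, 0)
    pad_before = pad_total // 2          # any odd leftover cell goes at the end
    pad_after = pad_total - pad_before
    return out_size, pad_before, pad_after


def unused_trailing(in_size, pool_size, stride):
    """Trailing input positions that no window reaches under the rule above."""
    out_size, _, _ = tf_same_padding(in_size, pool_size, stride)
    covered = (out_size - 1) * stride + pool_size
    return max(in_size - covered, 0)


if __name__ == "__main__":
    # Same setup as the tests: a 32-wide input and pool_size=1.
    for stride in range(1, 11):
        out_size, before, after = tf_same_padding(32, 1, stride)
        slack = unused_trailing(32, 1, stride)
        print(f"stride={stride:2d} out={out_size:2d} "
              f"pad=({before},{after}) unused={slack}")
```

A backend that centres its windows inside the unused span, instead of anchoring them at index 0, would read different input positions in exactly these cases.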
Performance issue in applications
Worse, this implementation gap can cause severe accuracy problems in practice. For example, we trained a simple model (LeNet-5) on MNIST with Tensorflow as the backend, then switched between the three backends for prediction. Slightly changing the `strides` value in the MaxPooling2D layer produces an obvious gap in prediction accuracy. Below are the configurations, prediction results, and model architecture for this issue.
1) Configurations

| Case | Configurations |
|---|---|
| Case0 | pool_size=(2,2), strides=(2,2), padding='valid' |
| Case1 | pool_size=(2,2), strides=(2,2), padding='same' |
| Case2 | pool_size=(2,2), strides=(3,3), padding='same' |
| Case3 | pool_size=(2,2), strides=(4,4), padding='same' |
| Case4 | pool_size=(2,2), strides=(3,3), padding='valid' |
2) Prediction accuracy (%)

| Case | Tensorflow | CNTK | Theano |
|---|---|---|---|
| Case0 | 99.18 | 99.18 | 99.18 |
| Case1 | 99.02 | 99.02 | 85.17 |
| Case2 | 99.04 | 99.04 | 83.57 |
| Case3 | 98.83 | 70.23 | 82.83 |
| Case4 | 99.00 | 99.00 | 99.00 |
3) Model architecture
from keras.layers import Input, Conv2D, Dense, Flatten, MaxPooling2D, Activation
from keras.regularizers import l2
from keras.models import Model

class model:
    def __init__(self, input_shape, cls_num=10):
        self.name = 'LeNet-5'
        self.input_shape = input_shape
        self.cls_num = cls_num
        self.kernel_size = (5,5)
        self.weight_decay = 0.0001

    def build_model(self):
        input = Input(shape=self.input_shape)
        x = Conv2D(6, self.kernel_size, padding='valid', activation='relu', kernel_initializer='he_normal',
                   kernel_regularizer=l2(self.weight_decay))(input)
        x = MaxPooling2D(pool_size=(2, 2))(x)
        x = Conv2D(16, self.kernel_size, padding='valid', activation='relu', kernel_initializer='he_normal',
                   kernel_regularizer=l2(self.weight_decay))(x)
        x = MaxPooling2D((2, 2), strides=(2, 2), padding='valid')(x) # Case0
        #x = MaxPooling2D((2, 2), strides=(2, 2), padding='same')(x) # Case1
        #x = MaxPooling2D((2, 2), strides=(3, 3), padding='same')(x) # Case2
        #x = MaxPooling2D((2, 2), strides=(4, 4), padding='same')(x) # Case3
        #x = MaxPooling2D((2, 2), strides=(3, 3), padding='valid')(x) # Case4
        x = Flatten()(x)
        x = Dense(120, activation='relu', kernel_initializer='he_normal', kernel_regularizer=l2(self.weight_decay))(x)
        x = Dense(84, activation='relu', kernel_initializer='he_normal', kernel_regularizer=l2(self.weight_decay))(x)
        x = Dense(self.cls_num, kernel_initializer='he_normal', kernel_regularizer=l2(self.weight_decay))(x)
        x = Activation('softmax')(x)
        lenet5 = Model(input, x)
        return lenet5
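To make the five cases concrete, the sketch below computes the feature-map size the second pooling layer would produce under the output-size formulas TensorFlow documents for `'valid'` and `'same'`. It assumes 28x28 MNIST inputs, so that layer receives an 8x8 map (24 after the first 5x5 convolution, 12 after the first pooling, 8 after the second convolution). `pool_output`, `second_pool_input_size` and `CASES` are illustrative names, and other backends may pad differently, which is the subject of this report.

```python
import math


def pool_output(in_size, pool_size, stride, padding):
    """Return (output size, total padding) along one dimension, TensorFlow convention."""
    if padding == "valid":
        return (in_size - pool_size) // stride + 1, 0
    if padding == "same":
        out_size = math.ceil(in_size / stride)
        pad_total = max((out_size - 1) * stride + pool_size - in_size, 0)
        return out_size, pad_total
    raise ValueError(f"unknown padding: {padding!r}")


def second_pool_input_size(image_size=28, kernel=5):
    """Spatial size reaching the second MaxPooling2D layer (assumes 28x28 MNIST)."""
    after_conv1 = image_size - kernel + 1                      # 'valid' convolution
    after_pool1, _ = pool_output(after_conv1, 2, 2, "valid")   # Keras default strides = pool_size
    return after_pool1 - kernel + 1                            # second 'valid' convolution


# (pool_size, stride, padding) of the second MaxPooling2D layer in each case
CASES = {
    "Case0": (2, 2, "valid"),
    "Case1": (2, 2, "same"),
    "Case2": (2, 3, "same"),
    "Case3": (2, 4, "same"),
    "Case4": (2, 3, "valid"),
}

if __name__ == "__main__":
    size = second_pool_input_size()
    for name, (pool, stride, padding) in CASES.items():
        out_size, pad_total = pool_output(size, pool, stride, padding)
        print(f"{name}: input={size} output={out_size} total_padding={pad_total}")
```

Under this convention none of the `'same'` cases requires any padding on an 8x8 input, so on TensorFlow Case1 pools the same windows as Case0.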
Code to reproduce the issue
import os
import importlib
import numpy as np
import keras.layers as L
import keras.backend as K
from keras.models import load_model
from keras.engine import Model, Input

backends = ['cntk', 'tensorflow', 'theano']

# SoAD, Sum of Absolute Difference
def acc_abs_diff(output1, output2):
    assert output1.shape == output2.shape
    abs_diff = np.abs(output1 - output2)
    return np.sum(abs_diff)

# Switch the Keras backend at runtime by reloading the backend modules
def set_keras_backend(backend='tensorflow'):
    if K.backend() != backend:
        os.environ['KERAS_BACKEND'] = backend
        importlib.reload(K.load_backend)
        importlib.reload(K)
        assert K.backend() == backend

kwargs = {'pool_size': 1, 'padding': 'same', 'strides': 1, 'data_format': 'channels_last'}
listdiff = []
for stride_value in range(10):
    kwargs['strides'] = stride_value + 1
    input_data = (10 * np.random.random((1, 32, 32, 16)))
    input = input_data.astype('float32')

    # Build and save the single-layer model with the Tensorflow backend
    set_keras_backend('tensorflow')
    layer = L.pooling.MaxPool2D(**kwargs)  # you can also use AveragePooling2D
    x = Input(batch_shape=input.shape)
    y = layer(x)
    bk_model = Model(x, y)
    model_path = os.path.join(yourpath, 'model.h5')  # yourpath: set to a writable directory
    bk_model.save(model_path)

    # Reload the same model under each backend and run the same input
    output = {}
    for bk in backends:
        try:
            set_keras_backend(backend=bk)
            model = load_model(model_path)
            output[bk] = model.predict(input)
        except Exception:
            print('error result')

    # A missing backend output yields None for the affected pairs
    try:
        diff1 = acc_abs_diff(output['tensorflow'], output['cntk'])
    except Exception:
        diff1 = None
    try:
        diff2 = acc_abs_diff(output['theano'], output['cntk'])
    except Exception:
        diff2 = None
    try:
        diff3 = acc_abs_diff(output['theano'], output['tensorflow'])
    except Exception:
        diff3 = None
    listdiff.append([diff1, diff2, diff3])

# Columns: Tensorflow-CNTK, CNTK-Theano, Tensorflow-Theano; one row per stride 1..10
arraydiff = np.array(listdiff)
print(arraydiff)
print('finish')