When I set `padding='same'` in MaxPooling2D/AveragePooling2D, different parameter configurations produce inconsistent results across backends, especially for different `strides` values.

System information

Have I written custom code (as opposed to using example directory):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 & Linux Ubuntu 18.04
Tensorflow backend (yes / no): yes
Tensorflow version: 1.15.0
Cntk version: 2.7
Theano version: 1.0.4
Keras version: 2.3.1
Python version: 3.6.9
CUDA/cuDNN version: -
GPU model and memory: -

Describe the current behavior

When setting padding='same' in MaxPooling2D and AveragePooling2D, we observe clear output inconsistencies between backends under different combinations of pool_size and strides, given the same layer input and weights. For example, with pool_size=2 and strides=1 in AveragePooling2D, Theano behaves quite differently from CNTK and Tensorflow.

To reduce the impact of randomness, we repeated the tests several times under the same configuration and got similar results. Using AveragePooling2D as an example, the tables below show the output inconsistencies from one of these trials. We use the sum of absolute differences (SoAD) as the metric (see the SoAD implementation in the code snippet below).

1) pool_size=1 on AveragePooling2D

strides	Tensorflow-CNTK	CNTK-Theano
1	0	0
2	0	0
3	0	0
4	3486.9348074873396	3486.9348074873396
5	0	0
6	0	0
7	1352.1914413716454	1352.1914413716454
8	894.2556553477407	894.2556553477407
9	872.0831933177874	872.0831933177874
10	0	0

2) pool_size=2 on AveragePooling2D

strides	Tensorflow-CNTK	Tensorflow-Theano	CNTK-Theano
1	0	23810.60574585	23810.60574585
2	0	6086.89803055	6086.89803055
3	0	2845.84598675	2845.84598675
4	1406.43844484	1562.48328647	1818.60488969
5	0	1177.9917318	1177.9917318
6	0	881.51344889	881.51344889
7	575.09660981	594.96606259	747.960241
8	426.97600164	413.47323987	490.51357441
9	408.0700912	353.56068743	433.0762342
10	0	404.73061594	404.73061594

3) pool_size=3 on AveragePooling2D

strides	Tensorflow-CNTK	Tensorflow-Theano	CNTK-Theano
1	0	4.28967972254668e-12	4.28967972254668e-12
2	3341.4175014497478	3341.417501449748	1.0436096431476471e-12
3	1591.7682246308545	1591.7682246308545	4.551914400963142e-13
4	0	846.7093040097947	846.7093040097947
5	659.33280348679	659.3328034867901	1.8252066524837574e-13
6	475.1066414369499	475.1066414369499	1.021405182655144e-13
7	0	350.878200849514	350.878200849514
8	277.9826966410767	211.56357127315164	304.70752801483616
9	214.17734053203276	201.9044476454168	278.40342817756334
10	219.61688923217574	219.61688923217577	5.062616992290714e-14

From the above results we can see that, when setting pool_size=1, CNTK behaves differently from Tensorflow and Theano in some scenarios. When setting pool_size=2 or pool_size=3, more diverse inconsistencies are triggered across all 3 backends.

We also ran the same test on MaxPooling2D and got similar results. The following table only shows the case where pool_size=1.

4) pool_size=1 on MaxPooling2D

strides	Tensorflow-CNTK	CNTK-Theano
1	0	0
2	0	0
3	0	0
4	3409	3409
5	0	0
6	0	0
7	1199	1199
8	862	862
9	841	841
10	0	0

Key insights

Note that the above issues only occur when padding='same'. When setting padding='valid', all backends generate the same outputs. Based on that, we think the three backends implement padding='same' differently, especially in how they handle the strides option.
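To make the padding question concrete, here is a minimal sketch of the 'SAME' padding arithmetic as documented for TensorFlow, applied to one spatial dimension. The helper name `tf_same_padding` is hypothetical, and the sketch says nothing about what CNTK or Theano actually do internally; it only gives a baseline to compare against.

```python
import math

def tf_same_padding(in_size, pool_size, stride):
    """Return (out_size, pad_before, pad_after) for one spatial dimension,
    following TensorFlow's documented 'SAME' padding rule."""
    out_size = math.ceil(in_size / stride)
    total_pad = max((out_size - 1) * stride + pool_size - in_size, 0)
    pad_before = total_pad // 2          # the extra element, if any, goes after
    pad_after = total_pad - pad_before
    return out_size, pad_before, pad_after

# Input size 32 matches the (1, 32, 32, 16) input used in the reproduction code.
for pool_size in (1, 2, 3):
    for stride in (1, 2, 4):
        print(pool_size, stride, tf_same_padding(32, pool_size, stride))
```

Under this rule pool_size=1 never adds padding, which suggests that the pool_size=1 differences come from where each backend places its windows rather than from padded values.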

Performance issue in applications

Even worse, the above implementation gap may cause severe performance issues in practice. For example, we trained a simple model (LeNet-5) on MNIST using Tensorflow as the backend, and then switched across the 3 backends for prediction. After slightly changing the strides value in the MaxPooling2D layer, we observed a clear gap in prediction accuracy. Below are the configurations, prediction results, and model architecture for this performance issue.

1) Configurations

Case	Configurations
Case0	pool_size=(2,2), strides=(2,2), padding='valid'
Case1	pool_size=(2,2), strides=(2,2), padding='same'
Case2	pool_size=(2,2), strides=(3,3), padding='same'
Case3	pool_size=(2,2), strides=(4,4), padding='same'
Case4	pool_size=(2,2), strides=(3,3), padding='valid'

2) Prediction accuracy (%)

	Tensorflow	CNTK	Theano
Case0	99.18	99.18	99.18
Case1	99.02	99.02	*85.17*
Case2	99.04	99.04	*83.57*
Case3	98.83	*70.23*	*82.83*
Case4	99.00	99.00	99.00

3) Model architecture

from keras.layers import Input, Conv2D, Dense, Flatten, MaxPooling2D, Activation
from keras.regularizers import l2
from keras.models import Model

class model:
    def __init__(self, input_shape, cls_num=10):
        self.name = 'LeNet-5'
        self.input_shape = input_shape
        self.cls_num = cls_num

        self.kernel_size = (5,5)
        self.weight_decay = 0.0001

    def build_model(self):
        input = Input(shape=self.input_shape)
        x = Conv2D(6, self.kernel_size, padding='valid', activation='relu', kernel_initializer='he_normal',
                         kernel_regularizer=l2(self.weight_decay))(input)
        x = MaxPooling2D(pool_size=(2, 2))(x)
        x = Conv2D(16, self.kernel_size, padding='valid', activation='relu', kernel_initializer='he_normal',
                         kernel_regularizer=l2(self.weight_decay))(x)

        x = MaxPooling2D((2, 2), strides=(2, 2),padding='valid')(x) # Case0
        #x = MaxPooling2D((2, 2), strides=(2, 2), padding='same')(x) # Case1
        #x = MaxPooling2D((2, 2), strides=(3, 3), padding='same')(x) # Case2
        #x = MaxPooling2D((2, 2), strides=(4, 4), padding='same')(x) # Case3
        #x = MaxPooling2D((2, 2), strides=(3, 3), padding='valid')(x) # Case4

        x = Flatten()(x)
        x = Dense(120, activation='relu', kernel_initializer='he_normal', kernel_regularizer=l2(self.weight_decay))(x)
        x = Dense(84, activation='relu', kernel_initializer='he_normal', kernel_regularizer=l2(self.weight_decay))(x)
        x = Dense(self.cls_num, kernel_initializer='he_normal', kernel_regularizer=l2(self.weight_decay))(x)
        x = Activation('softmax')(x)
        lenet5 = Model(input, x)
        return lenet5

Code to reproduce the issue

import os
import numpy as np
import keras.layers as L
import keras.backend as K
import importlib
from keras.models import load_model
from keras.engine import Model, Input

backends = ['cntk','tensorflow','theano']

# SoAD, Sum of Absolute Difference
def acc_abs_diff(output1, output2):
    assert output1.shape == output2.shape
    abs_diff = np.abs(output1-output2)
    return  np.sum(abs_diff)

def set_keras_backend(backend='tensorflow'):
    if K.backend() != backend:
        os.environ['KERAS_BACKEND'] = backend
        importlib.reload(K.load_backend)
        importlib.reload(K)
        assert K.backend() == backend

kwargs={'pool_size': 1, 'padding': 'same', 'strides': 1, 'data_format': 'channels_last'}
listdiff=[]
for stride_value in range(10):
    kwargs['strides']=stride_value+1
    input_data = (10 * np.random.random((1,32,32,16)))
    input = input_data.astype('float32')
    set_keras_backend('tensorflow')
    layer = L.pooling.MaxPool2D(**kwargs)# you can also use AveragePooling2D
    x = Input(batch_shape=input.shape)
    y = layer(x)
    bk_model = Model(x, y)
    # `yourpath` is a placeholder: set it to a writable directory before running.
    model_path = os.path.join(yourpath, 'model.h5')
    bk_model.save(model_path)
    output = {}
    for bk in backends:
        try:
            set_keras_backend(backend=bk)
            model = load_model(model_path)
            output[bk] = model.predict(input)
        except Exception as e:
            # Report which backend failed instead of silently swallowing the error.
            print('error result:', bk, e)
    try:
        diff1 = acc_abs_diff(output['tensorflow'], output['cntk'])
    except:
        diff1 = None
    try:
        diff2 = acc_abs_diff(output['theano'], output['cntk'])
    except:
        diff2 = None
    try:
        diff3 = acc_abs_diff(output['theano'], output['tensorflow'])
    except:
        diff3 = None
    listdiff.append([diff1, diff2, diff3])
arraydiff = np.array(listdiff)
print(arraydiff)
print('finish')
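The script above only compares backends against each other, so it cannot tell which one deviates. A backend-independent baseline might help. The following is a rough NumPy sketch, assuming TensorFlow-style 'same' padding in which padded positions are excluded from both the max and the average; `reference_pool2d` is a hypothetical helper, not part of Keras.

```python
import math
import numpy as np

def reference_pool2d(x, pool_size, stride, mode='max'):
    """Pool a (batch, height, width, channels) array with TF-style 'same' padding.
    Padded positions are skipped, so they affect neither the max nor the mean."""
    n, h, w, c = x.shape
    out_h, out_w = math.ceil(h / stride), math.ceil(w / stride)
    pad_h = max((out_h - 1) * stride + pool_size - h, 0)
    pad_w = max((out_w - 1) * stride + pool_size - w, 0)
    top, left = pad_h // 2, pad_w // 2
    reducer = np.max if mode == 'max' else np.mean
    out = np.empty((n, out_h, out_w, c), dtype=x.dtype)
    for i in range(out_h):
        # Clip each window to the real input instead of materialising padding.
        r0 = max(i * stride - top, 0)
        r1 = min(i * stride - top + pool_size, h)
        for j in range(out_w):
            c0 = max(j * stride - left, 0)
            c1 = min(j * stride - left + pool_size, w)
            out[:, i, j, :] = reducer(x[:, r0:r1, c0:c1, :], axis=(1, 2))
    return out
```

Passing each backend's output and `reference_pool2d(input, pool_size, stride)` to `acc_abs_diff` would then show how far each backend is from this particular convention, which is only one of several possible definitions of 'same'.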
