Do you see the perf degradation if you run on a single node only? Could you please tell us which reader you are using? It could be a regression either in the reader or in the aggregation; if single-node perf is the same, the problem is probably in aggregation. I would suggest running this with the profiler on in order to understand which part regressed: https://github.com/Microsoft/CNTK/wiki/BrainScript-and-Python-Performance-Profiler Thank you!
@eldakms only tested with 3 nodes so far; the reader is CNTKTextFormatReader.
The test model is a simple ConvBNReLU / Dropout / MaxPooling / ConvBNReLU / Dropout / MaxPooling / Sigmoid / Linear stack:
conv1_act.c: using cuDNN convolution engine for geometry: Input: 65 x 65 x 1, Output: 65 x 65 x 16, Kernel: 5 x 5 x 1, Map: 16, Stride: 1 x 1 x 1, Sharing: (1, 1, 1), AutoPad: (1, 1, 1), LowerPad: 0 x 0 x 0, UpperPad: 0 x 0 x 0.
Using cuDNN batch normalization engine.
pool1: using cuDNN convolution engine for geometry: Input: 65 x 65 x 16, Output: 32 x 32 x 16, Kernel: 2 x 2 x 1, Map: 1, Stride: 2 x 2 x 1, Sharing: (1), AutoPad: (0), LowerPad: 0, UpperPad: 0.
conv2_act.c: using cuDNN convolution engine for geometry: Input: 32 x 32 x 16, Output: 32 x 32 x 32, Kernel: 5 x 5 x 16, Map: 32, Stride: 1 x 1 x 16, Sharing: (1, 1, 1), AutoPad: (1, 1, 1), LowerPad: 0 x 0 x 0, UpperPad: 0 x 0 x 0.
Using cuDNN batch normalization engine.
pool2: using cuDNN convolution engine for geometry: Input: 32 x 32 x 32, Output: 16 x 16 x 32, Kernel: 2 x 2 x 1, Map: 1, Stride: 2 x 2 x 1, Sharing: (1), AutoPad: (0), LowerPad: 0, UpperPad: 0.
Model has 51 nodes.
Memory Sharing: Out of 84 matrices, 51 are shared as 12, and 33 are not shared.
Training 1063677 parameters in 10 out of 10 parameter tensors and 33 nodes with gradient:
Node 'conv1_act.b' (LearnableParameter operation) : [16 x 1]
Node 'conv1_act.sc' (LearnableParameter operation) : [16 x 1]
Node 'conv1_act.w' (LearnableParameter operation) : [16 x 25]
Node 'conv2_act.b' (LearnableParameter operation) : [32 x 1]
Node 'conv2_act.sc' (LearnableParameter operation) : [32 x 1]
Node 'conv2_act.w' (LearnableParameter operation) : [32 x 400]
Node 'h1.W' (LearnableParameter operation) : [128 x 8192]
Node 'h1.b' (LearnableParameter operation) : [128 x 1]
Node 'ol.W' (LearnableParameter operation) : [13 x 128]
Node 'ol.b' (LearnableParameter operation) : [13 x 1]
...
WARNING: ci Times operation: being unrolled, execution may be slow
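For reference, the 8192-wide input to h1.W [128 x 8192] can be reproduced from the geometry above; a quick shape sanity check (illustrative Python with a made-up helper name, not CNTK code):
def pool_out(size, kernel=2, stride=2):
    # output size of an unpadded pooling window
    return (size - kernel) // stride + 1

size = 65                       # 65 x 65 x 1: outer product of the 65-dim input vector
size = pool_out(size)           # conv1 (5x5, same padding) keeps 65x65; pool1 -> 32x32x16
size = pool_out(size)           # conv2 (5x5, same padding) keeps 32x32; pool2 -> 16x16x32
flattened = size * size * 32    # 16 * 16 * 32
print(size, flattened)          # 16 8192 -> matches h1.W : [128 x 8192]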
Config:
dataParallelSGD = [
gradientBits = 8
useBufferedAsyncGradientAggregation= false
]
...
reader = [
readerType = "CNTKTextFormatReader"
file = "$DataDir$/$m_trainSet_Fname$"
randomize = true
randomizationWindow = $dataWindow$
chunkSizeInBytes = 1073741824 #1GB chunks
frameMode = true
keepDataInMemory = true
...
]
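For context, dataParallelSGD with gradientBits = 8 quantizes gradients to 8 bits before aggregation and, as far as I understand, carries the quantization error forward to the next minibatch. A rough numpy sketch of that idea (illustrative only, with made-up names; not CNTK's actual implementation):
import numpy as np

def quantize_8bit(grad, residual):
    # Add the error carried over from the previous step (error feedback),
    # quantize to int8, and remember the new quantization error.
    g = grad + residual
    scale = max(float(np.abs(g).max()), 1e-12) / 127.0
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    dequantized = q.astype(np.float32) * scale
    return q, scale, g - dequantized          # int8 payload, scale, new residual

# Each worker would exchange (q, scale) instead of full float32 gradients,
# cutting aggregation bandwidth roughly 4x versus 32-bit floats.
grad = np.random.randn(1000).astype(np.float32)
residual = np.zeros_like(grad)
q, scale, residual = quantize_8bit(grad, residual)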
Will run the single-node test and the profiler on 3 nodes ASAP.
@eldakms I ran profiling today. It shows multiple performance issues in RC2 - don't you see that in your tests? Is precision = "float"
still valid in BrainScript? Can't think of anything else that would trigger this.
RC1 - 100% GPU load / ~3700 samples per second
RC2 - 100% GPU load / ~300 samples per second
RC1 (3 nodes x 1 GPU)
rank0:
_Minibatch Iteration : 136.605 ms 45.273 ms 118.394 ms 855.775 ms 1000 00:02:16.605
__Get Minibatch : 0.054 ms 0.024 ms 0.043 ms 0.777 ms 1000 54.039 ms
__Forward + Backward : 91.176 ms 4.120 ms 77.692 ms 106.447 ms 1000 00:01:31.176
__Gradient Aggregation : 44.763 ms 45.754 ms 15.414 ms 764.466 ms 1000 44763.320 ms
__Weight Update : 0.553 ms 0.333 ms 0.485 ms 3.559 ms 1000 553.490 ms
__Post Processing : 0.053 ms 0.080 ms 0.021 ms 0.356 ms 1000 52.521 ms
Data Reader
Prefetch Minibatch : 0.704 ms 1.426 ms 0.569 ms 45.594 ms 1000 704.400 ms
-------------------------------------------------------------------------------------------------------
rank1:
_Minibatch Iteration : 136.603 ms 45.288 ms 118.166 ms 855.992 ms 1000 00:02:16.603
__Get Minibatch : 0.044 ms 0.027 ms 0.032 ms 0.876 ms 1000 44.248 ms
__Forward + Backward : 124.579 ms 5.516 ms 107.069 ms 139.218 ms 1000 00:02:04.579
__Gradient Aggregation : 11.426 ms 45.421 ms 5.127 ms 736.461 ms 1000 11425.689 ms
__Weight Update : 0.512 ms 0.020 ms 0.464 ms 0.783 ms 1000 512.319 ms
__Post Processing : 0.037 ms 0.043 ms 0.020 ms 0.249 ms 1000 37.200 ms
Data Reader
Prefetch Minibatch : 0.696 ms 1.248 ms 0.568 ms 40.092 ms 1000 695.817 ms
-------------------------------------------------------------------------------------------------------
rank2:
_Minibatch Iteration : 136.603 ms 45.283 ms 118.299 ms 855.710 ms 1000 00:02:16.603
__Get Minibatch : 0.043 ms 0.024 ms 0.036 ms 0.775 ms 1000 42.501 ms
__Forward + Backward : 122.459 ms 45.875 ms 101.458 ms 848.187 ms 1000 00:02:02.459
__Gradient Aggregation : 13.576 ms 7.251 ms 5.451 ms 68.054 ms 1000 13576.037 ms
__Weight Update : 0.481 ms 0.069 ms 0.459 ms 2.443 ms 1000 480.757 ms
__Post Processing : 0.039 ms 0.044 ms 0.019 ms 0.219 ms 1000 39.316 ms
Data Reader
Prefetch Minibatch : 0.618 ms 0.815 ms 0.504 ms 26.137 ms 1000 617.792 ms
RC2 (3 nodes x 1 GPU)
rank0:
_Minibatch Iteration : 1627.321 ms 54.895 ms 1480.077 ms 1874.616 ms 1000 00:27:07.321
__Get Minibatch : 0.077 ms 0.021 ms 0.056 ms 0.714 ms 1000 76.728 ms
__Forward + Backward : 1095.774 ms 51.014 ms 926.958 ms 1251.470 ms 1000 00:18:15.774
__Gradient Aggregation : 530.792 ms 95.943 ms 227.960 ms 883.328 ms 1000 00:08:50.792
__Weight Update : 0.604 ms 0.323 ms 0.502 ms 3.601 ms 1000 604.144 ms
__Post Processing : 0.066 ms 0.084 ms 0.031 ms 0.513 ms 1000 65.853 ms
Data Reader
Prefetch Minibatch : 0.788 ms 0.903 ms 0.572 ms 29.237 ms 1000 787.892 ms
-------------------------------------------------------------------------------------------------------
rank1:
_Minibatch Iteration : 1627.298 ms 54.858 ms 1480.142 ms 1877.182 ms 1000 00:27:07.298
__Get Minibatch : 0.065 ms 0.029 ms 0.050 ms 0.948 ms 1000 64.823 ms
__Forward + Backward : 1585.589 ms 70.799 ms 1346.167 ms 1768.945 ms 1000 00:26:25.589
__Gradient Aggregation : 41.076 ms 63.901 ms 5.492 ms 415.975 ms 1000 41075.833 ms
__Weight Update : 0.514 ms 0.027 ms 0.472 ms 0.750 ms 1000 514.409 ms
__Post Processing : 0.047 ms 0.046 ms 0.028 ms 0.459 ms 1000 47.403 ms
Data Reader
Prefetch Minibatch : 0.755 ms 1.025 ms 0.593 ms 33.054 ms 1000 754.808 ms
-------------------------------------------------------------------------------------------------------
rank2:
_Minibatch Iteration : 1627.260 ms 54.849 ms 1480.282 ms 1876.972 ms 1000 00:27:07.260
__Get Minibatch : 0.062 ms 0.021 ms 0.041 ms 0.677 ms 1000 62.032 ms
__Forward + Backward : 1544.042 ms 75.062 ms 1338.964 ms 1869.853 ms 1000 00:25:44.042
__Gradient Aggregation : 82.624 ms 83.299 ms 6.368 ms 390.663 ms 1000 00:01:22.624
__Weight Update : 0.479 ms 0.028 ms 0.459 ms 0.827 ms 1000 479.286 ms
__Post Processing : 0.046 ms 0.046 ms 0.026 ms 0.500 ms 1000 46.275 ms
Data Reader
Prefetch Minibatch : 0.706 ms 0.852 ms 0.534 ms 27.312 ms 1000 706.199 ms
NB: minibatch size and epoch size were reduced and the model simplified here, so these runs are not directly comparable to the original samples-per-second numbers at the top of the thread.
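From the rank0 means above, the per-iteration slowdown is 1627.3 ms / 136.6 ms ≈ 11.9x, in line with the ~12x drop in samples per second, and nearly all of it sits in Forward + Backward (91.2 ms → 1095.8 ms) and Gradient Aggregation (44.8 ms → 530.8 ms).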
I will double check with the team and come back to you.
@avader906 I tested RC1 and RC2 on Ubuntu 14.04.1 with a single Titan X (Maxwell) GPU on the ResNet example and didn't find any perf degradation. Can you share your script and more details about your machine spec so I can check further?
@KeDengMS, @eldakms it seems the regression was introduced in the forward/backward pass and/or gradient aggregation (see the profiling above). The 100% GPU load could indicate that precision = "float"
(in the root of the BS config file) is ignored and the CUDA kernels run in double precision. That would not explain the drop in gradient-aggregation performance though - unless aggregation is also done on the GPU and in double precision.
I will prepare an example. These are my notes - everything should be easy to reproduce.
The test setup is BrainScript / CNTK.exe / Windows Server 2012 R2, fully patched; single node / single Kepler GPU (compute capability 3.5). The MS-MPI setup (tested on v7) uses 3 nodes x 1 Kepler GPU each; the profiling was done in that context.
An input vector of dim D is transformed to D x D via outer product. A sample of the input transformation used:
# define inputs
iInputs = 65
iOutputs = 13
i = Input {(65:1)}
o = Input {iOutputs}
ti = Transpose(i)
ci = Times(i, ti)
ri = SplitDimension(ci, 2, 1)
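For clarity, what the Transpose/Times/SplitDimension snippet above computes, as a numpy sketch (illustrative only; the BrainScript above is the actual model code):
import numpy as np

D = 65
i = np.random.rand(D, 1).astype(np.float32)   # input column vector, [65 x 1]
ci = i @ i.T                                  # outer product, [65 x 65]
ri = ci.reshape(D, D, 1)                      # add a trailing channel dim -> 65 x 65 x 1
print(ri.shape)                               # (65, 65, 1), the geometry conv1 sees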
The test model is a simple ConvBNReLU / Dropout / MaxPooling / ConvBNReLU / Dropout / MaxPooling / Sigmoid / Linear stack.
@avader906, I found a perf degradation between beta15 and RC1: the image deserializer no longer turns on multithreaded_deserializer by default. That bug seems to affect the Windows Python build most. Though it looks different from your issue, which claims a degradation in RC2 with BrainScript, can you check the CPU utilization in your repro to see if there's a difference between RC1 and RC2?
@KeDengMS we use BrainScript / cntk.exe. Should CNTKTextFormatDeserializer (for performance reasons) be used instead of CNTKTextFormatReader everywhere?
We will modify the model to use the composite reader if there is a performance benefit - is the BrainScript key multiThreadedDeserialization?
Please take a look at the profiling I sent earlier: the degradation mainly happens in __Forward + Backward and __Gradient Aggregation.
We can see the GPU loaded at 100%, but samples per second drop ~12x vs RC1. One scenario would be that the GPU, for some reason, computes the model in double precision. The BrainScript model explicitly sets precision to "float", so the regression could be in how that setting is applied.
This is the sanitized version of our test model (a 65-float input vector and 13 binary labels) that shows the described regression when we substitute the RC1 CNTK folder with RC2:
RootDir = "c:/localcntk/test_models"
ConfigDir = "$RootDir$/Config"
DataDir = "$RootDir$/Data"
OutputDir = "$RootDir$/Output"
ModelDir = "$OutputDir$/Models"
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
p_Model = "unit_tests_m2_pOn"
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
# M2 test unit parameters
# ******************************************************************
modelPath = "$ModelDir$/$p_Model$_CNN_c2_f2.dnn"
#GPU only
deviceId = 0 #GPU
#reader dataWindow parameter in samples
dataWindow = 30000000
#**********************
profilerEnabled = true
#**********************
#profiling configuration - limit epoch size and number of epochs
epochWindow = 1000000
totalEpochs = 2
#execute training
command = run_Train
#######################################
custom_minibatchSize = 1000
#######################################
# Global test unit configuration
# ******************************************************************
#Global parameters
precision = "float"
parallelTrain = false
numCPUThreads = 12
keepCheckPointFiles = true
# Reader parameters
prefetch = true
hyperCompressMemory = false
# Logging
stderr = "$OutputDir$/$p_Model$_CNN_c2_f2"
traceGPUMemoryAllocations=0 #trace memory usage
traceLevel = 1 #0 (limited output), 1 (medium output) and 2 (verbose output)
p_TrainSet = "trainSet"
p_ValidSet = "validSet"
p_TestSet = "testSet"
m_trainSet_Fname = "$p_TrainSet$_$p_Model$.txt"
m_validSet_Fname = "$p_ValidSet$_$p_Model$.txt"
m_testSet_Fname = "$p_TestSet$_$p_Model$.txt"
m_testAgainst_Fname = "$p_TestSet$_$p_TestAgainst$.txt"
#######################################
# TRAINING CONFIG
#######################################
run_Train=[
action = "train"
BrainScriptNetworkBuilder = {
LeakyReLU (x) = 0.10 * x + 0.90 * ReLU (x)
DNNSigmoidLayer (inDim, outDim, x, parmScale) = [
W = Parameter (outDim, inDim, init="uniform", initValueScale=parmScale, initOnCPUOnly=true)
b = Parameter (outDim, 1, init="fixedValue", value=0, initOnCPUOnly=true)
y = Sigmoid (W * x + b)
].y
DNNLayer(inDim, outDim, x, parmScale) = [
W = Parameter(outDim, 0/*inDim, gets inferred*/, init="uniform", initValueScale=parmScale, initOnCPUOnly=true)
b = Parameter(outDim, 1, init="fixedValue", value=0, initOnCPUOnly=true)
out = W * x + b
].out
ConvBNReLULayer(inp, outMap, kW, kH, inMap, hStride, vStride, wScale, bValue, scValue, bnTimeConst) = [
inWCount = kW * kH * inMap
w = LearnableParameter (outMap, inWCount, learningRateMultiplier = 1.0, init = 'gaussian', initValueScale = wScale, value = 0, initFromFilePath = '', initFromLiteral = '', initOnCPUOnly=true, randomSeed=-1, tag='')
c = Convolution(w, inp, (kW:kH:inMap), mapDims=outMap, stride=(hStride:vStride:inMap), sharing = true, autoPadding = true, lowerPad = 0, upperPad = 0)
b = Parameter(outMap, 1, init="fixedValue", value=bValue)
sc = Parameter(outMap, 1, init="fixedValue", value=scValue)
m = Parameter(outMap, 1, init="fixedValue", value=0, learningRateMultiplier=0)
isd = Parameter(outMap, 1, init="fixedValue", value=0, learningRateMultiplier=0)
y = BatchNormalization(c, sc, b, m, isd, spatial = true, normalizationTimeConstant = 5000, blendTimeConstant = 0, epsilon = 0.00001, useCntkEngine = false)
out = LeakyReLU(y)
].out
# define inputs
iInputs = 65
iOutputs = 13
i = Input {(65:1)}
o = Input {iOutputs}
ti = Transpose(i)
ci = Times(i, ti)
ri = SplitDimension(ci, 2, 1)
imageW = iInputs
imageH = iInputs
# conv1
kW1 = 5
kH1 = 5
cMap1 = 16
hStride1 = 1
vStride1 = 1
conv1_act = ConvBNReLULayer(ri, cMap1, kW1, kH1, 1, hStride1, vStride1, 10, 1, 1, 1)
d1 = Dropout(conv1_act)
# pool1
pool1W = 2
pool1H = 2
pool1hStride = 2
pool1vStride = 2
pool1 = MaxPooling (d1, pool1W, pool1H, pool1hStride, pool1vStride, imageLayout="CHW")
# conv2
kW2 = 5
kH2 = 5
cMap2 = 32
hStride2 = 1
vStride2 = 1
conv2_act = ConvBNReLULayer(pool1, cMap2, kW2, kH2, cMap1, hStride2, vStride2, 10, 1, 1, 1)
d2 = Dropout(conv2_act)
# pool2
pool2W = 2
pool2H = 2
pool2hStride = 2
pool2vStride = 2
pool2 = MaxPooling (d2, pool2W, pool2H, pool2hStride, pool2vStride, imageLayout="CHW")
h1Dim = 128
h1 = DNNSigmoidLayer(16*16*cMap2, h1Dim, FlattenDimensions (pool2, 1, 3), 1)
ol = DNNLayer(h1Dim, iOutputs, h1, 1)
ce = CrossEntropyWithSoftmax (o, ol)
errs = ErrorPrediction (o, ol)
top5Errs = ClassificationError (o, ol, topN=5) # only used in Eval action
# Special nodes
featureNodes = (i)
labelNodes = (o)
criterionNodes = (ce)
evaluationNodes = (errs)
outputNodes = (ol)
#additional output
#y = Times(x,x, tag='output')
}
# SGD learner configuration
SGD = [
#epoch size
epochSize = $epochWindow$
minibatchSize = $custom_minibatchSize$
maxEpochs = $totalEpochs$
#learning rate
learningRatesPerSample = 0.00000533 #modified to equal alpha
minLearningRatePerSample = 0.000001
#momentum
momentumAsTimeConstant = 300*1:1527*1:2741*1:13288
useNAG = true #nesterov momentum
#dropout
dropoutRate = 0.5
#L2RegWeight=0.0001
#L1RegWeight=0.0001
# Additional optional parameters are: distributedMBReading
parallelTrain = [
parallelizationMethod = "DataParallelSGD"
parallelizationStartEpoch = 1 #no warm-start
distributedMBReading = true
syncPerfStats = 10
# Additional optional parameters are: useZeroThresholdFor1BitQuantization
dataParallelSGD = [
gradientBits = 8
useBufferedAsyncGradientAggregation= false #true seems to cause perf penalty / wait
]
autoAdjust = [
autoAdjustLR = "adjustAfterEpoch"
reduceLearnRateIfImproveLessThan = 0
continueReduce = false
increaseLearnRateIfImproveMoreThan = 1000
loadBestModel = true
learnRateAdjustInterval = 1
learnRateDecreaseFactor = 0.36 #prior good value
learnRateIncreaseFactor = 1.382
numPrevLearnRates = 3
numBestSearchEpoch = 1
autoAdjustMinibatch = false
#minibatchSizeTuningFrequency = 2 # try to enlarge after this many epochs
#numMiniBatch4LRSearch = 200
#minibatchSizeTuningMax = 25000 # out of memory above this
]
]
gradUpdateType = "fsAdaGrad"
]
# reader configuration
reader = [
readerType = "CNTKTextFormatReader"
file = "$DataDir$/$m_testSet_Fname$"
randomize = true
randomizationWindow = $dataWindow$
chunkSizeInBytes = 1073741824 #1GB chunks
frameMode = true
keepDataInMemory = true
input = [
i = [
alias = "i"
dim = 65
format = "dense"
]
o = [
alias = "o"
dim = 13
format = "dense"
]
]
]
]
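A side note on reading the SGD block above, as a sketch of my understanding of CNTK's per-sample semantics (the Wiki has the exact definitions; the variable names here are just illustrative):
import math

minibatch_size = 1000
lr_per_sample = 0.00000533
momentum_time_constant = 13288          # last value in the momentumAsTimeConstant schedule

lr_per_minibatch = lr_per_sample * minibatch_size                              # ~0.00533
momentum_per_minibatch = math.exp(-minibatch_size / momentum_time_constant)    # ~0.927
print(lr_per_minibatch, momentum_per_minibatch)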
I can confirm this is not a reader issue. We have tried the composite reader and RC2 throughput is still ~12x lower than RC1. This leaves the BrainScript functions (per the above model) as the likely culprit.
@KeDengMS in RC1, switching to composite reader with multiThreadedDeserialization = true results in
May I suggest you update/revise the documentation to make clear that CNTKTextFormatReader is deprecated - it surely is, judging by the performance and uniform-load benefits. I do not know if you have given up on updating the Wiki, but the Cognitive Toolkit web pages would benefit as well, here and here. It is confusing to have documentation refer to deprecated functionality; instead it should show how to update the code.
In any case, many thanks for pointing out the composite reader and the work done on readers since the early betas.
NB: reader configuration changed from
reader = [
readerType = "CNTKTextFormatReader"
file = "$DataDir$/$m_testSet_Fname$"
randomize = true
randomizationWindow = $dataWindow$
chunkSizeInBytes = 1073741824 #1GB chunks
frameMode = true
keepDataInMemory = true
input = [
i = [
alias = "i"
dim = 65
format = "dense"
]
o = [
alias = "o"
dim = 13
format = "dense"
]
]
]
to
reader = [
verbosity = 0
traceLevel = 0
precision = "float"
randomize = true
sampleBasedRandomizationWindow = true
multiThreadedDeserialization = true
randomizationWindow = $dataWindow$ #in samples
chunkSizeInBytes = 1073741824 #1GB chunks
frameMode = true
keepDataInMemory = true
deserializers = (
[
type = "CNTKTextFormatDeserializer"
module = "CNTKTextFormatReader"
file = "$DataDir$/$m_trainSet_Fname$"
input = [
i = [
alias = "i"
dim = 65
format = "dense"
]
o = [
alias = "o"
dim = 13
format = "dense"
]
]
]
)
]
Got a repro now. This is because of a bug fix in RC2 for asymmetric padding in convolution, which falls back to the slow reference engine. cuDNN has a bug with asymmetric padding in which the results are broken. I think padding on the channel dimension is not intended, right?
pool1: using cuDNN convolution engine for geometry: Input: 65 x 65 x 16, Output: 32 x 32 x 16, Kernel: 2 x 2 x 1, Map: 1, Stride: 2 x 2 x 1, Sharing: (1), AutoPad: (0), LowerPad: 0, UpperPad: 0.
WARNING: Detected asymmetric padding issue with even kernel size and lowerPad (247) < higherPad (249) (i=2), cuDNN will not be able to produce correct result. Switch to reference engine (VERY SLOW).
conv2_act.c: using reference convolution engine for geometry, could be VERY SLOW: Input: 32 x 32 x 16, Output: 32 x 32 x 32, Kernel: 5 x 5 x 16, Map: 32, Stride: 1 x 1 x 16, Sharing: (1, 1, 1), AutoPad: (1, 1, 1), LowerPad: 0 x 0 x 0, UpperPad: 0 x 0 x 0.
Using cuDNN batch normalization engine.
I changed the script to fix the asymmetric padding, and then the speed is the same as RC1, with this line changed:
c = Convolution(w, inp, (kW:kH:inMap), mapDims=outMap, stride=(hStride:vStride:inMap), sharing = true, autoPadding = (true:true:false), lowerPad = 0, upperPad = 0)
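For anyone puzzled by the lowerPad < higherPad warnings: with auto-padding, the total padding needed to keep the output the same size at stride 1 is kernel - 1 per dimension, and an even kernel forces an uneven split. A small sketch of that arithmetic (illustrative helper, not CNTK's source):
def same_pad_split(kernel_size):
    # total padding for a stride-1 "same"-size output, split into lower/upper halves
    total = kernel_size - 1
    lower = total // 2
    upper = total - lower
    return lower, upper

for k in (2, 4, 5):
    print(k, same_pad_split(k))   # 2 -> (0, 1), 4 -> (1, 2), 5 -> (2, 2)
# Even kernels (2x2 pooling windows, padding along the 16-deep channel axis, 4x4 convs)
# end up asymmetric, which is what trips the cuDNN path; odd kernels split evenly.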
@KeDengMS thank you! To be clear - does this invalidate any models trained with that script on RC1 with autoPadding = true? In other words, should those models be retrained with RC2 and autoPadding = (true:true:false)? References:
NVIDIA convolutionFFT2D SDK (developer.download.nvidia.com PDF); Caffe conv layer implementation (caffe.berkeleyvision.org)
This needs to be double checked.
I checked in a fix in the latest master. You should not see this issue again even if you are using the original script.
@cha-zhang thank you. We are rebuilding the test lab from Win2012/MS-MPI-ND to Ubuntu/OpenMPI-OpenIB, so we are unable to check right now, but from commit b9ac8f38488ebb1cc0550ae3323313a52ef3a7af it looks fine.
I observe the following issue if I use 4x4 kernels on even input sizes with padding autopadding=[False, True, True]:
WARNING: Detected asymmetric padding issue with even kernel size and lowerPad (1) < higherPad (2) (i=0), cuDNN will not be able to produce correct result. Switch to reference engine (VERY SLOW).
A 4x4 kernel is asymmetric by nature. Maybe you can pad it to 5x5 instead? @BowenBao for more suggestions.
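(Concretely: a 4x4 kernel needs 3 cells of total "same" padding, which can only split as 1 below / 2 above - exactly the lowerPad (1) < higherPad (2) in the warning - whereas a 5x5 kernel needs 4, splitting evenly as 2/2 and staying on the cuDNN path.)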
There is a 12x performance deterioration running our models when moving from RC1 to RC2.
It looks MPI-related given the drastic increase in gradient-aggregation time (Windows/MS-MPI/RDMA). On a positive note, GPU memory usage improved, and so did the individual GPU loads.
RC1
RC2
Have any of the defaults for BrainScript changed? What happened? The model uses dataParallelSGD:
The hardware/software setup is exactly the same between RC1 and RC2; the only change was replacing the CNTK bin directory. The BrainScript models are exactly the same.