Do you see the perf degradation if you run on a single node only? Could you please tell us which reader you are using? It could be a regression either in the reader or in the aggregation; if single-node perf is the same, the problem is probably in aggregation. I would suggest running this with the profiler on in order to understand which part regressed: https://github.com/Microsoft/CNTK/wiki/BrainScript-and-Python-Performance-Profiler Thank you!
@eldakms only tested with 3 nodes so far; the reader is CNTKTextFormatReader.
The test model is a simple ConvBNReLU / Dropout / MaxPooling / ConvBNReLU / Dropout / MaxPooling / Sigmoid / Linear stack:
conv1_act.c: using cuDNN convolution engine for geometry: Input: 65 x 65 x 1, Output: 65 x 65 x 16, Kernel: 5 x 5 x 1, Map: 16, Stride: 1 x 1 x 1, Sharing: (1, 1, 1), AutoPad: (1, 1, 1), LowerPad: 0 x 0 x 0, UpperPad: 0 x 0 x 0.
Using cuDNN batch normalization engine.
pool1: using cuDNN convolution engine for geometry: Input: 65 x 65 x 16, Output: 32 x 32 x 16, Kernel: 2 x 2 x 1, Map: 1, Stride: 2 x 2 x 1, Sharing: (1), AutoPad: (0), LowerPad: 0, UpperPad: 0.
conv2_act.c: using cuDNN convolution engine for geometry: Input: 32 x 32 x 16, Output: 32 x 32 x 32, Kernel: 5 x 5 x 16, Map: 32, Stride: 1 x 1 x 16, Sharing: (1, 1, 1), AutoPad: (1, 1, 1), LowerPad: 0 x 0 x 0, UpperPad: 0 x 0 x 0.
Using cuDNN batch normalization engine.
pool2: using cuDNN convolution engine for geometry: Input: 32 x 32 x 32, Output: 16 x 16 x 32, Kernel: 2 x 2 x 1, Map: 1, Stride: 2 x 2 x 1, Sharing: (1), AutoPad: (0), LowerPad: 0, UpperPad: 0.
Model has 51 nodes.
Memory Sharing: Out of 84 matrices, 51 are shared as 12, and 33 are not shared.
Training 1063677 parameters in 10 out of 10 parameter tensors and 33 nodes with gradient:
Node 'conv1_act.b' (LearnableParameter operation) : [16 x 1]
Node 'conv1_act.sc' (LearnableParameter operation) : [16 x 1]
Node 'conv1_act.w' (LearnableParameter operation) : [16 x 25]
Node 'conv2_act.b' (LearnableParameter operation) : [32 x 1]
Node 'conv2_act.sc' (LearnableParameter operation) : [32 x 1]
Node 'conv2_act.w' (LearnableParameter operation) : [32 x 400]
Node 'h1.W' (LearnableParameter operation) : [128 x 8192]
Node 'h1.b' (LearnableParameter operation) : [128 x 1]
Node 'ol.W' (LearnableParameter operation) : [13 x 128]
Node 'ol.b' (LearnableParameter operation) : [13 x 1]
...
WARNING: ci Times operation: being unrolled, execution may be slow
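For reference, the 8192-wide input to h1.W [128 x 8192] can be reproduced from the geometry above; a quick shape sanity check (illustrative Python with a made-up helper name, not CNTK code):
def pool_out(size, kernel=2, stride=2):
    # output size of an unpadded pooling window
    return (size - kernel) // stride + 1

size = 65                       # 65 x 65 x 1: outer product of the 65-dim input vector
size = pool_out(size)           # conv1 (5x5, same padding) keeps 65x65; pool1 -> 32x32x16
size = pool_out(size)           # conv2 (5x5, same padding) keeps 32x32; pool2 -> 16x16x32
flattened = size * size * 32    # 16 * 16 * 32
print(size, flattened)          # 16 8192 -> matches h1.W : [128 x 8192]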
Config:
dataParallelSGD = [
gradientBits = 8
useBufferedAsyncGradientAggregation= false
]
...
reader = [
readerType = "CNTKTextFormatReader"
file = "$DataDir$/$m_trainSet_Fname$"
randomize = true
randomizationWindow = $dataWindow$
chunkSizeInBytes = 1073741824 #1GB chunks
frameMode = true
keepDataInMemory = true
...
]
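For context, dataParallelSGD with gradientBits = 8 quantizes gradients to 8 bits before aggregation and, as far as I understand, carries the quantization error forward to the next minibatch. A rough numpy sketch of that idea (illustrative only, with made-up names; not CNTK's actual implementation):
import numpy as np

def quantize_8bit(grad, residual):
    # Add the error carried over from the previous step (error feedback),
    # quantize to int8, and remember the new quantization error.
    g = grad + residual
    scale = max(float(np.abs(g).max()), 1e-12) / 127.0
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    dequantized = q.astype(np.float32) * scale
    return q, scale, g - dequantized          # int8 payload, scale, new residual

# Each worker would exchange (q, scale) instead of full float32 gradients,
# cutting aggregation bandwidth roughly 4x versus 32-bit floats.
grad = np.random.randn(1000).astype(np.float32)
residual = np.zeros_like(grad)
q, scale, residual = quantize_8bit(grad, residual)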
Will run the single-node test and the profiler on 3 nodes ASAP.
@eldakms I ran profiling today. It shows multiple performance issues in RC2 - don't you see that in your tests? Is precision = "float"
still valid in BrainScript? Can't think of anything else that would trigger this.
RC1 - 100% GPU load / ~3700 samples per second
RC2 - 100% GPU load / ~300 samples per second
RC1 (3 nodes x 1 GPU)
rank0:
_Minibatch Iteration : 136.605 ms 45.273 ms 118.394 ms 855.775 ms 1000 00:02:16.605
__Get Minibatch : 0.054 ms 0.024 ms 0.043 ms 0.777 ms 1000 54.039 ms
__Forward + Backward : 91.176 ms 4.120 ms 77.692 ms 106.447 ms 1000 00:01:31.176
__Gradient Aggregation : 44.763 ms 45.754 ms 15.414 ms 764.466 ms 1000 44763.320 ms
__Weight Update : 0.553 ms 0.333 ms 0.485 ms 3.559 ms 1000 553.490 ms
__Post Processing : 0.053 ms 0.080 ms 0.021 ms 0.356 ms 1000 52.521 ms
Data Reader
Prefetch Minibatch : 0.704 ms 1.426 ms 0.569 ms 45.594 ms 1000 704.400 ms
-------------------------------------------------------------------------------------------------------
rank1:
_Minibatch Iteration : 136.603 ms 45.288 ms 118.166 ms 855.992 ms 1000 00:02:16.603
__Get Minibatch : 0.044 ms 0.027 ms 0.032 ms 0.876 ms 1000 44.248 ms
__Forward + Backward : 124.579 ms 5.516 ms 107.069 ms 139.218 ms 1000 00:02:04.579
__Gradient Aggregation : 11.426 ms 45.421 ms 5.127 ms 736.461 ms 1000 11425.689 ms
__Weight Update : 0.512 ms 0.020 ms 0.464 ms 0.783 ms 1000 512.319 ms
__Post Processing : 0.037 ms 0.043 ms 0.020 ms 0.249 ms 1000 37.200 ms
Data Reader
Prefetch Minibatch : 0.696 ms 1.248 ms 0.568 ms 40.092 ms 1000 695.817 ms
-------------------------------------------------------------------------------------------------------
rank2:
_Minibatch Iteration : 136.603 ms 45.283 ms 118.299 ms 855.710 ms 1000 00:02:16.603
__Get Minibatch : 0.043 ms 0.024 ms 0.036 ms 0.775 ms 1000 42.501 ms
__Forward + Backward : 122.459 ms 45.875 ms 101.458 ms 848.187 ms 1000 00:02:02.459
__Gradient Aggregation : 13.576 ms 7.251 ms 5.451 ms 68.054 ms 1000 13576.037 ms
__Weight Update : 0.481 ms 0.069 ms 0.459 ms 2.443 ms 1000 480.757 ms
__Post Processing : 0.039 ms 0.044 ms 0.019 ms 0.219 ms 1000 39.316 ms
Data Reader
Prefetch Minibatch : 0.618 ms 0.815 ms 0.504 ms 26.137 ms 1000 617.792 ms
RC2 (3 nodes x 1 GPU)
rank0:
_Minibatch Iteration : 1627.321 ms 54.895 ms 1480.077 ms 1874.616 ms 1000 00:27:07.321
__Get Minibatch : 0.077 ms 0.021 ms 0.056 ms 0.714 ms 1000 76.728 ms
__Forward + Backward : 1095.774 ms 51.014 ms 926.958 ms 1251.470 ms 1000 00:18:15.774
__Gradient Aggregation : 530.792 ms 95.943 ms 227.960 ms 883.328 ms 1000 00:08:50.792
__Weight Update : 0.604 ms 0.323 ms 0.502 ms 3.601 ms 1000 604.144 ms
__Post Processing : 0.066 ms 0.084 ms 0.031 ms 0.513 ms 1000 65.853 ms
Data Reader
Prefetch Minibatch : 0.788 ms 0.903 ms 0.572 ms 29.237 ms 1000 787.892 ms
-------------------------------------------------------------------------------------------------------
rank1:
_Minibatch Iteration : 1627.298 ms 54.858 ms 1480.142 ms 1877.182 ms 1000 00:27:07.298
__Get Minibatch : 0.065 ms 0.029 ms 0.050 ms 0.948 ms 1000 64.823 ms
__Forward + Backward : 1585.589 ms 70.799 ms 1346.167 ms 1768.945 ms 1000 00:26:25.589
__Gradient Aggregation : 41.076 ms 63.901 ms 5.492 ms 415.975 ms 1000 41075.833 ms
__Weight Update : 0.514 ms 0.027 ms 0.472 ms 0.750 ms 1000 514.409 ms
__Post Processing : 0.047 ms 0.046 ms 0.028 ms 0.459 ms 1000 47.403 ms
Data Reader
Prefetch Minibatch : 0.755 ms 1.025 ms 0.593 ms 33.054 ms 1000 754.808 ms
-------------------------------------------------------------------------------------------------------
rank2:
_Minibatch Iteration : 1627.260 ms 54.849 ms 1480.282 ms 1876.972 ms 1000 00:27:07.260
__Get Minibatch : 0.062 ms 0.021 ms 0.041 ms 0.677 ms 1000 62.032 ms
__Forward + Backward : 1544.042 ms 75.062 ms 1338.964 ms 1869.853 ms 1000 00:25:44.042
__Gradient Aggregation : 82.624 ms 83.299 ms 6.368 ms 390.663 ms 1000 00:01:22.624
__Weight Update : 0.479 ms 0.028 ms 0.459 ms 0.827 ms 1000 479.286 ms
__Post Processing : 0.046 ms 0.046 ms 0.026 ms 0.500 ms 1000 46.275 ms
Data Reader
Prefetch Minibatch : 0.706 ms 0.852 ms 0.534 ms 27.312 ms 1000 706.199 ms
NB: minibatch size and epoch size were reduced and the model simplified here, so these runs are not directly comparable to the original samples-per-second numbers at the top of the thread.
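From the rank0 means above, the per-iteration slowdown is 1627.3 ms / 136.6 ms ≈ 11.9x, in line with the ~12x drop in samples per second, and nearly all of it sits in Forward + Backward (91.2 ms → 1095.8 ms) and Gradient Aggregation (44.8 ms → 530.8 ms).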
I will double check with the team and come back to you.
@avader906 I tested RC1 and RC2 on Ubuntu 14.04.1 with a single Titan X (Maxwell) GPU on the ResNet example and didn't find any perf degradation. Can you share your script and more details about your machine spec so I can check further?
@KeDengMS, @eldakms it seems the regression was introduced in the forward/backward pass and/or gradient aggregation (see the profiling above). The 100% GPU load could indicate that precision = "float"
(in the root of the BS config file) is ignored and the CUDA kernels run in double precision. That would not explain the drop in gradient-aggregation performance though - unless aggregation is also done on the GPU and in double precision.
I will prepare an example. These are my notes - everything should be easy to reproduce.
The test setup is BrainScript / CNTK.exe / Windows Server 2012 R2, fully patched; single node / single Kepler GPU (compute capability 3.5). The MS-MPI setup (tested on v7) uses 3 nodes x 1 Kepler GPU each; the profiling was done in that context.
An input vector of dim D is transformed to D x D via outer product. A sample of the input transformation used:
# define inputs
iInputs = 65
iOutputs = 13
i = Input {(65:1)}
o = Input {iOutputs}
ti = Transpose(i)
ci = Times(i, ti)
ri = SplitDimension(ci, 2, 1)
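For clarity, what the Transpose/Times/SplitDimension snippet above computes, as a numpy sketch (illustrative only; the BrainScript above is the actual model code):
import numpy as np

D = 65
i = np.random.rand(D, 1).astype(np.float32)   # input column vector, [65 x 1]
ci = i @ i.T                                  # outer product, [65 x 65]
ri = ci.reshape(D, D, 1)                      # add a trailing channel dim -> 65 x 65 x 1
print(ri.shape)                               # (65, 65, 1), the geometry conv1 sees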
The test model is a simple ConvBNReLU / Dropout / MaxPooling / ConvBNReLU / Dropout / MaxPooling / Sigmoid / Linear stack.
@avader906, I found a perf degradation between beta15 and RC1: the image deserializer no longer turns on multithreaded_deserializer by default. That bug seems to affect the Windows Python build most. Though it looks different from your issue, which claims a degradation in RC2 with BrainScript, can you check the CPU utilization in your repro to see if there's a difference between RC1 and RC2?
@KeDengMS we use BrainScript / cntk.exe. Should CNTKTextFormatDeserializer (for performance reasons) be used instead of CNTKTextFormatReader everywhere?
We will modify the model to use the composite reader if there is a performance benefit - is the BrainScript key multiThreadedDeserialization?
Please take a look at the profiling I sent earlier: the degradation mainly happens in __Forward + Backward and __Gradient Aggregation.
We can see the GPU loaded at 100%, but samples per second drop ~12x vs RC1. One scenario would be that the GPU, for some reason, computes the model in double precision. The BrainScript model explicitly sets precision to "float", so the regression could be in how that setting is applied.
This is the sanitized version of our test model (a 65-float input vector and 13 binary labels) that shows the described regression when we substitute the RC1 CNTK folder with RC2:
RootDir = "c:/localcntk/test_models"
ConfigDir = "$RootDir$/Config"
DataDir = "$RootDir$/Data"
OutputDir = "$RootDir$/Output"
ModelDir = "$OutputDir$/Models"
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
p_Model = "unit_tests_m2_pOn"
#+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
# M2 test unit parameters
# ******************************************************************
modelPath = "$ModelDir$/$p_Model$_CNN_c2_f2.dnn"
#GPU only
deviceId = 0 #GPU
#reader dataWindow parameter in samples
dataWindow = 30000000
#**********************
profilerEnabled = true
#**********************
#profiling configuration - limit epoch size and number of epochs
epochWindow = 1000000
totalEpochs = 2
#execute training
command = run_Train
#######################################
custom_minibatchSize = 1000
#######################################
# Global test unit configuration
# ******************************************************************
#Global parameters
precision = "float"
parallelTrain = false
numCPUThreads = 12
keepCheckPointFiles = true
# Reader parameters
prefetch = true
hyperCompressMemory = false
# Logging
stderr = "$OutputDir$/$p_Model$_CNN_c2_f2"
traceGPUMemoryAllocations=0 #trace memory usage
traceLevel = 1 #0 (limited output), 1 (medium output) and 2 (verbose output)
p_TrainSet = "trainSet"
p_ValidSet = "validSet"
p_TestSet = "testSet"
m_trainSet_Fname = "$p_TrainSet$_$p_Model$.txt"
m_validSet_Fname = "$p_ValidSet$_$p_Model$.txt"
m_testSet_Fname = "$p_TestSet$_$p_Model$.txt"
m_testAgainst_Fname = "$p_TestSet$_$p_TestAgainst$.txt"
#######################################
# TRAINING CONFIG
#######################################
run_Train=[
action = "train"
BrainScriptNetworkBuilder = {
LeakyReLU (x) = 0.10 * x + 0.90 * ReLU (x)
DNNSigmoidLayer (inDim, outDim, x, parmScale) = [
W = Parameter (outDim, inDim, init="uniform", initValueScale=parmScale, initOnCPUOnly=true)
b = Parameter (outDim, 1, init="fixedValue", value=0, initOnCPUOnly=true)
y = Sigmoid (W * x + b)
].y
DNNLayer(inDim, outDim, x, parmScale) = [
W = Parameter(outDim, 0/*inDim, gets inferred*/, init="uniform", initValueScale=parmScale, initOnCPUOnly=true)
b = Parameter(outDim, 1, init="fixedValue", value=0, initOnCPUOnly=true)
out = W * x + b
].out
ConvBNReLULayer(inp, outMap, kW, kH, inMap, hStride, vStride, wScale, bValue, scValue, bnTimeConst) = [
inWCount = kW * kH * inMap
w = LearnableParameter (outMap, inWCount, learningRateMultiplier = 1.0, init = 'gaussian', initValueScale = wScale, value = 0, initFromFilePath = '', initFromLiteral = '', initOnCPUOnly=true, randomSeed=-1, tag='')
c = Convolution(w, inp, (kW:kH:inMap), mapDims=outMap, stride=(hStride:vStride:inMap), sharing = true, autoPadding = true, lowerPad = 0, upperPad = 0)
b = Parameter(outMap, 1, init="fixedValue", value=bValue)
sc = Parameter(outMap, 1, init="fixedValue", value=scValue)
m = Parameter(outMap, 1, init="fixedValue", value=0, learningRateMultiplier=0)
isd = Parameter(outMap, 1, init="fixedValue", value=0, learningRateMultiplier=0)
y = BatchNormalization(c, sc, b, m, isd, spatial = true, normalizationTimeConstant = 5000, blendTimeConstant = 0, epsilon = 0.00001, useCntkEngine = false)
out = LeakyReLU(y)
].out
# define inputs
iInputs = 65
iOutputs = 13
i = Input {(65:1)}
o = Input {iOutputs}
ti = Transpose(i)
ci = Times(i, ti)
ri = SplitDimension(ci, 2, 1)
imageW = iInputs
imageH = iInputs
# conv1
kW1 = 5
kH1 = 5
cMap1 = 16
hStride1 = 1
vStride1 = 1
conv1_act = ConvBNReLULayer(ri, cMap1, kW1, kH1, 1, hStride1, vStride1, 10, 1, 1, 1)
d1 = Dropout(conv1_act)
# pool1
pool1W = 2
pool1H = 2
pool1hStride = 2
pool1vStride = 2
pool1 = MaxPooling (d1, pool1W, pool1H, pool1hStride, pool1vStride, imageLayout="CHW")
# conv2
kW2 = 5
kH2 = 5
cMap2 = 32
hStride2 = 1
vStride2 = 1
conv2_act = ConvBNReLULayer(pool1, cMap2, kW2, kH2, cMap1, hStride2, vStride2, 10, 1, 1, 1)
d2 = Dropout(conv2_act)
# pool2
pool2W = 2
pool2H = 2
pool2hStride = 2
pool2vStride = 2
pool2 = MaxPooling (d2, pool2W, pool2H, pool2hStride, pool2vStride, imageLayout="CHW")
h1Dim = 128
h1 = DNNSigmoidLayer(16*16*cMap2, h1Dim, FlattenDimensions (pool2, 1, 3), 1)
ol = DNNLayer(h1Dim, iOutputs, h1, 1)
ce = CrossEntropyWithSoftmax (o, ol)
errs = ErrorPrediction (o, ol)
top5Errs = ClassificationError (o, ol, topN=5) # only used in Eval action
# Special nodes
featureNodes = (i)
labelNodes = (o)
criterionNodes = (ce)
evaluationNodes = (errs)
outputNodes = (ol)
#additional output
#y = Times(x,x, tag='output')
}
# SGD learner configuration
SGD = [
#epoch size
epochSize = $epochWindow$
minibatchSize = $custom_minibatchSize$
maxEpochs = $totalEpochs$
#learning rate
learningRatesPerSample = 0.00000533 #modified to equal alpha
minLearningRatePerSample = 0.000001
#momentum
momentumAsTimeConstant = 300*1:1527*1:2741*1:13288
useNAG = true #nesterov momentum
#dropout
dropoutRate = 0.5
#L2RegWeight=0.0001
#L1RegWeight=0.0001
# Additional optional parameters are: distributedMBReading
parallelTrain = [
parallelizationMethod = "DataParallelSGD"
parallelizationStartEpoch = 1 #no warm-start
distributedMBReading = true
syncPerfStats = 10
# Additional optional parameters are: useZeroThresholdFor1BitQuantization
dataParallelSGD = [
gradientBits = 8
useBufferedAsyncGradientAggregation= false #true seems to cause perf penalty / wait
]
autoAdjust = [
autoAdjustLR = "adjustAfterEpoch"
reduceLearnRateIfImproveLessThan = 0
continueReduce = false
increaseLearnRateIfImproveMoreThan = 1000
loadBestModel = true
learnRateAdjustInterval = 1
learnRateDecreaseFactor = 0.36 #prior good value
learnRateIncreaseFactor = 1.382
numPrevLearnRates = 3
numBestSearchEpoch = 1
autoAdjustMinibatch = false
#minibatchSizeTuningFrequency = 2 # try to enlarge after this many epochs
#numMiniBatch4LRSearch = 200
#minibatchSizeTuningMax = 25000 # out of memory above this
]
]
gradUpdateType = "fsAdaGrad"
]
# reader configuration
reader = [
readerType = "CNTKTextFormatReader"
file = "$DataDir$/$m_testSet_Fname$"
randomize = true
randomizationWindow = $dataWindow$
chunkSizeInBytes = 1073741824 #1GB chunks
frameMode = true
keepDataInMemory = true
input = [
i = [
alias = "i"
dim = 65
format = "dense"
]
o = [
alias = "o"
dim = 13
format = "dense"
]
]
]
]
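A side note on reading the SGD block above, as a sketch of my understanding of CNTK's per-sample semantics (the Wiki has the exact definitions; the variable names here are just illustrative):
import math

minibatch_size = 1000
lr_per_sample = 0.00000533
momentum_time_constant = 13288          # last value in the momentumAsTimeConstant schedule

lr_per_minibatch = lr_per_sample * minibatch_size                              # ~0.00533
momentum_per_minibatch = math.exp(-minibatch_size / momentum_time_constant)    # ~0.927
print(lr_per_minibatch, momentum_per_minibatch)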
I can confirm this is not a reader issue. We have tried the composite reader and RC2 throughput is still ~12x lower than RC1. This leaves the BrainScript functions (per the above model) as the likely culprit.
@KeDengMS in RC1, switching to composite reader with multiThreadedDeserialization = true results in
May I suggest you update/revise the documentation to make clear that CNTKTextFormatReader is deprecated - it surely is, judging by the performance and uniform-load benefits. I do not know if you have given up on updating the Wiki, but the Cognitive Toolkit web pages would benefit as well, here and here. It is confusing to have documentation refer to deprecated functionality; instead it should show how to update the code.
In any case, many thanks for pointing out the composite reader and the work done on readers since the early betas.
NB: reader configuration changed from
reader = [
readerType = "CNTKTextFormatReader"
file = "$DataDir$/$m_testSet_Fname$"
randomize = true
randomizationWindow = $dataWindow$
chunkSizeInBytes = 1073741824 #1GB chunks
frameMode = true
keepDataInMemory = true
input = [
i = [
alias = "i"
dim = 65
format = "dense"
]
o = [
alias = "o"
dim = 13
format = "dense"
]
]
]
to
reader = [
verbosity = 0
traceLevel = 0
precision = "float"
randomize = true
sampleBasedRandomizationWindow = true
multiThreadedDeserialization = true
randomizationWindow = $dataWindow$ #in samples
chunkSizeInBytes = 1073741824 #1GB chunks
frameMode = true
keepDataInMemory = true
deserializers = (
[
type = "CNTKTextFormatDeserializer"
module = "CNTKTextFormatReader"
file = "$DataDir$/$m_trainSet_Fname$"
input = [
i = [
alias = "i"
dim = 65
format = "dense"
]
o = [
alias = "o"
dim = 13
format = "dense"
]
]
]
)
]
Got a repro now. This is because of a bug fix in RC2 for asymmetric padding in convolution, which falls back to the slow reference engine. cuDNN has a bug with asymmetric padding in which the results are broken. I think padding on the channel dimension is not intended, right?
pool1: using cuDNN convolution engine for geometry: Input: 65 x 65 x 16, Output: 32 x 32 x 16, Kernel: 2 x 2 x 1, Map: 1, Stride: 2 x 2 x 1, Sharing: (1), AutoPad: (0), LowerPad: 0, UpperPad: 0.
WARNING: Detected asymmetric padding issue with even kernel size and lowerPad (247) < higherPad (249) (i=2), cuDNN will not be able to produce correct result. Switch to reference engine (VERY SLOW).
conv2_act.c: using reference convolution engine for geometry, could be VERY SLOW: Input: 32 x 32 x 16, Output: 32 x 32 x 32, Kernel: 5 x 5 x 16, Map: 32, Stride: 1 x 1 x 16, Sharing: (1, 1, 1), AutoPad: (1, 1, 1), LowerPad: 0 x 0 x 0, UpperPad: 0 x 0 x 0.
Using cuDNN batch normalization engine.
I changed the script to fix the asymmetric padding, and then the speed is the same as RC1, with this line changed:
c = Convolution(w, inp, (kW:kH:inMap), mapDims=outMap, stride=(hStride:vStride:inMap), sharing = true, autoPadding = (true:true:false), lowerPad = 0, upperPad = 0)
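For anyone puzzled by the lowerPad < higherPad warnings: with auto-padding, the total padding needed to keep the output the same size at stride 1 is kernel - 1 per dimension, and an even kernel forces an uneven split. A small sketch of that arithmetic (illustrative helper, not CNTK's source):
def same_pad_split(kernel_size):
    # total padding for a stride-1 "same"-size output, split into lower/upper halves
    total = kernel_size - 1
    lower = total // 2
    upper = total - lower
    return lower, upper

for k in (2, 4, 5):
    print(k, same_pad_split(k))   # 2 -> (0, 1), 4 -> (1, 2), 5 -> (2, 2)
# Even kernels (2x2 pooling windows, padding along the 16-deep channel axis, 4x4 convs)
# end up asymmetric, which is what trips the cuDNN path; odd kernels split evenly.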
@KeDengMS thank you! To be clear - does this invalidate any models trained with that script on RC1 with autoPadding = true? In other words, should those models be retrained with RC2 and autoPadding = (true:true:false)? References:
NVIDIA convolutionFFT2D SDK (developer.download.nvidia.com PDF); Caffe conv layer implementation (caffe.berkeleyvision.org)
This needs to be double checked.
I checked in a fix in the latest master. You should not see this issue again even if you are using the original script.
@cha-zhang thank you. We are rebuilding the test lab from Win2012/MS-MPI-ND to Ubuntu/OpenMPI-OpenIB, so we are unable to check right now, but from commit b9ac8f38488ebb1cc0550ae3323313a52ef3a7af it looks fine.
I observe the following issue if I use 4x4 kernels on even input sizes with padding autopadding=[False, True, True]:
WARNING: Detected asymmetric padding issue with even kernel size and lowerPad (1) < higherPad (2) (i=0), cuDNN will not be able to produce correct result. Switch to reference engine (VERY SLOW).
A 4x4 kernel is asymmetric by nature. Maybe you can pad it to 5x5 instead? @BowenBao for more suggestions.
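(Concretely: a 4x4 kernel needs 3 cells of total "same" padding, which can only split as 1 below / 2 above - exactly the lowerPad (1) < higherPad (2) in the warning - whereas a 5x5 kernel needs 4, splitting evenly as 2/2 and staying on the cuDNN path.)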
There is a 12x performance deterioration running our models when moving from RC1 to RC2.
It looks MPI-related given the drastic increase in gradient-aggregation time (Windows/MS-MPI/RDMA). On a positive note, GPU memory usage improved, and so did the individual GPU loads.
RC1
RC2
Have any of the defaults for BrainScript changed? What happened? The model uses dataParallelSGD:
The hardware/software setup is exactly the same between RC1 and RC2; the only change was replacing the CNTK bin directory. The BrainScript models are exactly the same.