microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Multi GPU usage is not optimal #456

Closed abbasov closed 7 years ago

abbasov commented 8 years ago

Hi, I am running CNTK for DNN training on a machine with 2 identical GPUs. I am mostly using default values in the config, but GPU utilization is not optimal: SamplesPerSecond drops 7-8x after approximately every 100 minibatches, and GPU usage drops to 0% while this happens. I have tried minibatch sizes from 512 to 4096 with no success. Any ideas? Thanks for the help.


Epoch[ 9 of 24]-Minibatch[46551-46560, 0.0000000010338%]: SamplesSeen = 10240; TrainLossPerSample =  1.91940047; EvalErr[0]PerSample = 0.47734375; TotalTime = 1.5636s; SamplesPerSecond = 6548.8
 Epoch[ 9 of 24]-Minibatch[46561-46570, 0.0000000010341%]: SamplesSeen = 10240; TrainLossPerSample =  1.91332975; EvalErr[0]PerSample = 0.47773437; TotalTime = 9.3897s; SamplesPerSecond = 1090.6

 Epoch[ 9 of 24]-Minibatch[46561-46570, 0.0000000010341%]: SamplesSeen = 10240; TrainLossPerSample =  1.91332975; EvalErr[0]PerSample = 0.47773437; TotalTime = 9.3898s; SamplesPerSecond = 1090.5

 Epoch[ 9 of 24]-Minibatch[46571-46580, 0.0000000010343%]: SamplesSeen = 10240; TrainLossPerSample =  1.89257326; EvalErr[0]PerSample = 0.47890625; TotalTime = 1.5189s; SamplesPerSecond = 6741.6
 Epoch[ 9 of 24]-Minibatch[46571-46580, 0.0000000010343%]: SamplesSeen = 10240; TrainLossPerSample =  1.89257326; EvalErr[0]PerSample = 0.47890625; TotalTime = 1.5190s; SamplesPerSecond = 6741.2
 Epoch[ 9 of 24]-Minibatch[46581-46590, 0.0000000010345%]: SamplesSeen = 10240; TrainLossPerSample =  1.91096219; EvalErr[0]PerSample = 0.47636719; TotalTime = 1.5315s; SamplesPerSecond = 6686.4
 Epoch[ 9 of 24]-Minibatch[46581-46590, 0.0000000010345%]: SamplesSeen = 10240; TrainLossPerSample =  1.91096219; EvalErr[0]PerSample = 0.47636719; TotalTime = 1.5316s; SamplesPerSecond = 6685.9
 Epoch[ 9 of 24]-Minibatch[46591-46600, 0.0000000010347%]: SamplesSeen = 10240; TrainLossPerSample =  1.91769625; EvalErr[0]PerSample = 0.47734375; TotalTime = 1.5540s; SamplesPerSecond = 6589.4
 Epoch[ 9 of 24]-Minibatch[46591-46600, 0.0000000010347%]: SamplesSeen = 10240; TrainLossPerSample =  1.91769625; EvalErr[0]PerSample = 0.47734375; TotalTime = 1.5553s; SamplesPerSecond = 6584.0
 Epoch[ 9 of 24]-Minibatch[46601-46610, 0.0000000010349%]: SamplesSeen = 10240; TrainLossPerSample =  1.90895075; EvalErr[0]PerSample = 0.47871094; TotalTime = 1.5213s; SamplesPerSecond = 6731.2
 Epoch[ 9 of 24]-Minibatch[46601-46610, 0.0000000010349%]: SamplesSeen = 10240; TrainLossPerSample =  1.90895075; EvalErr[0]PerSample = 0.47871094; TotalTime = 1.5200s; SamplesPerSecond = 6736.8
 Epoch[ 9 of 24]-Minibatch[46611-46620, 0.0000000010352%]: SamplesSeen = 10240; TrainLossPerSample =  1.90400652; EvalErr[0]PerSample = 0.47587891; TotalTime = 1.4767s; SamplesPerSecond = 6934.5
 Epoch[ 9 of 24]-Minibatch[46611-46620, 0.0000000010352%]: SamplesSeen = 10240; TrainLossPerSample =  1.90400652; EvalErr[0]PerSample = 0.47587891; TotalTime = 1.4767s; SamplesPerSecond = 6934.2
 Epoch[ 9 of 24]-Minibatch[46621-46630, 0.0000000010354%]: SamplesSeen = 10240; TrainLossPerSample =  1.91223264; EvalErr[0]PerSample = 0.47929688; TotalTime = 1.5408s; SamplesPerSecond = 6646.0
 Epoch[ 9 of 24]-Minibatch[46621-46630, 0.0000000010354%]: SamplesSeen = 10240; TrainLossPerSample =  1.91223264; EvalErr[0]PerSample = 0.47929688; TotalTime = 1.5420s; SamplesPerSecond = 6640.7
 Epoch[ 9 of 24]-Minibatch[46631-46640, 0.0000000010356%]: SamplesSeen = 10240; TrainLossPerSample =  1.90372097; EvalErr[0]PerSample = 0.47382812; TotalTime = 1.5362s; SamplesPerSecond = 6665.7
 Epoch[ 9 of 24]-Minibatch[46631-46640, 0.0000000010356%]: SamplesSeen = 10240; TrainLossPerSample =  1.90372097; EvalErr[0]PerSample = 0.47382812; TotalTime = 1.5350s; SamplesPerSecond = 6671.1
 Epoch[ 9 of 24]-Minibatch[46641-46650, 0.0000000010358%]: SamplesSeen = 10240; TrainLossPerSample =  1.91011578; EvalErr[0]PerSample = 0.47558594; TotalTime = 9.1696s; SamplesPerSecond = 1116.7

 Epoch[ 9 of 24]-Minibatch[46641-46650, 0.0000000010358%]: SamplesSeen = 10240; TrainLossPerSample =  1.91011578; EvalErr[0]PerSample = 0.47558594; TotalTime = 9.1706s; SamplesPerSecond = 1116.6

 Epoch[ 9 of 24]-Minibatch[46651-46660, 0.0000000010361%]: SamplesSeen = 10240; TrainLossPerSample =  1.88578277; EvalErr[0]PerSample = 0.47119141; TotalTime = 1.4771s; SamplesPerSecond = 6932.6

 Epoch[ 9 of 24]-Minibatch[46651-46660, 0.0000000010361%]: SamplesSeen = 10240; TrainLossPerSample =  1.88578277; EvalErr[0]PerSample = 0.47119141; TotalTime = 1.4761s; SamplesPerSecond = 6937.3
 Epoch[ 9 of 24]-Minibatch[46661-46670, 0.0000000010363%]: SamplesSeen = 10240; TrainLossPerSample =  1.90368257; EvalErr[0]PerSample = 0.48027344; TotalTime = 1.5095s; SamplesPerSecond = 6783.9
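The stall pattern is visible directly in the TotalTime field. A minimal script to flag outlier minibatches in such a log (the regular expression and threshold here are illustrative, not part of CNTK):

```python
import re

# Matches the TotalTime field of a CNTK progress line, e.g. "TotalTime = 9.3897s"
TIME_RE = re.compile(r"TotalTime = ([0-9.]+)s")

def find_stalls(log_lines, factor=3.0):
    """Return (line_index, seconds) for minibatches much slower than the median."""
    times = []
    for i, line in enumerate(log_lines):
        m = TIME_RE.search(line)
        if m:
            times.append((i, float(m.group(1))))
    if not times:
        return []
    median = sorted(t for _, t in times)[len(times) // 2]
    return [(i, t) for i, t in times if t > factor * median]
```

Run against the log above, this would flag exactly the 9-second minibatches that bracket the otherwise steady ~1.5 s cadence.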

This is my config:


# Parameters can be overwritten on the command line
# for example: cntk configFile=myConfigFile RootDir=../.. 
# For running from Visual Studio add
# currentDirectory=$(SolutionDir)/<path to corresponding data folder> 
RootDir = "."

ConfigDir = "$RootDir$"
DataDir = "$RootDir$"
OutputDir = "$RootDir$/output"
ModelDir = "$OutputDir$/models_500h"

# deviceId=-1 for CPU, >=0 for GPU devices, "auto" chooses the best GPU, or CPU if no usable GPU is available
deviceId = auto

command = speechTrain

precision = "float"
traceLevel = "1"
modelPath = "$ModelDir$/cntkSpeechFF.dnn"
outputPath = "$ModelDir$"
parallelTrain = true

#######################################
#  TRAINING CONFIG                    #
#######################################

speechTrain = [
    action = "train"
    makeMode = true
    SimpleNetworkBuilder = [
        layerSizes = 1320:2048*5:6057
        trainingCriterion = "CrossEntropyWithSoftmax"
        evalCriterion = "ErrorPrediction"
        layerTypes = "Sigmoid"
        applyMeanVarNorm = true
        needPrior = true
        addDropoutNodes = true

    ]

    SGD = [
        dropoutRate = 0.1
        epochSize = 0
        #maxTempMemSizeInSamplesForCNN = 2000000
        minibatchSize = 4096
        learningRatesPerMB = 0.01:0.8*3:0.125
        numMBsToShowResult = 10
        #momentumPerMB = 0.9:0.656119
        momentumPerMB = 0
        maxEpochs = 24
        keepCheckPointFiles = true

        # Additional optional parameters are: parallelizationStartEpoch
        parallelTrain = [
            parallelizationMethod = "DataParallelSGD"
            distributedMBReading = true

            # Additional optional parameters are: useZeroThresholdFor1BitQuantization
            dataParallelSGD = [
                gradientBits = 1
            ]
        ]

        AutoAdjust = [
            autoAdjustMinibatch = true
            minibatchSizeTuningFrequency = 1
            minibatchSizeTuningMax = 4096
            minibatchSearchCriterionErrorMargin = 2
            autoAdjustLR="adjustAfterEpoch"
            reduceLearnRateIfImproveLessThan=0.05
            loadBestModel=true
            increaseLearnRateIfImproveMoreThan=1000000000
            learnRateDecreaseFactor=0.5
            learnRateIncreaseFactor=1.382
        ]   
    ]

    reader = [
        readerType = "HTKMLFReader"
        readMethod = blockRandomize
        miniBatchMode = "partial"
        randomize = 6000000
        verbosity = 0

        features = [ 
            dim = 1320
            type = "real"
            scpFile = "$DataDir$/train500h.scp"
        ]

        labels = [
            mlfFile = "$DataDir$/train-tr-500h-cntk.mlf"
            labelMappingFile = "$DataDir$/state.list"
            labelDim = 6057
        ]
    ]

    cvReader = [
        # reader to use
        readerType = "HTKMLFReader"
        readMethod = blockRandomize
        miniBatchMode = "partial"
        randomize = 6000000
        verbosity = 0

        features = [
            dim = 1320
            type = "real"
            scpFile = "$DataDir$/valid500h.scp"
        ]

        labels = [
            mlfFile = "$DataDir$/valid-tr-500h-cntk.mlf"
            labelDim = 6057
            labelMappingFile = "$DataDir$/state.list"
            labelType = "Category"
        ]
    ]

]
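For context, gradientBits = 1 in the config above enables CNTK's 1-bit SGD, which quantizes each gradient element to a single bit and carries the quantization error forward into the next minibatch (error feedback). A toy sketch of the idea in plain Python, not CNTK's actual implementation:

```python
def one_bit_quantize(grad, residual):
    """One-bit gradient quantization with error feedback.

    Each element of the error-corrected gradient is replaced by +q or -q,
    where q is its mean magnitude; the leftover error becomes the residual
    that is added back before the next quantization step.
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    q = sum(abs(c) for c in corrected) / len(corrected)
    quantized = [q if c >= 0 else -q for c in corrected]
    new_residual = [c - s for c, s in zip(corrected, quantized)]
    return quantized, new_residual
```

Sending one bit per element (plus a shared scale) instead of a 32-bit float is what makes the gradient exchange between the two GPUs cheap; the residual feedback keeps the accumulated quantization error bounded over many minibatches.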
dongyu888 commented 8 years ago

Are you using sequence-level training (i.e., framemode=false)? If so, then due to differences in sequence length, some parallel utterances will finish earlier and some later. Around the end of each epoch you should see fewer samples (reported as SamplesSeen in the log) being processed. The actual processing speed is unchanged; only the number of useful samples processed is reduced, since some utterances have already finished processing.

abbasov commented 8 years ago

We have not set framemode=false. We are training a DNN, not an LSTM, and the number of samples isn't reduced only at the end of an epoch, but periodically, after approximately every 100 minibatches. Also, TotalTime increases from about 1.3 s to 9.5 s.

dongyu888 commented 8 years ago

If you are training a DNN you should set framemode=false.

To speed up sequence-model training, we group multiple utterances together. Depending on whether you use truncated BPTT (which packs better) or not (where the number of samples per minibatch may vary, since the maximum utterance length differs), the behavior will be slightly different, but in both cases you will compute some blank samples due to utterance-length differences.
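The blank-sample effect described above can be quantified: when a batch of utterances is padded to the length of the longest one, a fixed fraction of the computed frames carries no data. A small illustration (the utterance lengths are made up, not taken from this issue):

```python
def padding_waste(utterance_lengths):
    """Fraction of computed frames that are padding when every utterance
    in a batch is padded to the length of the longest one."""
    longest = max(utterance_lengths)
    computed = longest * len(utterance_lengths)  # frames the GPU processes
    useful = sum(utterance_lengths)              # frames with real data
    return 1.0 - useful / computed
```

For example, batching a 50-frame utterance with a 100-frame one wastes a quarter of the computation, which is why packing strategies matter for sequence training.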

abbasov commented 8 years ago

I am very sorry, but I'm confused. The CNTK Book says:

Setting frameMode to true is the default and is appropriate for training networks without any temporal connections.

I appreciate your time, thank you very much for support

dongyu888 commented 8 years ago

Sorry, I meant to say you should set frameMode=true for DNNs. If frameMode=true you should not see variable effective samples processed.

abbasov commented 8 years ago

Yes, I haven't set it, but the CNTK Book says the default is true. So what could cause the problem? Note that I downloaded the binaries from here:
https://github.com/Microsoft/CNTK/releases/tag/r2016-02-08. I have noticed that this also happens when training on a single GPU.

dongyu888 commented 8 years ago

Oh, that’s a very old version. Would you mind trying the newer version from https://github.com/Microsoft/CNTK/releases/tag/v1.1

abbasov commented 8 years ago

I will try it. Thank you very much for help and support!

abbasov commented 8 years ago

Hi, I have tried the latest version of CNTK with the frameMode=true option. It didn't solve the problem: again, after approximately every 100 minibatches, the GPUs stall for about 8 seconds and then continue working, which is observable with the nvidia-smi command. I have also tried the rollingWindow reader option; epoch time dropped from 13 hours to 7 hours, and the GPUs stayed continuously busy. But if I stop the training and restart it for the last epoch, it creates a new /tmp/temp.CNTK.xxx file, which takes a very long time. How can I reuse the same temp file for the rest of training?

frankseide commented 8 years ago

Do you see log messages associated with the 8-second stalls, such as "recoverblock"? What you may be seeing is simply the loading of upcoming chunks. If your data is on a file server, this operation will depend on the capacity and load of your network and file server. The rollingWindow source does not load data from the network during training, since it makes a full local copy first, as you have observed. One way to verify would be to watch your network usage, and perhaps also your file server's load, during these 8 seconds.

Normally, you should see these slow-downs mostly during startup, in the initial minibatches, as a lot of data must be loaded upfront to fill the window. After a while, it should smooth out quite a bit.

The correct way to solve this is to prefetch the data on a parallel thread. The code existed in my original tool that these readers were taken from, but I need to check with the team whether this was enabled when the readers were ported to CNTK.
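The parallel-thread prefetching described here can be sketched generically: a background thread loads upcoming chunks into a bounded queue while the consumer trains on the current one, so I/O overlaps with computation instead of stalling it. A minimal illustration using Python's standard library (not CNTK's reader code; load_chunk stands in for whatever reads one chunk from disk):

```python
import queue
import threading

def prefetching_chunks(load_chunk, num_chunks, depth=2):
    """Yield chunks while a background thread loads the next ones."""
    buf = queue.Queue(maxsize=depth)

    def producer():
        for i in range(num_chunks):
            buf.put(load_chunk(i))  # blocks once `depth` chunks are queued
        buf.put(None)               # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()
    while True:
        chunk = buf.get()
        if chunk is None:
            return
        yield chunk
```

The bounded queue caps memory use while still hiding the disk latency behind the GPU work, which is exactly the stall pattern reported in this issue.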

abbasov commented 8 years ago

We don't use a network or a file server; we have two identical GPUs installed in a single machine. I started training without verbosity, so I haven't observed such a message yet, but I did observe heavy disk usage (probably reading) during the stalls. It would be wonderful if you released this code; it could reduce the training time by approximately 40%.

mahilleb-msft commented 7 years ago

HTKDeserializers (an HTKMLFReader replacement) supports prefetching of chunks. Please check out https://docs.microsoft.com/en-us/cognitive-toolkit/brainscript-and-python---understanding-and-extending-readers. If it doesn't work, please let us know. Thanks, Mark
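For reference, a deserializer-based reader section in BrainScript looks roughly like the following, adapted to this issue's dimensions and file names; the exact option names should be verified against the linked documentation page:

```
reader = {
    verbosity = 0
    randomize = true
    deserializers = ({
        type = "HTKFeatureDeserializer"
        module = "HTKDeserializers"
        input = {
            features = { dim = 1320 ; scpFile = "$DataDir$/train500h.scp" }
        }
    } : {
        type = "HTKMLFDeserializer"
        module = "HTKDeserializers"
        input = {
            labels = {
                mlfFile = "$DataDir$/train-tr-500h-cntk.mlf"
                labelMappingFile = "$DataDir$/state.list"
                dim = 6057
            }
        }
    })
}
```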

mahilleb-msft commented 7 years ago

Closing as answered. Please re-open or file a new issue if necessary. Thanks!