Are you using sequence-level training (i.e., frameMode=false)? If that's the case, then due to sequence length differences some parallel utterances will finish earlier and some later. Around the end of each epoch you should see fewer samples (reported as samplesSeen in the log) being processed. The actual processing speed is not changed; only the number of useful samples processed is reduced (since some utterances have already finished processing).
We have not set frameMode=false. We are training a DNN, not an LSTM, and the number of samples isn't reduced only at the end of the epoch, but periodically, after approximately every 100 minibatches. Also, totaltime increases from 1.3 sec to 9.5 sec.
If you are training a DNN you should set frameMode=false.
To speed up sequence model training, we group multiple utterances together. Depending on whether you are using truncated BPTT (which can pack better) or not (where you may see a different number of samples for each minibatch, since the maximum length may differ), the behavior will be slightly different, but in both cases you will compute some blank samples due to the utterance length differences.
I am very sorry, but I'm confused. The CNTK Book says:
Setting frameMode to true is the default and is appropriate for training networks without any temporal connections.
I appreciate your time. Thank you very much for your support.
Sorry, I meant to say you should set frameMode=true for DNNs. With frameMode=true you should not see a varying number of effective samples processed.
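For reference, here is a minimal sketch of a frame-mode reader section for the legacy HTKMLFReader; the feature/label dimensions, names, and paths below are placeholders, not taken from your setup:

```
reader = [
    readerType = "HTKMLFReader"
    # frame-mode randomization: appropriate for DNNs without temporal connections
    frameMode = true
    readMethod = "blockRandomize"
    randomize = "auto"
    features = [
        dim = 363                          # placeholder feature dimension
        scpFile = "$DataDir$/train.scp"    # placeholder path
    ]
    labels = [
        mlfFile = "$DataDir$/train.mlf"             # placeholder path
        labelMappingFile = "$DataDir$/states.list"  # placeholder path
        labelDim = 132                              # placeholder number of output states
    ]
]
```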
Yes, I haven't set it, but the CNTK Book says the default is true. So what could be causing the problem? Note that I have downloaded the binaries from here:
https://github.com/Microsoft/CNTK/releases/tag/r2016-02-08
I have noticed that this also happens when training on 1 GPU.
Oh, that’s a very old version. Would you mind trying the newer version from https://github.com/Microsoft/CNTK/releases/tag/v1.1
I will try it. Thank you very much for help and support!
Hi, I have tried the latest version of CNTK with the frameMode=true option. It didn't solve the problem. Again, after approximately every 100 minibatches the GPUs stop for 8 seconds and then continue working; this is observable with the nvidia-smi command. I also tried the rollingWindow reader option. The epoch time dropped from 13 hours to 7 hours and the GPUs worked continuously. But if I stop the training and restart it for the last epoch, it creates a new /tmp/temp.CNTK.xxx file, which takes a very long time. How can I reuse the same temp file for the rest of the training?
Do you see log messages associated with the 8-second stop, such as "recoverblock"? What you may be seeing is just the loading of upcoming chunks. If your data is on a file server, this operation will depend on your network's and the file server's capacity/load. The rollingWindow source does not load data from the network during training, since it makes a full local copy first, as you have observed. One way to check would be to look at your network usage, and maybe also the load on your file server, during these 8 seconds.
Normally, you should see these slow-downs mostly during startup (the initial minibatches), since a lot of data needs to be loaded upfront to fill the window. After a while, it should smooth out quite a bit.
The correct way to solve this is to prefetch the data on a parallel thread. The code existed in my original tool that these readers were taken from, but I need to check with the team whether this was enabled when the readers were ported to CNTK.
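If you stay with the rollingWindow read method, the relevant part of the reader section would look roughly like the sketch below. pageFilePath is the HTKMLFReader option for choosing where the temporary page file is written (the path here is just an example); as far as I know it makes the page file relocatable, not reusable across runs:

```
reader = [
    readerType = "HTKMLFReader"
    frameMode = true
    # rollingWindow makes a full local copy of the features into a page file
    readMethod = "rollingWindow"
    # example location on a fast local disk; defaults to the system temp directory
    pageFilePath = "/local_ssd/cntk_pagefile"
    # features and labels sections as in the frame-mode example above
]
```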
We don't use a network or a file server; we have two identical GPUs installed in a single machine. I started training without verbosity, so I haven't observed such log messages yet, but I did observe heavy disk usage (probably reading) during these stops. It would be wonderful if you released this code; it could reduce the training time by approximately 40%.
HTKDeserializers (an HTKMLFReader replacement) supports prefetching of chunks. Please check out https://docs.microsoft.com/en-us/cognitive-toolkit/brainscript-and-python---understanding-and-extending-readers. In case it doesn't work, please let us know... Thanks, Mark
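For reference, a sketch of the deserializer-based reader section along the lines of the page linked above, written in the newer brace syntax used there (the dimensions, file names, and the $DataDir$ variable are placeholders):

```
reader = {
    verbosity = 0
    randomize = true
    deserializers = ({
        type = "HTKFeatureDeserializer"
        module = "HTKDeserializers"
        input = {
            features = {
                dim = 363                           # placeholder feature dimension
                scpFile = "$DataDir$/train.scp"     # placeholder path
            }
        }
    } : {
        type = "HTKMLFDeserializer"
        module = "HTKDeserializers"
        input = {
            labels = {
                labelDim = 132                              # placeholder number of output states
                mlfFile = "$DataDir$/train.mlf"             # placeholder path
                labelMappingFile = "$DataDir$/states.list"  # placeholder path
            }
        }
    })
}
```

With this reader, upcoming chunks are prefetched on a background thread, so the periodic stalls caused by synchronous chunk loading should go away.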
Closing as answered. Please re-open or file a new issue if necessary. Thanks!
Hi, I am running CNTK for DNN training on a machine with 2 identical GPUs. I am using mostly default values in the config, but GPU utilization is not optimal: samplesPerSecond drops 7-8x after approximately every 100 minibatches during training, and GPU usage drops to 0% during these stalls. I have tried different minibatch sizes from 512 to 4096 with no success. Any ideas? Thanks for the help.
This is my config: