Open ericbolo opened 6 years ago
Great, thanks. Would you like to prepare a pull request to make sure we get the correct fix?
Sure!
There's another change I'd like to make as I explain in #134 , also related to parallel training, but I'm looking for validation from core contributors on this one.
I'm away for a couple days, I'll make the pull requests once I return.
While running the tedlium script with train_ctc_parallel_h.sh on a system with 8 GPUs, I found memory allocation errors in some of the tr.iter..log files.
I realized that several jobs were selecting the same device and thus competing for that GPU's memory.
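One way to avoid this collision, independent of the driver's compute mode, is to pin each parallel job to a distinct device via CUDA_VISIBLE_DEVICES. This is only an illustrative sketch, not the actual eesen launcher: `launch_jobs` is a hypothetical helper, and the `echo` stands in for starting a real training worker.

```shell
#!/bin/sh
# Sketch: launch N parallel jobs, each pinned to its own GPU via
# CUDA_VISIBLE_DEVICES, so no two jobs ever pick the same device.
launch_jobs() {
  num_gpus=$1
  i=0
  while [ "$i" -lt "$num_gpus" ]; do
    # In a real script this would start the per-job training command;
    # each job only sees device $i, which appears to it as device 0.
    CUDA_VISIBLE_DEVICES=$i sh -c 'echo "job pinned to GPU $CUDA_VISIBLE_DEVICES"' &
    i=$((i + 1))
  done
  wait  # block until all background jobs finish
}

launch_jobs 4
```

Because each job sees exactly one device, any in-process device selection logic trivially lands on a different physical GPU per job.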
I saw no error on stdout; only after a while did I notice the suggestion to turn on exclusive mode, near the top of the log file. Running the command "nvidia-smi -c 1" fixed the issue.
For future users who want to train models on multiple GPUs, I'd suggest making this more obvious, for example by printing a warning to stdout.
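The suggested warning could look something like this. A hedged sketch, not existing eesen code: `check_compute_mode` is a hypothetical helper; it relies on the real `nvidia-smi --query-gpu=compute_mode` query and degrades gracefully when `nvidia-smi` is absent.

```shell
#!/bin/sh
# Sketch: before launching parallel jobs, check each GPU's compute mode
# and print a loud warning on stdout if any GPU is not in exclusive mode.
check_compute_mode() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; skipping compute-mode check"
    return 0
  fi
  # One line per GPU, e.g. "Default" or "Exclusive_Process".
  modes=$(nvidia-smi --query-gpu=compute_mode --format=csv,noheader)
  if printf '%s\n' "$modes" | grep -qv "Exclusive"; then
    echo "WARNING: some GPUs are not in exclusive compute mode;"
    echo "parallel jobs may select the same device and compete for memory."
    echo "Consider enabling exclusive mode (this thread used 'nvidia-smi -c 1')."
  fi
  return 0
}

check_compute_mode
```

Printed before training starts, a warning like this would surface the problem immediately instead of leaving it buried near the top of a log file.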