srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

parallel training: make suggestion to use the nvidia "exclusive compute" option more obvious? #143


ericbolo commented 6 years ago

While running the tedlium script with train_ctc_parallel_h.sh on a system with 8 GPUs, I found memory allocation errors in some of the tr.iter..log files.

I realized that several jobs were selecting the same device and thus competing for its memory.

I saw no error on stdout; it was only after a while that I noticed the suggestion to turn on exclusive compute mode, near the top of the log file. Running `nvidia-smi -c 1` fixed the issue.

For future users who want to train models on multiple GPUs, I'd suggest making this hint more prominent, e.g. by emitting a warning that is visible on stdout.
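To illustrate, here is a minimal sketch (not part of Eesen; the function name and sample output are hypothetical) of what such a startup check might look like. It assumes the output format of `nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader`, where compute mode is reported as `Default`, `Exclusive_Process`, etc.:

```python
# Hypothetical startup check: parse the output of
#   nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader
# and warn on stdout when any GPU is still in the shared "Default" mode.
def gpus_not_exclusive(smi_output):
    """Return indices of GPUs whose compute mode is not Exclusive_*."""
    shared = []
    for line in smi_output.strip().splitlines():
        index, mode = [field.strip() for field in line.split(",")]
        if not mode.startswith("Exclusive"):
            shared.append(int(index))
    return shared

# Sample nvidia-smi output, hard-coded for illustration; on a real system
# this string would come from running the command above.
sample = "0, Default\n1, Exclusive_Process\n2, Default\n"

bad = gpus_not_exclusive(sample)
if bad:
    print("WARNING: GPU(s) %s are not in exclusive compute mode; "
          "parallel jobs may pick the same device. "
          "Consider: sudo nvidia-smi -c EXCLUSIVE_PROCESS" % bad)
```

Running such a check once at the top of the parallel training script would surface the problem immediately on stdout, instead of burying it in per-iteration logs.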

fmetze commented 6 years ago

Great, thanks. Would you like to prepare a pull request to make sure we get the correct fix?

ericbolo commented 6 years ago

Sure!

There's another change I'd like to make, explained in #134, also related to parallel training, but I'm looking for validation from core contributors on that one.

I'm away for a couple of days; I'll open the pull requests once I return.