Open xbsdsongnan opened 3 years ago
I've seen it said that that error is fairly commonly and randomly encountered when you use LSTMs with PyTorch, particularly with some Anaconda distributions... but I've also seen it said that it can actually mask an out-of-memory error. Regardless, I doubt it is repeatable.
Thanks, Dan.
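One way to narrow this down is to check whether cuDNN fails on a trivial LSTM outside the training script. A minimal smoke test (a sketch; the layer sizes here are arbitrary and not taken from the pika config):

```python
import torch

# Minimal cuDNN LSTM smoke test: if this small, fixed-size forward pass
# also raises CUDNN_STATUS_EXECUTION_FAILED, the problem is the
# PyTorch/CUDA/cuDNN setup rather than memory pressure from the model.
lstm = torch.nn.LSTM(input_size=80, hidden_size=1024,
                     num_layers=2, batch_first=True).cuda()
x = torch.randn(4, 100, 80).cuda()
with torch.no_grad():
    out, _ = lstm(x)
print("LSTM forward OK:", out.shape)
```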
@xbsdsongnan I believe this is most likely related to GPU OOM. Could you try lowering the 'TU_limit' value to reduce GPU memory usage? BTW, you might need to adjust some of your options, such as '--padding_tgt' and '--num_batches_per_epoch', instead of using the default values.
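For reference, a quick way to see how much memory the visible GPUs actually have before tuning 'TU_limit' (a sketch using standard PyTorch calls; the ~8 GB figure for a mobile RTX 2080 Max-Q is an assumption, check your own card):

```python
import torch

# Print the total memory of each visible GPU; an RTX 2080 with Max-Q
# design has roughly 8 GB, which the model in the launch command above
# can exhaust unless TU_limit / batch_size are reduced.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 1024 ** 3, 1), "GiB")
```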
Can PyTorch 1.1.0 and CUDA 10.0 work normally? @cweng6 @danpovey
I believe so. There are stable wheels available for installation here: https://download.pytorch.org/whl/torch_stable.html
@cweng6
Traceback (most recent call last):
File "/home/pika-main/trainer/train_transducer_bmuf_otfaug.py", line 305, in <module>
@cweng6 I've adjusted a lot of parameters; the configuration above is just one of them. No matter how I modify the parameters, I can't get past this error. Do you have the configuration settings for the basic demo?
We can run with the config in the release example. Setting TU_limit to 1 will not load any utterances for training. Anyway, could you describe your environment: Python/PyTorch/CUDA versions, number and spec of GPUs, etc.?
The output of the following command should be helpful for describing the environment.
$ python3 -m torch.utils.collect_env
@cweng6 @csukuangfj
Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce RTX 2080 with Max-Q Design
Nvidia driver version: 430.34
cuDNN version: /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7
Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] numpydoc==0.8.0
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] torch 1.1.0
python3.6 cuda==10.0 torch==1.1.0 gpu==1
Thanks, Fangjun.
@xbsdsongnan, it looks like the version of CUDA used to build PyTorch doesn't match the one used at runtime.
Also, I am not sure the example script can run with one GPU. We will release a single-GPU example later on.
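To confirm the build/runtime mismatch, the following standard checks can be run in the same Python environment (nothing pika-specific assumed here):

```python
import torch

# Compare the CUDA toolkit PyTorch was compiled against with what the
# runtime reports; a cu90-built wheel (9.0.176) running on a CUDA 10.0
# machine, as in the collect_env output above, is consistent with the error.
print("torch version  :", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN version  :", torch.backends.cudnn.version())
print("CUDA available :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0       :", torch.cuda.get_device_name(0))
```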
@cweng6 Thanks, Wengchao. I have a lot to learn from you.
@cweng6 I have eight GPUs on my server, but I really want to run on one GPU
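If the goal is to use a single card out of the eight, one option (a sketch, not the official pika recipe) is to expose only one device to the process:

```python
import os

# Expose only GPU 0 to this process; this must be set before CUDA is
# initialized, i.e. before the first torch.cuda call.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should now report 1
```

When launching through torch.distributed.launch, the equivalent is setting CUDA_VISIBLE_DEVICES in the shell and passing --nproc_per_node=1.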
Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: GeForce RTX 2080 with Max-Q Design
Nvidia driver version: 430.34
cuDNN version: /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7
Versions of relevant libraries:
[pip3] numpy==1.14.3
[pip3] numpydoc==0.8.0
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] torch 1.1.0
@cweng6 FileNotFoundError: [Errno 2] No such file or directory: '/home/pika/egs/arks/train.0.2.mrk.0'
Can you locate the needed mrk file? If not, there must be something wrong with the data preparation step.
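A quick way to see what the data preparation step actually produced (the directory below is taken from the error message; adjust it if your layout differs):

```python
import glob
import os

# List the .mrk/.ark files that exist where the trainer expects
# /home/pika/egs/arks/train.0.2.mrk.0 to be.
ark_dir = "/home/pika/egs/arks"
if not os.path.isdir(ark_dir):
    print("directory does not exist:", ark_dir)
else:
    for pattern in ("*.mrk*", "*.ark*"):
        print(pattern, sorted(glob.glob(os.path.join(ark_dir, pattern))))
```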
label.txt:
BAC009S0764W0121 中国 实现 民族 复兴

wav.scp:
BAC009S0764W0121 /home/pika/data/test/S0764/BAC009S0764W0121.wav
@cweng6 The above is my data preparation sample. Is there a problem with it?
@cweng6
Why can't I run the demo you released on four GPUs? Which parameters of your demo need to be modified? What version/configuration environment did you use?
label.txt:
BAC009S0764W0121 中国 实现 民族 复兴

wav.scp:
BAC009S0764W0121 /home/pika/data/test/S0764/BAC009S0764W0121.wav
Your label.txt doesn't look right. Check our project README:
label.txt: label text file, the format is, uttid sequence-of-integer, where integer is one-based indexing mapped label, note that zero is reserved for blank, e.g., utt_id_1 3 5 7 10 23
You will need to map each character in transcription to an integer when preparing label.txt
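A minimal sketch of that mapping, assuming a hypothetical units.txt with one "character index" pair per line (one-based indices, 0 reserved for blank) and a Kaldi-style text file of "uttid transcription"; the file names and formats here are illustrative, not part of the pika recipe:

```python
# Load a hypothetical character-to-integer table (one-based; 0 = blank).
char2id = {}
with open("units.txt", encoding="utf-8") as f:
    for line in f:
        ch, idx = line.split()
        char2id[ch] = int(idx)

# Rewrite "uttid transcription" into "uttid id id id ..." for label.txt,
# mapping every non-space character to its integer label.
with open("text", encoding="utf-8") as fin, \
        open("label.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        uttid, transcript = line.strip().split(maxsplit=1)
        ids = [str(char2id[c]) for c in transcript if not c.isspace()]
        fout.write("{} {}\n".format(uttid, " ".join(ids)))
```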
Traceback (most recent call last):
File "/home/pika-main/trainer/train_transducer_bmuf_otfaug.py", line 305, in <module>
model.cuda(args.local_rank)
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', '/home/pika-main/trainer/train_transducer_bmuf_otfaug.py', '--local_rank=0', '--verbose', '--optim', 'sgd', '--initial_lr', '0.003', '--final_lr', '0.0001', '--grad_clip', '3.0', '--num_batches_per_epoch', '526264', '--num_epochs', '8', '--momentum', '0.9', '--block_momentum', '0.9', '--sync_period', '5', '--feats_dim', '80', '--cuda', '--batch_size', '8', '--encoder_type', 'transformer', '--enc_layers', '9', '--decoder_type', 'rnn', '--dec_layers', '2', '--rnn_type', 'LSTM', '--rnn_size', '1024', '--embd_dim', '100', '--dropout', '0.2', '--brnn', '--padding_idx', '6268', '--padding_tgt', '6268', '--stride', '1', '--queue_size', '8', '--loader', 'otf_utt', '--batch_first', '--cmn', '--cmvn_stats', '/home/pika/pika-main/egs/global_cmvn.stats', '--output_dim', '6268', '--num_workers', '1', '--sample_rate', '16000', '--feat_config', '/home/pika/pika-main/egs/fbank.conf', '--TU_limit', '15000', '--gain_range', '50,10', '--speed_rate', '0.9,1.0,1.1', '--log_per_n_frames', '131072', '--max_len', '1600', '--lctx', '1', '--rctx', '1', '--model_lctx', '21', '--model_rctx', '21', '--model_stride', '4', 'transducer', '/home/pika/pika-main/egs/lst/data.0.WORKER-ID.lst', '/home/pika/pika-main/egs/logs.baseline/train_transducer.0.WORKER-ID.log', '/home/pika/pika-main/egs/output/baseline.0']' returned non-zero exit status 1.