tencent-ailab / pika

a lightweight speech processing toolkit based on Pytorch and (Py)Kaldi
Apache License 2.0

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED/subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u #5

Open xbsdsongnan opened 3 years ago

xbsdsongnan commented 3 years ago

Traceback (most recent call last):
  File "/home/pika-main/trainer/train_transducer_bmuf_otfaug.py", line 305, in <module>
    model.cuda(args.local_rank)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', '/home/pika-main/trainer/train_transducer_bmuf_otfaug.py', '--local_rank=0', '--verbose', '--optim', 'sgd', '--initial_lr', '0.003', '--final_lr', '0.0001', '--grad_clip', '3.0', '--num_batches_per_epoch', '526264', '--num_epochs', '8', '--momentum', '0.9', '--block_momentum', '0.9', '--sync_period', '5', '--feats_dim', '80', '--cuda', '--batch_size', '8', '--encoder_type', 'transformer', '--enc_layers', '9', '--decoder_type', 'rnn', '--dec_layers', '2', '--rnn_type', 'LSTM', '--rnn_size', '1024', '--embd_dim', '100', '--dropout', '0.2', '--brnn', '--padding_idx', '6268', '--padding_tgt', '6268', '--stride', '1', '--queue_size', '8', '--loader', 'otf_utt', '--batch_first', '--cmn', '--cmvn_stats', '/home/pika/pika-main/egs/global_cmvn.stats', '--output_dim', '6268', '--num_workers', '1', '--sample_rate', '16000', '--feat_config', '/home/pika/pika-main/egs/fbank.conf', '--TU_limit', '15000', '--gain_range', '50,10', '--speed_rate', '0.9,1.0,1.1', '--log_per_n_frames', '131072', '--max_len', '1600', '--lctx', '1', '--rctx', '1', '--model_lctx', '21', '--model_rctx', '21', '--model_stride', '4', 'transducer', '/home/pika/pika-main/egs/lst/data.0.WORKER-ID.lst', '/home/pika/pika-main/egs/logs.baseline/train_transducer.0.WORKER-ID.log', '/home/pika/pika-main/egs/output/baseline.0']' returned non-zero exit status 1.

danpovey commented 3 years ago

I've seen it said that this error shows up fairly commonly, and seemingly at random, when you use LSTMs with PyTorch, particularly with some Anaconda distributions... but I've also seen it said that it can actually mask an out-of-memory condition. Regardless, I doubt it is repeatable.

On Fri, Jan 22, 2021 at 4:34 PM xbsdsongnan notifications@github.com wrote:


cweng6 commented 3 years ago

Thanks, Dan.

@xbsdsongnan I believe this is most likely related to GPU OOM. Could you try lowering the 'TU_limit' value to reduce GPU memory usage? BTW, you might need to adjust some of your options, such as '--padding_tgt' and '--num_batches_per_epoch', rather than using the default values.

xbsdsongnan commented 3 years ago

Do PyTorch 1.1.0 and CUDA 10.0 work together normally? @cweng6 @danpovey

cweng6 commented 3 years ago

I believe so. I saw some stable wheels available for installation here: https://download.pytorch.org/whl/torch_stable.html

xbsdsongnan commented 3 years ago

@cweng6

Traceback (most recent call last):
  File "/home/pika-main/trainer/train_transducer_bmuf_otfaug.py", line 305, in <module>
    model.cuda(args.local_rank)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', '/home/pika-main/trainer/train_transducer_bmuf_otfaug.py', '--local_rank=0', '--verbose', '--optim', 'sgd', '--initial_lr', '0.003', '--final_lr', '0.0001', '--grad_clip', '3.0', '--num_batches_per_epoch', '10', '--num_epochs', '2', '--momentum', '0.9', '--block_momentum', '0.9', '--sync_period', '5', '--feats_dim', '80', '--cuda', '--batch_size', '2', '--encoder_type', 'transformer', '--enc_layers', '9', '--decoder_type', 'rnn', '--dec_layers', '2', '--rnn_type', 'LSTM', '--rnn_size', '1024', '--embd_dim', '100', '--dropout', '0.2', '--brnn', '--padding_idx', '1', '--padding_tgt', '1', '--stride', '1', '--queue_size', '8', '--loader', 'otf_utt', '--batch_first', '--cmn', '--cmvn_stats', '/home/pika/pika-main/egs/global_cmvn.stats', '--output_dim', '6268', '--num_workers', '1', '--sample_rate', '16000', '--feat_config', '/home/pika/pika-main/egs/fbank.conf', '--TU_limit', '1', '--gain_range', '50,10', '--speed_rate', '0.9,1.0,1.1', '--log_per_n_frames', '131072', '--max_len', '1600', '--lctx', '1', '--rctx', '1', '--model_lctx', '21', '--model_rctx', '21', '--model_stride', '4', 'transducer', '/home/pika/pika-main/egs/lst/data.0.WORKER-ID.lst', '/home/pika/pika-main/egs/logs.baseline/train_transducer.0.WORKER-ID.log', '/home/pika/pika-main/egs/output/baseline.0']' returned non-zero exit status 1.

xbsdsongnan commented 3 years ago

@cweng6 I've adjusted a lot of parameters; the run above is just one of the combinations I tried. No matter how I modify the parameters, I can't get past this error. Do you have the configuration parameter settings for the basic demo?

cweng6 commented 3 years ago

We can run with the config in the release example. Setting TU_limit to 1 will not load any utterances for training. Anyway, could you describe your environment: Python/PyTorch/CUDA versions, number/spec of GPUs, etc.?
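To illustrate why a TU_limit of 1 leaves nothing to train on, here is a minimal sketch. It assumes that TU_limit caps the product of the number of frames T and the number of labels U; the thread does not spell this out, so treat it as a guess. The helper `select_utterances` and the sample lengths are hypothetical and are not pika's actual loader.

```python
from typing import List, Tuple


def select_utterances(utts: List[Tuple[str, int, int]], tu_limit: int) -> List[str]:
    """Return ids of utterances whose T * U does not exceed tu_limit.

    Each item is (utt_id, num_frames_T, num_labels_U).
    Hypothetical helper for illustration only, not pika's loader.
    """
    return [utt_id for utt_id, t, u in utts if t * u <= tu_limit]


# Made-up utterance lengths: (id, frames T, labels U)
utts = [("utt_a", 300, 12), ("utt_b", 1200, 40), ("utt_c", 80, 5)]

print(select_utterances(utts, 15000))  # utt_a (3600) and utt_c (400) fit
print(select_utterances(utts, 1))      # no realistic utterance fits
```

Under that assumption, lowering TU_limit trades batch content for memory, but it has to stay well above the T * U of a single typical utterance.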

csukuangfj commented 3 years ago

The output of the following command should be helpful for describing the environment.

$ python3 -m torch.utils.collect_env

xbsdsongnan commented 3 years ago

@cweng6 @csukuangfj

Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce RTX 2080 with Max-Q Design
Nvidia driver version: 430.34
cuDNN version: /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] numpydoc==0.8.0
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] torch 1.1.0
[conda] torchvision 0.3.0

xbsdsongnan commented 3 years ago

python3.6 cuda==10.0 torch==1.1.0 gpu==1

cweng6 commented 3 years ago

Thanks, Fangjun.

@xbsdsongnan, it looks like the version of CUDA used to build PyTorch doesn't match the one used at runtime.

Also, I am not sure the example script can run with one GPU. We will release an example using one GPU later on.
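The build/runtime CUDA mismatch can be spotted mechanically from the `collect_env` report. The sketch below is only a quick sanity check that compares the major.minor numbers of the two fields shown in the report above; the function names are made up for illustration, and a mismatch is a hint to investigate rather than proof of the root cause.

```python
import re
from typing import Optional, Tuple


def _major_minor(text: str, label: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a 'label: X.Y.Z' line, or None if absent."""
    match = re.search(re.escape(label) + r":\s*(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_versions_match(env_report: str) -> bool:
    """True only if both fields are present and agree on major.minor."""
    build = _major_minor(env_report, "CUDA used to build PyTorch")
    runtime = _major_minor(env_report, "CUDA runtime version")
    return build is not None and build == runtime


# Fields copied from the first environment report in this thread.
report = """PyTorch version: 1.1.0
CUDA used to build PyTorch: 9.0.176
CUDA runtime version: 10.0.130
"""

print(cuda_versions_match(report))  # False: 9.0 vs 10.0
```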

xbsdsongnan commented 3 years ago

@cweng6 Thanks, Wengchao. I'm learning a lot from you.

xbsdsongnan commented 3 years ago

@cweng6 I have eight GPUs on my server, but I really want to run on one GPU

xbsdsongnan commented 3 years ago

Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: GeForce RTX 2080 with Max-Q Design
Nvidia driver version: 430.34
cuDNN version: /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip3] numpy==1.14.3
[pip3] numpydoc==0.8.0
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] torch 1.1.0
[conda] torchvision 0.3.0

xbsdsongnan commented 3 years ago

@cweng6 FileNotFoundError: [Errno 2] No such file or directory: '/home/pika/egs/arks/train.0.2.mrk.0'

cweng6 commented 3 years ago

Can you locate the needed mrk file? If not, there must be something wrong with the data preparation step.

xbsdsongnan commented 3 years ago

label.txt:
BAC009S0764W0121 中国 实现 民族 复兴

wav.scp:
BAC009S0764W0121 /home/pika/data/test/S0764/BAC009S0764W0121.wav

xbsdsongnan commented 3 years ago

@cweng6 That is a sample of my prepared data. Is there a problem with it?

xbsdsongnan commented 3 years ago

@cweng6

Why can't I run the demo you released on four GPUs? Which parameters of your demo need to be modified? What versions does your environment use?

cweng6 commented 3 years ago

> label.txt:
> BAC009S0764W0121 中国 实现 民族 复兴
>
> wav.scp:
> BAC009S0764W0121 /home/pika/data/test/S0764/BAC009S0764W0121.wav

Your label.txt doesn't look right. Check our project README:

label.txt: label text file. The format is "uttid sequence-of-integer", where each integer is a one-based index of the mapped label; note that zero is reserved for blank, e.g., utt_id_1 3 5 7 10 23

You will need to map each character in the transcription to an integer when preparing label.txt.
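The mapping step can be sketched roughly as follows. It assumes ids are assigned in first-seen order starting from 1 (zero stays reserved for blank) and that spaces between words are dropped; the functions `build_vocab` and `to_label_lines` are hypothetical, and the vocabulary a real setup uses may be built differently.

```python
from typing import Dict, List


def build_vocab(transcripts: Dict[str, str]) -> Dict[str, int]:
    """Assign one-based ids to characters in first-seen order; 0 is left for blank."""
    vocab: Dict[str, int] = {}
    for text in transcripts.values():
        for ch in "".join(text.split()):  # drop whitespace between words
            if ch not in vocab:
                vocab[ch] = len(vocab) + 1
    return vocab


def to_label_lines(transcripts: Dict[str, str], vocab: Dict[str, int]) -> List[str]:
    """Format each utterance as 'uttid id id id ...'."""
    lines = []
    for utt_id, text in transcripts.items():
        ids = " ".join(str(vocab[ch]) for ch in "".join(text.split()))
        lines.append(utt_id + " " + ids)
    return lines


# Transcript taken from the sample posted earlier in this thread.
transcripts = {"BAC009S0764W0121": "中国 实现 民族 复兴"}
vocab = build_vocab(transcripts)
for line in to_label_lines(transcripts, vocab):
    print(line)  # BAC009S0764W0121 1 2 3 4 5 6 7 8
```

The same vocabulary has to be reused for every data split, so in practice it would be built once over the training transcripts and saved.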