openspeech-team / openspeech

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
https://openspeech-team.github.io/openspeech/

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #94

Open jun-danieloh opened 3 years ago

jun-danieloh commented 3 years ago

❓ Questions & Help

I am facing the issue below. I saw a previous post related to this issue, but an exact workaround wasn't shared, so I am posting it again. Could you please help me with this issue? Any suggestions?

Details

Environment: Ubuntu 18.04, Docker (FROM pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel)
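
A quick way to confirm what the container actually ships (a minimal sketch using standard PyTorch introspection calls; the expected values in the comments are assumptions based on the image tag, not verified output):

import torch

print(torch.__version__)                    # expected: 1.7.0
print(torch.version.cuda)                   # expected: 11.0
print(torch.backends.cudnn.version())       # cuDNN 8.x in this image
print(torch.backends.cudnn.is_available())  # should be True
print(torch.cuda.get_device_name(0))        # NVIDIA GeForce RTX 2080 Ti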

Logs:

[2021-09-07 12:43:51,510][openspeech.utils][INFO] - Operating System : Linux 5.11.0-27-generic
[2021-09-07 12:43:51,510][openspeech.utils][INFO] - Processor : x86_64
[2021-09-07 12:43:51,510][openspeech.utils][INFO] - device : NVIDIA GeForce RTX 2080 Ti
[2021-09-07 12:43:51,510][openspeech.utils][INFO] - CUDA is available : True
[2021-09-07 12:43:51,510][openspeech.utils][INFO] - CUDA version : 11.0
[2021-09-07 12:43:51,511][openspeech.utils][INFO] - PyTorch version : 1.7.0
/opt/conda/lib/python3.8/site-packages/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
  warnings.warn(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py:423: LightningDeprecationWarning: DataModule.setup has already been called, so it will not be called again. In v1.6 this behavior will change to always call DataModule.setup.
  rank_zero_deprecation(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
...
...
Epoch 0:   4%| | 1450/35504 [09:05<3:33:34,  2.66it/s, loss=nan, v_num=abtb, train_loss=nan.0, train_wer=0.901, train_
Error executing job with overrides: ['dataset.dataset_path=/home/jun1.oh/', 'dataset.dataset_download=False', 'dataset.manifest_file_path=/home/jun1.oh/LibriSpeech_daniel/libri_subword_manifest.txt', 'tokenizer.vocab_path=/home/jun1.oh/LibriSpeech_daniel/', 'trainer.batch_size=16']
Traceback (most recent call last):
  File "./openspeech_cli/hydra_train.py", line 60, in <module>
    hydra_main()
  File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 48, in decorated_main
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "./openspeech_cli/hydra_train.py", line 54, in hydra_main
    trainer.fit(model, data_module)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 917, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 995, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 147, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 201, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 395, in _optimizer_step
    model_ref.optimizer_step(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1618, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/adam.py", line 66, in step
    loss = closure()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 235, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 536, in training_step_and_backward
    result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 306, in _training_step
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 193, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/dp.py", line 93, in training_step
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 63, in forward
    output = super().forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 82, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/home/jun1.oh/workspace/openspeech/openspeech/models/openspeech_encoder_decoder_model.py", line 166, in training_step
    logits = self.decoder(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jun1.oh/workspace/openspeech/openspeech/decoders/lstm_attention_decoder.py", line 192, in forward
    step_outputs, hidden_states, attn = self.forward_step(
  File "/home/jun1.oh/workspace/openspeech/openspeech/decoders/lstm_attention_decoder.py", line 146, in forward_step
    outputs, hidden_states = self.rnn(embedded, hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 581, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
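
Note that the progress bar above already reports loss=nan before the crash, so the cuDNN failure may be a downstream symptom rather than the root cause. A sketch of standard PyTorch debugging steps that can localize this kind of error (not something run in the original report; the LSTM sizes below are arbitrary, chosen only to exercise the layer type that crashed):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before CUDA initializes so async
                                          # kernel errors surface at the real call site
import torch

# Flag the first autograd op that produces NaN/Inf (debug only; slows training).
torch.autograd.set_detect_anomaly(True)

# Smoke-test the layer that crashed (_VF.lstm) in isolation: if this also
# fails, the cuDNN/driver install is suspect rather than the model.
lstm = torch.nn.LSTM(input_size=512, hidden_size=512, num_layers=2).cuda()
x = torch.randn(50, 16, 512, device="cuda")  # (seq_len, batch, features)
out, _ = lstm(x)
torch.cuda.synchronize()                     # force execution so errors raise here
print(out.shape)                             # torch.Size([50, 16, 512])
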
upskyy commented 3 years ago

Could you show me your training script?

jun-danieloh commented 3 years ago

The error occurs in the middle of training.

This is my training script; I put all the other settings in train.yaml.

python ./openspeech_cli/hydra_train.py dataset.dataset_path=/home/jun1.oh/ dataset.dataset_download=False dataset.manifest_file_path=/home/jun1.oh/LibriSpeech_daniel/libri_subword_manifest.txt tokenizer.vocab_path=/home/jun1.oh/LibriSpeech_daniel/ trainer.batch_size=16

This is my train.yaml:

defaults:
  - audio: fbank
  - augment: default
  - dataset: librispeech
  - criterion: cross_entropy
  - lr_scheduler: warmup_reduce_lr_on_plateau
  - model: conformer_lstm
  - trainer: gpu
  - tokenizer: libri_subword
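
Each entry in that defaults list selects one YAML file from a Hydra config group (audio, dataset, model, ...), and Hydra merges them into a single config tree at launch; command-line arguments like trainer.batch_size=16 then override the merged values. A minimal sketch of that composition mechanism (config_path="configs" and the file layout are assumptions for illustration, not openspeech's actual structure):

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # each defaults entry appears as a sub-tree of the composed config
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()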
upskyy commented 3 years ago

Thanks. I'll check it out as soon as possible.