sooftware / kospeech

Open-Source Toolkit for End-to-End Korean Automatic Speech Recognition leveraging PyTorch and Hydra.
https://sooftware.github.io/kospeech/
Apache License 2.0

[Train Error] invalid shape, thread error #36

Closed: minlom closed this issue 4 years ago

minlom commented 4 years ago

Hello, I came across this project while studying speech recognition and have been referring to it a lot. First of all, thank you for organizing such good information so cleanly; it has been a great help.

When I try to train with the shared code, the error below occurs and the process stays up without making any progress. The environment is Ubuntu with a single GPU. For data preprocessing I followed the code with some path changes, and pointed opts.py at the changed files.

There are two main errors:

1. Invalid shape error - RuntimeError: shape '[32, -1]' is invalid for input of size 2860. The dataset is AIHUB data, and the option values follow the defaults in opts.py. Is there a setting I might have missed here?

2. Thread error - Exception ignored in: <module 'threading' from '/home/mchoe/.conda/envs/sr/lib/python3.6/threading.py'>. This one occurs when I force-kill the process. Do I need to configure threading separately? The code uses threading in data_loader.py; could that be related?

If you know anything about this, I would really appreciate a reply.
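For what it's worth, the second message quoted above is a generic CPython shutdown artifact: a non-daemon thread still alive at interpreter exit gets `join()`ed inside `threading._shutdown`, and a Ctrl-C lands there. A tiny sketch of the distinction (hypothetical worker, not kospeech's actual data_loader.py code):

```python
import threading
import time

def worker():
    # Stand-in for a blocked loader thread (hypothetical).
    time.sleep(0.05)

# daemon=False (the default): interpreter exit join()s the thread in
# threading._shutdown -- a Ctrl-C during that join produces exactly the
# "Exception ignored in: <module 'threading' ...>" message quoted above.
# daemon=True: the interpreter exits without waiting for the thread.
t = threading.Thread(target=worker, daemon=True)
t.start()
print(t.daemon)  # True
t.join()
```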

[2020-07-20 17:51:47,573 utils.py:21 - info()] Operating System : Linux 5.3.0-61-generic
[2020-07-20 17:51:47,573 utils.py:21 - info()] Processor : x86_64
[2020-07-20 17:51:47,574 utils.py:21 - info()] device : GeForce GTX 1080 Ti
[2020-07-20 17:51:47,574 utils.py:21 - info()] CUDA is available : True
[2020-07-20 17:51:47,574 utils.py:21 - info()] CUDA version : 10.2
[2020-07-20 17:51:47,574 utils.py:21 - info()] PyTorch version : 1.5.1
100%|██████████| 497658/497658 [00:09<00:00, 53032.36it/s]
[2020-07-20 17:51:58,358 utils.py:21 - info()] split dataset start !!
[2020-07-20 17:52:00,376 utils.py:21 - info()] split dataset complete !!
[2020-07-20 17:52:01,611 utils.py:21 - info()] start
[2020-07-20 17:52:01,611 utils.py:21 - info()] Epoch 0 start
Traceback (most recent call last):
  File "/home/mchoe/PycharmProjects/sr2/bin/main.py", line 99, in <module>
    main()
  File "/home/mchoe/PycharmProjects/sr2/bin/main.py", line 95, in main
    train(opt)
  File "/home/mchoe/PycharmProjects/sr2/bin/main.py", line 74, in train
    num_epochs=opt.num_epochs, teacher_forcing_ratio=opt.teacher_forcing_ratio, resume=opt.resume)
  File "/home/mchoe/PycharmProjects/sr2/kospeech/trainer/supervised_trainer.py", line 104, in train
    train_queue, teacher_forcing_ratio)
  File "/home/mchoe/PycharmProjects/sr2/kospeech/trainer/supervised_trainer.py", line 188, in train_epoches
    logit = model(inputs, input_lengths, targets, teacher_forcing_ratio=teacher_forcing_ratio)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/PycharmProjects/sr2/kospeech/models/seq2seq/seq2seq.py", line 37, in forward
    result = self.decoder(targets, output, teacher_forcing_ratio, language_model)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/PycharmProjects/sr2/kospeech/models/seq2seq/decoder.py", line 127, in forward
    inputs = inputs[inputs != self.eos_id].view(batch_size, -1)
RuntimeError: shape '[32, -1]' is invalid for input of size 2860
Exception ignored in: <module 'threading' from '/home/mchoe/.conda/envs/sr/lib/python3.6/threading.py'>
Traceback (most recent call last):
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/threading.py", line 1294, in _shutdown
    t.join()
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
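The failing line in decoder.py filters eos tokens with a boolean mask and then reshapes. A plain-Python sketch (illustrative values, not the real data) of why `view(batch_size, -1)` can fail when rows contain different numbers of eos tokens:

```python
# Mimics inputs[inputs != self.eos_id].view(batch_size, -1): the boolean
# mask flattens the tensor, and view(batch_size, -1) only succeeds if every
# row loses the SAME number of eos tokens. Toy values for illustration.
eos_id = 2
batch = [
    [5, 7, 9, 2],  # 3 non-eos tokens
    [4, 6, 2, 2],  # 2 non-eos tokens -> rows become uneven
]
flat = [tok for row in batch for tok in row if tok != eos_id]
batch_size = len(batch)
print(len(flat) % batch_size)  # 1 -> not divisible, so the reshape fails
print(2860 % 32)               # 12 -> same situation as the reported error
```

So a batch whose target rows carry unequal eos/pad counts reproduces exactly this RuntimeError.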

Process finished with exit code 1

sooftware commented 4 years ago

If you ran with the default settings, it should work without problems.
If the code hung without making progress, it could be because a thread died.

To pin down the error more precisely, could you run with a single thread and see how it errors out or where it freezes?
You can do this by setting the num_workers option in run.sh to 1.

sooftware commented 4 years ago

For the invalid shape error, I would need to check exactly how the options were set.
The options should all be printed before training starts; if you paste them into this issue, I will take a look.

minlom commented 4 years ago

Thank you for the quick reply.

I set num_workers to 1 as you suggested and relaunched, but the symptom is the same. Could running under a conda environment cause any problem? Also, could you tell me roughly how long one training epoch takes for you, along with your hardware setup? Thanks!

I am also attaching the log of the configured options. Everything is at the defaults except num_workers=1 and the changed dataset file paths.

[2020-07-21 09:40:48,147 utils.py:21 - info()] --mode: train
--transform_method: mel
--sample_rate: 16000
--window_size: 20
--stride: 10
--n_mels: 80
--normalize: False
--del_silence: False
--input_reverse: False
--feature_extract_by: librosa
--time_mask_para: 50
--freq_mask_para: 12
--time_mask_num: 2
--freq_mask_num: 2
--architecture: seq2seq
--use_bidirectional: False
--mask_conv: False
--hidden_dim: 256
--dropout: 0.3
--attn_mechanism: loc
--num_heads: 8
--label_smoothing: 0.1
--num_encoder_layers: 3
--num_decoder_layers: 2
--extractor: vgg
--activation: hardtanh
--rnn_type: gru
--teacher_forcing_ratio: 0.99
--dataset_path: /home/mchoe/sr/data/
--data_list_path: /home/mchoe/PycharmProjects/sr/data/data_list/filter_train_list2.csv
--label_path: /home/mchoe/PycharmProjects/sr/verf/aihub_label_table.dat
--spec_augment: False
--noise_augment: False
--noiseset_size: 1000
--noise_level: 0.7
--use_cuda: True
--batch_size: 32
--num_workers: 1
--num_epochs: 20
--init_lr: 0.0003
--high_plateau_lr: 0.0003
--low_plateau_lr: 3e-05
--decay_threshold: 0.02
--rampup_period: 1000
--exp_decay_period: 160000
--valid_ratio: 0.01
--max_len: 120
--max_grad_norm: 400
--teacher_forcing_step: 0.05
--min_teacher_forcing_ratio: 0.7
--seed: 7
--save_result_every: 1000
--checkpoint_every: 5000
--print_every: 10
--resume: False
[2020-07-21 09:40:48,163 utils.py:21 - info()] Operating System : Linux 5.3.0-61-generic
[2020-07-21 09:40:48,163 utils.py:21 - info()] Processor : x86_64
[2020-07-21 09:40:48,164 utils.py:21 - info()] device : GeForce GTX 1080 Ti
[2020-07-21 09:40:48,164 utils.py:21 - info()] CUDA is available : True
[2020-07-21 09:40:48,164 utils.py:21 - info()] CUDA version : 10.2
[2020-07-21 09:40:48,164 utils.py:21 - info()] PyTorch version : 1.5.1
100%|██████████| 497658/497658 [00:09<00:00, 52784.78it/s]
[2020-07-21 09:40:58,967 utils.py:21 - info()] split dataset start !!
[2020-07-21 09:41:01,214 utils.py:21 - info()] split dataset complete !!
[2020-07-21 09:41:02,409 utils.py:21 - info()] start
[2020-07-21 09:41:02,409 utils.py:21 - info()] Epoch 0 start
Traceback (most recent call last):
  File "/home/mchoe/PycharmProjects/sr2/bin/main.py", line 101, in <module>
    main()
  File "/home/mchoe/PycharmProjects/sr2/bin/main.py", line 97, in main
    train(opt)
  File "/home/mchoe/PycharmProjects/sr2/bin/main.py", line 75, in train
    num_epochs=opt.num_epochs, teacher_forcing_ratio=opt.teacher_forcing_ratio, resume=opt.resume)
  File "/home/mchoe/PycharmProjects/sr2/kospeech/trainer/supervised_trainer.py", line 104, in train
    train_queue, teacher_forcing_ratio)
  File "/home/mchoe/PycharmProjects/sr2/kospeech/trainer/supervised_trainer.py", line 188, in train_epoches
    logit = model(inputs, input_lengths, targets, teacher_forcing_ratio=teacher_forcing_ratio)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/PycharmProjects/sr2/kospeech/models/seq2seq/seq2seq.py", line 37, in forward
    result = self.decoder(targets, output, teacher_forcing_ratio, language_model)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/PycharmProjects/sr2/kospeech/models/seq2seq/decoder.py", line 126, in forward
    inputs = inputs[inputs != self.eos_id].view(batch_size, -1)
RuntimeError: shape '[32, -1]' is invalid for input of size 3300
Exception ignored in: <module 'threading' from '/home/mchoe/.conda/envs/sr/lib/python3.6/threading.py'>
Traceback (most recent call last):
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/threading.py", line 1294, in _shutdown
    t.join()
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

sooftware commented 4 years ago

It looks like you ran main.py directly from PyCharm or a similar environment; you need to launch through run.sh so that the intended options are passed.
Options such as use_cuda default to False, and in the log you posted, many options that should be True are still set to False. (The defaults are False because passing the flag on the command line is what flips it to True.)
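The flag behavior described above can be sketched with argparse. The flag names come from the options log; that opts.py registers them with `action='store_true'` is an assumption consistent with this explanation:

```python
import argparse

parser = argparse.ArgumentParser()
# store_true flags are False unless explicitly passed on the command line,
# which is why launching main.py without run.sh leaves them all at False.
parser.add_argument('--use_cuda', action='store_true', default=False)
parser.add_argument('--normalize', action='store_true', default=False)

print(parser.parse_args([]).use_cuda)              # False: flag not passed
print(parser.parse_args(['--use_cuda']).use_cuda)  # True: flag passed
```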

Also, I train on a server with four GTX TITAN X GPUs, and one epoch takes about 17 hours. (It varies with the options, and since the server itself is old, it may be slower than it should be.)

minlom commented 4 years ago

I launched through run.sh as you suggested, but the same invalid shape error occurs. Is anything different in the settings below?

Here is the options log from that launch. I adjusted the parameters below.

[2020-07-21 11:00:35,319 utils.py:21 - info()] --mode: train
--transform_method: spect
--sample_rate: 16000
--window_size: 20
--stride: 10
--n_mels: 80
--normalize: True
--del_silence: True
--input_reverse: False
--feature_extract_by: librosa
--time_mask_para: 40
--freq_mask_para: 12
--time_mask_num: 2
--freq_mask_num: 2
--architecture: seq2seq
--use_bidirectional: True
--mask_conv: False
--hidden_dim: 512
--dropout: 0.4
--attn_mechanism: dot
--num_heads: 4
--label_smoothing: 0.1
--num_encoder_layers: 3
--num_decoder_layers: 2
--extractor: vgg
--activation: hardtanh
--rnn_type: lstm
--teacher_forcing_ratio: 1.0
--dataset_path: /home/mchoe/sr/data/
--data_list_path: /home/mchoe/PycharmProjects/sr/data/data_list/filter_train_list3.csv
--label_path: /home/mchoe/PycharmProjects/sr/verf/aihub_label_table.dat
--spec_augment: True
--noise_augment: False
--noiseset_size: 1000
--noise_level: 0.7
--use_cuda: True
--batch_size: 32
--num_workers: 1
--num_epochs: 10
--init_lr: 0.0003
--high_plateau_lr: 0.0003
--low_plateau_lr: 1e-05
--decay_threshold: 0.02
--rampup_period: 0
--exp_decay_period: 120000
--valid_ratio: 0.002
--max_len: 120
--max_grad_norm: 400
--teacher_forcing_step: 0.02
--min_teacher_forcing_ratio: 0.8
--seed: 7
--save_result_every: 1000
--checkpoint_every: 5000
--print_every: 10
--resume: False
[2020-07-21 11:00:35,332 utils.py:21 - info()] Operating System : Linux 5.3.0-61-generic
[2020-07-21 11:00:35,332 utils.py:21 - info()] Processor : x86_64
[2020-07-21 11:00:35,333 utils.py:21 - info()] device : GeForce GTX 1080 Ti
[2020-07-21 11:00:35,333 utils.py:21 - info()] CUDA is available : True
[2020-07-21 11:00:35,333 utils.py:21 - info()] CUDA version : 10.2
[2020-07-21 11:00:35,333 utils.py:21 - info()] PyTorch version : 1.5.1
[2020-07-21 11:00:35,937 utils.py:21 - info()] split dataset start !!
[2020-07-21 11:00:35,991 utils.py:21 - info()] Applying Spec Augmentation...
[2020-07-21 11:00:36,049 utils.py:21 - info()] split dataset complete !!
[2020-07-21 11:00:37,414 utils.py:21 - info()] start
[2020-07-21 11:00:37,414 utils.py:21 - info()] Epoch 0 start
Traceback (most recent call last):
  File "./main.py", line 101, in <module>
    main()
  File "./main.py", line 97, in main
    train(opt)
  File "./main.py", line 75, in train
    num_epochs=opt.num_epochs, teacher_forcing_ratio=opt.teacher_forcing_ratio, resume=opt.resume)
  File "../kospeech/trainer/supervised_trainer.py", line 104, in train
    train_queue, teacher_forcing_ratio)
  File "../kospeech/trainer/supervised_trainer.py", line 188, in train_epoches
    logit = model(inputs, input_lengths, targets, teacher_forcing_ratio=teacher_forcing_ratio)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../kospeech/models/seq2seq/seq2seq.py", line 37, in forward
    result = self.decoder(targets, output, teacher_forcing_ratio, language_model)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../kospeech/models/seq2seq/decoder.py", line 126, in forward
    inputs = inputs[inputs != self.eos_id].view(batch_size, -1)
RuntimeError: shape '[32, -1]' is invalid for input of size 2990

Also, when I run it again with the same settings, a GPU out-of-memory error sometimes occurs instead, so I am planning to set everything up again on a better-specced server.

Traceback (most recent call last):
  File "./main.py", line 101, in <module>
    main()
  File "./main.py", line 97, in main
    train(opt)
  File "./main.py", line 75, in train
    num_epochs=opt.num_epochs, teacher_forcing_ratio=opt.teacher_forcing_ratio, resume=opt.resume)
  File "../kospeech/trainer/supervised_trainer.py", line 104, in train
    train_queue, teacher_forcing_ratio)
  File "../kospeech/trainer/supervised_trainer.py", line 188, in train_epoches
    logit = model(inputs, input_lengths, targets, teacher_forcing_ratio=teacher_forcing_ratio)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../kospeech/models/seq2seq/seq2seq.py", line 36, in forward
    output = self.encoder(inputs, input_lengths)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../kospeech/models/seq2seq/encoder.py", line 74, in forward
    output, hidden = self.rnn(conv_feat)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 570, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 914.00 MiB (GPU 0; 10.92 GiB total capacity; 9.76 GiB already allocated; 508.19 MiB free; 9.80 GiB reserved in total by PyTorch)
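Before moving to a bigger server, a smaller batch is the usual first lever: this run doubled hidden_dim (256 to 512) and enabled a bidirectional LSTM, which substantially increases activation memory at the same batch size. A generic halving sketch; the memory probe here is a hypothetical placeholder, not a real CUDA call:

```python
def fits_in_memory(batch_size):
    # Hypothetical probe. In practice, run one forward/backward pass in a
    # try/except RuntimeError and treat "CUDA out of memory" as False.
    return batch_size <= 16

# Halve the batch size until a training step fits on the GPU.
batch_size = 32
while batch_size > 1 and not fits_in_memory(batch_size):
    batch_size //= 2
print(batch_size)  # 16 with this toy probe
```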

sooftware commented 4 years ago

I will pull the code fresh and try running it myself.

sooftware commented 4 years ago

Could you pull the code fresh and try running it again?
I pulled it fresh, changed only the batch size and dataset path, and training proceeds normally for me.

sooftware commented 4 years ago

I pushed some commits recently while debugging; you may have pulled the code at that point.

minlom commented 4 years ago

I pulled the code fresh and ran it, but now a cuDNN error occurs. From some searching it looks like it might be a library version issue, though I need to dig into the cause further. What CUDA and PyTorch versions are you using?

Traceback (most recent call last):
  File "./main.py", line 100, in <module>
    main()
  File "./main.py", line 96, in main
    train(opt)
  File "./main.py", line 75, in train
    num_epochs=opt.num_epochs, teacher_forcing_ratio=opt.teacher_forcing_ratio, resume=opt.resume)
  File "../kospeech/trainer/supervised_trainer.py", line 104, in train
    train_queue, teacher_forcing_ratio)
  File "../kospeech/trainer/supervised_trainer.py", line 190, in train_epoches
    logit = model(inputs, input_lengths, targets, teacher_forcing_ratio=teacher_forcing_ratio)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../kospeech/models/seq2seq/seq2seq.py", line 36, in forward
    output = self.encoder(inputs, input_lengths)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../kospeech/models/seq2seq/encoder.py", line 74, in forward
    output, hidden = self.rnn(conv_feat)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mchoe/.conda/envs/sr/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 570, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

sooftware commented 4 years ago

I am using PyTorch 1.5.0 with CUDA 10.1. You should go to torch.org and install mutually compatible versions of PyTorch and CUDA. Beyond that, I am afraid I cannot help much with environment setup.
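For reference, the pairing mentioned here can be sanity-checked against the install matrix. A toy lookup; the table contents are my recollection of the prebuilt wheels for the 1.5.x line, so verify on pytorch.org:

```python
# CUDA versions that prebuilt PyTorch 1.5.x wheels targeted (assumed from
# the install matrix of the time -- confirm on pytorch.org).
SUPPORTED_CUDA = {
    "1.5.0": {"9.2", "10.1", "10.2"},
    "1.5.1": {"9.2", "10.1", "10.2"},
}

def compatible(torch_version: str, cuda_version: str) -> bool:
    return cuda_version in SUPPORTED_CUDA.get(torch_version, set())

print(compatible("1.5.1", "10.2"))  # True  -> the reporter's combination
print(compatible("1.5.0", "10.1"))  # True  -> the maintainer's combination
```

Note that a wheel's bundled CUDA/cuDNN must also match the driver installed on the machine; a mismatch there is a common source of CUDNN_STATUS_EXECUTION_FAILED.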

minlom commented 4 years ago

Understood, thanks for checking. I will keep looking at this issue and share once I find the cause.

sooftware commented 4 years ago

Right, this error seems to come from CUDA and cuDNN rather than from our code.
I have hit CUDNN_STATUS_EXECUTION_FAILED before as well, and I recall it was resolved by reinstalling PyTorch, CUDA, and cuDNN.

Oh, and could you try adding the CUDA_LAUNCH_BLOCKING=1 option? I remember running with that option in a single-GPU environment.
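Concretely, the variable just needs to be in the training process's environment. One way to set it from a launcher script (the run.sh invocation is hypothetical; adjust to your actual command):

```python
import os
import subprocess  # used only by the commented-out launch line below

# CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so an
# asynchronous CUDA failure is reported at the op that actually caused it
# instead of at some later, unrelated line.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
# subprocess.run(["bash", "run.sh"], env=env)  # hypothetical invocation
print(env["CUDA_LAUNCH_BLOCKING"])  # 1
```

Equivalently, prefix the shell command itself: `CUDA_LAUNCH_BLOCKING=1 bash run.sh`.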