k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

CUDA out of memory in decoding #66

Closed cdxie closed 3 years ago

cdxie commented 3 years ago

Hi, I am new to icefall. I finished training tdnn_lstm_ctc, but when I run the decoding step I hit the following error. I changed --max-duration, but the error persists:

2021-10-04 00:42:07,942 INFO [decode.py:383] Decoding started
2021-10-04 00:42:07,942 INFO [decode.py:384] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 19, 'avg': 5, 'method': 'whole-lattice-rescoring', 'num_paths': 100, 'lattice_score_scale': 0.5, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 50, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-04 00:42:08,361 INFO [lexicon.py:113] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-04 00:42:08,614 INFO [decode.py:393] device: cuda:0
2021-10-04 00:42:23,560 INFO [decode.py:406] Loading G_4_gram.fst.txt
2021-10-04 00:42:23,560 WARNING [decode.py:407] It may take 8 minutes.
Traceback (most recent call last):
  File "./tdnn_lstm_ctc/decode.py", line 492, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "./tdnn_lstm_ctc/decode.py", line 420, in main
    G = k2.arc_sort(G)
  File "/opt/conda/lib/python3.8/site-packages/k2-1.8.dev20210918+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/k2/fsa_algo.py", line 441, in arc_sort
    ragged_arc, arc_map = _k2.arc_sort(fsa.arcs, need_arc_map=need_arc_map)
RuntimeError: CUDA out of memory. Tried to allocate 884.00 MiB (GPU 0; 15.78 GiB total capacity; 14.28 GiB already allocated; 461.19 MiB free; 14.29 GiB reserved in total by PyTorch)

The device used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   27C    P0    25W / 250W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Would you give me some advice? Thanks.

cdxie commented 3 years ago

Another question: we also have a machine cluster, but on those machines I cannot choose the device number. Should I mask (comment out) the following lines in decode.py?

    if torch.cuda.is_available():
        device = torch.device("cuda", 0)

csukuangfj commented 3 years ago

I hit the following error. I changed --max-duration, but the error persists:

There are several things you can do:

(1) Switch to a GPU with more RAM, e.g., 32 GB.

(2) Use a decoding method that does not involve an LM, i.e., use --method 1best.

(3) Change
https://github.com/k2-fsa/icefall/blob/adb068eb8242fe79dafce5a100c3fdfad934c7a5/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py#L423-L424
to

 G = k2.arc_sort(G)                     # arc-sort while G is still on the CPU
 G = k2.Fsa.from_fsas([G]).to(device)   # then move the sorted FSA to the GPU

I assume it will not cause OOM errors in the later decoding steps.

(4) Prune your G. You can use the script from https://github.com/kaldi-asr/kaldi/pull/4594 to prune it. (Note: it is a single Python script with no dependencies on Kaldi.)
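For illustration, here is a minimal sketch of the idea behind option (3): do all of the G preparation (arc-sorting, adding epsilon self-loops) on the CPU and move the finished FSA to the GPU only once. The file path and the from_openfst loading step are assumptions based on the log above, not a verbatim excerpt of decode.py:

    import k2
    import torch

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    # Load the 4-gram LM as an FST on the CPU. (Reading G_4_gram.fst.txt in
    # OpenFst text format is an assumption based on the log above.)
    with open("data/lm/G_4_gram.fst.txt") as f:
        G = k2.Fsa.from_openfst(f.read(), acceptor=False)

    # Run the memory-hungry transforms while G still lives in host RAM.
    G = k2.arc_sort(G)
    G = k2.add_epsilon_self_loops(G)
    G = k2.arc_sort(G)

    # Only now move the prepared FSA to the GPU, as a single-element FsaVec.
    G = k2.Fsa.from_fsas([G]).to(device)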

csukuangfj commented 3 years ago

Should I mask (comment out) the following lines in decode.py?

    if torch.cuda.is_available():
        device = torch.device("cuda", 0)

Can you use device = torch.device("cuda") to select your default CUDA device?

If you use the CPU, decoding will be slow.
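As a sketch of that suggestion (the environment-variable handling is an assumption about how your cluster exposes GPUs, not icefall code), you can let CUDA_VISIBLE_DEVICES decide which GPU is visible and ask PyTorch for the default CUDA device instead of hard-coding an index:

    import torch

    # If the cluster scheduler sets CUDA_VISIBLE_DEVICES for each job,
    # "cuda" already refers to the GPU you were allocated, so there is
    # no need to pin an explicit index such as ("cuda", 0).
    device = torch.device("cpu")
    if torch.cuda.is_available():
        device = torch.device("cuda")

    print(f"Decoding on: {device}")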

cdxie commented 3 years ago

I hit the following error. I changed --max-duration, but the error persists:

There are several things you can do:

(1) Switch to a GPU with more RAM, e.g., 32 GB.

(2) Use a decoding method that does not involve an LM, i.e., use --method 1best.

(3) Change

https://github.com/k2-fsa/icefall/blob/adb068eb8242fe79dafce5a100c3fdfad934c7a5/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py#L423-L424

to

 G = k2.arc_sort(G)                     # arc-sort while G is still on the CPU
 G = k2.Fsa.from_fsas([G]).to(device)   # then move the sorted FSA to the GPU

I assume it will not cause OOM errors in the later decoding steps.

(4) Prune your G. You can use the script from kaldi-asr/kaldi#4594 to prune it. (Note: it is a single Python script with no dependencies on Kaldi.)

I tried method (3); there are still errors:

2021-10-05 00:00:07,427 INFO [decode.py:387] Decoding started
2021-10-05 00:00:07,427 INFO [decode.py:388] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 19, 'avg': 5, 'method': 'whole-lattice-rescoring', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 100, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-05 00:00:07,947 INFO [lexicon.py:113] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-05 00:00:08,310 INFO [decode.py:397] device: cuda
2021-10-05 00:00:46,069 INFO [decode.py:410] Loading G_4_gram.fst.txt
2021-10-05 00:00:46,070 WARNING [decode.py:411] It may take 8 minutes.
Traceback (most recent call last):
  File "./tdnn_lstm_ctc/decode.py", line 497, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "./tdnn_lstm_ctc/decode.py", line 435, in main
    G = k2.add_epsilon_self_loops(G)
  File "/opt/conda/lib/python3.8/site-packages/k2-1.8.dev20210918+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/k2/fsa_algo.py", line 499, in add_epsilon_self_loops
    ragged_arc, arc_map = _k2.add_epsilon_self_loops(fsa.arcs,
RuntimeError: CUDA out of memory. Tried to allocate 4.73 GiB (GPU 0; 15.78 GiB total capacity; 9.21 GiB already allocated; 3.90 GiB free; 10.85 GiB reserved in total by PyTorch)

I think I should try (1).