Note that it could also be a memory issue, because I have small memory (16 GB). However, if the problem were a memory issue, I would expect to observe an error like:
RuntimeError: CUDA out of memory. Tried to allocate 420.00 MiB (GPU 0; 15.90 GiB total capacity; 3.23 GiB already allocated; 168.75 MiB free; 3.56 GiB reserved in total by PyTorch)
Perhaps it's trying to use >1 GPU somehow? (But it shouldn't.) If that's the case, setting something like CUDA_VISIBLE_DEVICES=0 (or whatever) should address it. Another possibility is that cuda:-2 is not a real device but some kind of error code. That error message likely comes from torch. I think it would be worthwhile to try to catch the error in pdb and print out the devices of all inputs to the function that failed. Once we know which object has the bad device, we can more easily debug.
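(For concreteness, a minimal sketch of pinning the process to one GPU from Python; the environment variable has to be set before torch initializes CUDA, ideally before importing torch at the top of the script. This is an illustration, not code from the recipe.)

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must happen before the first CUDA call

import torch
print(torch.cuda.device_count())  # should now report 1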
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
return _k2.index_select(src, index, default_value)
Could you modify /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py at line 66 to add a print before the call:
print(src.device, index.device)
return _k2.index_select(src, index, default_value)
It may show something that is useful.
@csukuangfj I already printed the devices before, but all of them were cuda:0.
@danpovey I have 4 devices, but before training I set CUDA_VISIBLE_DEVICES=0. I will also try to debug with pdb.
I added a try/except block to the function decode_one_batch() in decode.py:
try:
    best_path = nbest_decoding(
        lattice=lattice,
        num_paths=params.num_paths,
        use_double_scores=params.use_double_scores,
    )
except Exception:
    breakpoint()
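(Aside: instead of a bare except plus breakpoint(), the same inspection can be done with a post-mortem session that keeps the full failing traceback; a sketch under the same call site, not the recipe's actual code:)

import pdb
import sys

try:
    best_path = nbest_decoding(
        lattice=lattice,
        num_paths=params.num_paths,
        use_double_scores=params.use_double_scores,
    )
except Exception:
    # Drops into the innermost failing frame; 'u'/'d' walk the stack.
    pdb.post_mortem(sys.exc_info()[2])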
When I run python -m pdb tdnn_lstm_ctc/decode.py --avg 1 --epoch 8:
(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python -m pdb tdnn_lstm_ctc/decode.py --avg 1 --epoch 8
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(3)<module>()
-> import os
(Pdb) c
2021-09-02 15:43:01,990 INFO [decode.py:330] Decoding started
2021-09-02 15:43:01,990 INFO [decode.py:331] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest', 'num_paths': 30, 'max_frames': 1000, 'epoch': 8, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 15:43:02,604 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 15:43:02,963 INFO [decode.py:340] device: cuda:0
2021-09-02 15:43:09,784 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-8.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
warnings.warn(
2021-09-02 15:43:11,389 INFO [decode.py:277] batch 0, cuts processed until now is 1/171 (0.584795%)
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(185)decode_one_batch()
-> key = f"no_rescore-{params.num_paths}"
(Pdb) lattice.device
device(type='cuda', index=0)
(Pdb)
The problem occurs in nbest_decoding(). Only the lattice is passed to that function, and its device is cuda:0.
I think you are not quite at the place where it failed; you may need to do "c" (continue)?
When I didn't add the try/except block, the log was:
(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python -m pdb tdnn_lstm_ctc/decode.py --avg 1 --epoch 8
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(3)<module>()
-> import os
(Pdb) c
2021-09-02 16:33:33,700 INFO [decode.py:327] Decoding started
2021-09-02 16:33:33,701 INFO [decode.py:328] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest', 'num_paths': 30, 'max_frames': 1000, 'epoch': 8, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 16:33:34,178 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 16:33:34,494 INFO [decode.py:337] device: cuda:0
2021-09-02 16:33:45,349 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-8.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
warnings.warn(
2021-09-02 16:33:47,481 INFO [decode.py:274] batch 0, cuts processed until now is 1/171 (0.584795%)
Traceback (most recent call last):
File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1705, in main
pdb._runscript(mainpyfile)
File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1573, in _runscript
self.run(statement)
File "/path/to/miniconda3/envs/k2/lib/python3.8/bdb.py", line 580, in run
exec(cmd, globals, locals)
File "<string>", line 1, in <module>
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 3, in <module>
import os
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 418, in main
results_dict = decode_dataset(
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
hyps_dict = decode_one_batch(
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
best_path = nbest_decoding(
File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
path_lattice = _intersect_device(
File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
return k2.intersect_device(
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
value = index_select(a_value, a_arc_map, default_value=filler) \
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 160, in index_select
ans = _IndexSelectFunction.apply(src, index, default_value)
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f359b5e32f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f359b5e067b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f34fa699200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f34fa77f0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f34fa6f5bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f34fa6f958f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f34fa710876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f34fa68efcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f35f253a41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(66)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) lattice.device
*** NameError: name 'lattice' is not defined
(Pdb)
I can't reach lattice after the error, hence I added the try/except block.
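(In the post-mortem session pdb lands in the innermost frame, k2/ops.py, where lattice is indeed not in scope; repeating the u (up) command should eventually reach the decode_one_batch frame where it is defined, e.g.:)

(Pdb) u
(Pdb) u
...
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(176)decode_one_batch()
(Pdb) lattice.device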
I added a breakpoint at the place @csukuangfj suggested. The log is here:
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
Traceback (most recent call last):
File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1705, in main
pdb._runscript(mainpyfile)
File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1573, in _runscript
self.run(statement)
File "/path/to/miniconda3/envs/k2/lib/python3.8/bdb.py", line 580, in run
exec(cmd, globals, locals)
File "<string>", line 1, in <module>
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 435, in <module>
main()
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 418, in main
results_dict = decode_dataset(
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
hyps_dict = decode_one_batch(
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
best_path = nbest_decoding(
File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
path_lattice = _intersect_device(
File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
return k2.intersect_device(
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
value = index_select(a_value, a_arc_map, default_value=filler) \
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 161, in index_select
ans = _IndexSelectFunction.apply(src, index, default_value)
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 67, in forward
return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fe9a54c82f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fe9a54c567b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7fe904576200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7fe90465c0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7fe9045d2bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7fe9045d658f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7fe9045ed876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7fe90456bfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7fe9fc41f41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb)
The place in miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py:
65: breakpoint()
66: return _k2.index_select(src, index, default_value)
It might be possible to catch the exception in gdb by doing:

gdb --args python3 whatever.py
(gdb) catch throw
(gdb) r
...
Running with a debug version of k2 would help there, though.
https://k2.readthedocs.io/en/latest/installation/for_developers.html
The above link contains instructions to build a debug version of k2.
> I added a breakpoint at the place @csukuangfj suggested.
Could you also print the shapes of src and index,

print(src.shape)
print(index.shape)

to verify that neither of them is empty?
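(A hypothetical variant of that instrumentation that flags the empty case explicitly, placed just before the _k2 call in forward; a debugging sketch, not the upstream ops.py code:)

print(src.device, index.device)
print(src.shape, index.shape)
if index.numel() == 0:
    print("WARNING: empty index tensor passed to _k2.index_select")
return _k2.index_select(src, index, default_value)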
I checked whether index or src is empty, and noticed that index is empty when the problem occurs.
(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python tdnn_lstm_ctc/decode.py --avg 1 --epoch 8
2021-09-03 08:14:46,220 INFO [decode.py:327] Decoding started
2021-09-03 08:14:46,220 INFO [decode.py:328] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest', 'num_paths': 30, 'max_frames': 1000, 'epoch': 8, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-03 08:14:46,837 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-03 08:14:47,150 INFO [decode.py:337] device: cuda:0
2021-09-03 08:14:55,636 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-8.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
warnings.warn(
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([562]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([15309908]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([15309908]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([15309908]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([106588]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([106588]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([106588]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([30]) torch.Size([1])
2021-09-03 08:14:57,654 INFO [decode.py:274] batch 0, cuts processed until now is 1/171 (0.584795%)
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([2322]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([36303965]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([36303965]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([36303965]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([178240]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([178240]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([178240]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([30]) torch.Size([1])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([1308]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([21094213]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([21094213]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([21094213]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([101191]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([101191]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([101191]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([30]) torch.Size([1])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([0])
Traceback (most recent call last):
File "tdnn_lstm_ctc/decode.py", line 435, in <module>
main()
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "tdnn_lstm_ctc/decode.py", line 418, in main
results_dict = decode_dataset(
File "tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
hyps_dict = decode_one_batch(
File "tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
best_path = nbest_decoding(
File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
path_lattice = _intersect_device(
File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
return k2.intersect_device(
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
value = index_select(a_value, a_arc_map, default_value=filler) \
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 163, in index_select
ans = _IndexSelectFunction.apply(src, index, default_value)
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 69, in forward
return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f42803f82f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f42803f567b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f41df4f8200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f41df5de0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f41df554bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f41df55858f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f41df56f876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f41df4edfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f42d734f41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #52: __libc_start_main + 0xe7 (0x7f43096adb97 in /lib/x86_64-linux-gnu/libc.so.6)
@EmreOzkose Could you show us the version of k2 you are using?
$ python3 -m k2.version
should give you such information.
@csukuangfj My version info is:
Collecting environment information...
k2 version: 1.3
Build type: Release
Git SHA1: 6b8a10fa95213da285b8fce6525b2c5ed42198a6
Git date: Tue Aug 3 05:36:48 2021
Cuda used to build k2: 11.1
cuDNN used to build k2: 8.0.5
Python version used to build k2: 3.8
OS used to build k2: Ubuntu 16.04.7 LTS
CMake version: 3.18.4
GCC version: 5.5.0
CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.8.1
PyTorch is using Cuda: 11.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
I think I understand the issue. I am trying different architectures and features. Since my memory is small, when I increase the number of layers in the model, I have to decrease max_frames. When I use a small number of frames (like 5000), the index comes out empty (size 0) for some batches.
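(If that hypothesis is right, the failure should be reproducible in isolation; a minimal sketch, assuming k2 1.3 with CUDA and assuming the empty index tensor is indeed the trigger:)

import torch
import k2

src = torch.randn(10, device="cuda:0")
# k2's index_select expects an int32 index tensor; here it is deliberately empty.
index = torch.empty(0, dtype=torch.int32, device="cuda:0")

# On the affected k2 version, this would be expected to raise:
#   RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
ans = k2.index_select(src, index)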
I would recommend updating your k2.
k2 v1.6 contains several bug fixes, including, I think, the one you are facing. Since you are using conda, the steps to update k2 are fairly simple. Please see https://k2.readthedocs.io/en/latest/installation/conda.html
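(For reference, the linked page documents a single conda command, at the time something along the lines of conda install -c k2-fsa -c pytorch -c nvidia k2 python=3.8 cudatoolkit=11.1 pytorch=1.8.1; check the page for the exact channels and version pins matching your environment.)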
Thank you so much! I am updating at once.
I want to report back here. I updated k2 and ran decode.py again. The problem is no longer occurring, thank you. However, the hyps are coming back empty :). From now on, it is my design's problem :).
Hello,
I am training a TDNN-LSTM model with the librispeech recipe on 100 hours of 16 kHz data. After training, I run decode.py. I sometimes observe a CUDA issue (given below). Have you ever observed something like this? I think it is related to something during training, because after some training runs decode.py works well, while after others it gives this error. I googled the
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
error but found nothing. I have a Tesla P100 (16 GB). I should also mention that 1best works well; the problem occurs during nbest and rescoring.