Closed: jasonyong closed this issue 3 years ago.
It is hard to tell. Just to confirm, are you able to use another model file without errors? For example, could you please try 1g.py?
I've tried all the other models (1b, 1c, 1e, 1f, ...); the same error happens.
I debugged 1d.py after the forward pass, i.e. right after
output, xent_output = model(features)
When I call output.sum().backward() I get the error "ERROR: Cannot convert to CuSubMatrix because tensor is not contiguous", but if I call xent_output.sum().backward() no error occurs, which is strange.
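For reference, this message usually means a Kaldi-side op received a tensor (or an incoming gradient) that is not contiguous in memory. The sketch below shows the generic guard pattern for a custom torch.autograd.Function; it is illustrative only and is not pkwrap's actual implementation.

import torch

class KaldiStyleOp(torch.autograd.Function):
    """Illustrative only: how an op that hands tensors to Kaldi (e.g. as a
    CuSubMatrix) can guard against non-contiguous inputs and gradients."""

    @staticmethod
    def forward(ctx, x):
        # Kaldi expects row-major, contiguous memory; .contiguous() copies
        # only when needed (e.g. after a transpose/slice in the model).
        x = x.contiguous()
        ctx.save_for_backward(x)
        return 2.0 * x  # stand-in for the real Kaldi computation

    @staticmethod
    def backward(ctx, grad_out):
        # Gradients produced by autograd are not guaranteed to be
        # contiguous either, so the same guard applies here.
        return 2.0 * grad_out.contiguous()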
I'm unable to reproduce the error. Could you please confirm that you are using the version of Kaldi mentioned in the README?
The Kaldi version is d79c896.
Two things: the version of Kaldi mentioned in the README is 6f329a62, but I suspect this is not the issue.

I have tried PR #5; the forward and backward passes now run normally, but another segmentation fault occurs when I run the script:
python3 local/chain/tuning/run_tdnn.py --model-file local/chain/tuning/model/1d.py --stage 6
it outputs like this:
Running iter=0 of 12 lr=0.002
Num jobs = 2
bash: line 1: 80675 Segmentation fault ( local/chain/tuning/model/1d.py --dir exp/chain/tdnn_sp --mode training --lr 0.002 --frame-shift 0 --egs ark:exp/chain/tdnn_sp/egs/cegs.1.ark --l2-regularize-factor 0.5 --minibatch-size 128,64 --new-model exp/chain/tdnn_sp/0.1.pt exp/chain/tdnn_sp/0.pt ) 2>> exp/chain/tdnn_sp/log/train.0.1.log >> exp/chain/tdnn_sp/log/train.0.1.log
run.pl: job failed, log is in exp/chain/tdnn_sp/log/train.0.1.log
bash: line 1: 80676 Segmentation fault ( local/chain/tuning/model/1d.py --dir exp/chain/tdnn_sp --mode training --lr 0.002 --frame-shift 1 --egs ark:exp/chain/tdnn_sp/egs/cegs.2.ark --l2-regularize-factor 0.5 --minibatch-size 128,64 --new-model exp/chain/tdnn_sp/0.2.pt exp/chain/tdnn_sp/0.pt ) 2>> exp/chain/tdnn_sp/log/train.0.2.log >> exp/chain/tdnn_sp/log/train.0.2.log
run.pl: job failed, log is in exp/chain/tdnn_sp/log/train.0.2.log
When I run on just one GPU with this script:
python3 local/chain/tuning/model/1d.py --dir exp/chain/tdnn_sp --mode training --lr 0.002 --frame-shift 0 --egs ark:exp/chain/tdnn_sp/egs/cegs.1.ark --l2-regularize-factor 0.5 --minibatch-size 128,64 --new-model exp/chain/tdnn_sp/0.1.pt exp/chain/tdnn_sp/0.pt
the last output is:
Overall objf=-0.41054922342300415
objf=-0.3821493089199066, l2=-0.04791105166077614, xent_objf=-4.288857936859131
objf=-0.4246918559074402, l2=-0.04850480332970619, xent_objf=-4.277522087097168
objf=-0.33227378129959106, l2=-0.05701010301709175, xent_objf=-4.322901725769043
objf=-0.36673516035079956, l2=-0.04990071803331375, xent_objf=-4.098158836364746
objf=-0.3561687171459198, l2=-0.04755700007081032, xent_objf=-4.121530055999756
objf=-0.3618038296699524, l2=-0.05268600583076477, xent_objf=-4.12954044342041
objf=-0.3476528525352478, l2=-0.049845851957798004, xent_objf=-4.248356342315674
objf=-0.3530285656452179, l2=-0.04898693785071373, xent_objf=-4.082286834716797
objf=-0.3549785315990448, l2=-0.05094904080033302, xent_objf=-4.141960144042969
Segmentation fault
It seems the error happens during the Python exit stage, because the "0.1.pt" model has already been saved. When I insert a pdb.set_trace() at the last line of training and step through, the trace is:
-> pdb.set_trace() (Pdb) n
--Call-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1263)_shutdown() -> def _shutdown(): (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1269)_shutdown() -> if _main_thread._is_stopped: (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1272)_shutdown() -> tlock = _main_thread._tstate_lock (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1275)_shutdown() -> assert tlock is not None (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1276)_shutdown() -> assert tlock.locked() (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1277)_shutdown() -> tlock.release() (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1278)_shutdown() -> _main_thread._stop() (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1279)_shutdown() -> t = _pickSomeNonDaemonThread() (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1280)_shutdown() -> while t: (Pdb) n
--Return-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/threading.py(1280)_shutdown()->None -> while t: (Pdb) n
--Call-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/concurrent/futures/thread.py(33)_python_exit() -> def _python_exit(): (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/concurrent/futures/thread.py(35)_python_exit() -> _shutdown = True (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/concurrent/futures/thread.py(36)_python_exit() -> items = list(_threads_queues.items()) (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/concurrent/futures/thread.py(37)_python_exit() -> for t, q in items: (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/concurrent/futures/thread.py(39)_python_exit() -> for t, q in items: (Pdb) n
--Return-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/concurrent/futures/thread.py(39)_python_exit()->None -> for t, q in items: (Pdb) n
--Call-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/__init__.py(38)_set_python_exit_flag() -> def _set_python_exit_flag(): (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/__init__.py(40)_set_python_exit_flag() -> python_exit_status = True (Pdb) n
--Return-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/__init__.py(40)_set_python_exit_flag()->None -> python_exit_status = True (Pdb) n
--Call-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2023)shutdown() -> def shutdown(handlerList=_handlerList): (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2030)shutdown() -> for wr in reversed(handlerList[:]): (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2033)shutdown() -> try: (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2034)shutdown() -> h = wr() (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2035)shutdown() -> if h: (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2036)shutdown() -> try: (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2037)shutdown() -> h.acquire() (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2038)shutdown() -> h.flush() (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2039)shutdown() -> h.close() (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2047)shutdown() -> h.release() (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2030)shutdown() -> for wr in reversed(handlerList[:]): (Pdb) n
--Return-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/logging/__init__.py(2030)shutdown()->None -> for wr in reversed(handlerList[:]): (Pdb) n
--Call-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(285)_exit_function() -> def _exit_function(info=info, debug=debug, _run_finalizers=_run_finalizers, (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(294)_exit_function() -> if not _exiting: (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(295)_exit_function() -> _exiting = True (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(297)_exit_function() -> info('process shutting down') (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(298)_exit_function() -> debug('running all "atexit" finalizers with priority >= 0') (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(299)_exit_function() -> _run_finalizers(0) (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(301)_exit_function() -> if current_process() is not None: (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(315)_exit_function() -> for p in active_children(): (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(320)_exit_function() -> for p in active_children(): (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(324)_exit_function() -> debug('running the remaining "atexit" finalizers') (Pdb) n
/search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(325)_exit_function() -> _run_finalizers() (Pdb) n
--Return-- /search/odin/jiangyongjun/anaconda3/lib/python3.7/multiprocessing/util.py(325)_exit_function()->None -> _run_finalizers() (Pdb) n
--Call--
Exception ignored in: <function WeakKeyDictionary.__init__.<locals>.remove at 0x7f4ae2cf6d90>
Traceback (most recent call last):
  File "/search/odin/jiangyongjun/anaconda3/lib/python3.7/weakref.py", line 358, in remove
  File "/search/odin/jiangyongjun/anaconda3/lib/python3.7/bdb.py", line 90, in trace_dispatch
  File "/search/odin/jiangyongjun/anaconda3/lib/python3.7/bdb.py", line 134, in dispatch_call
  File "/search/odin/jiangyongjun/anaconda3/lib/python3.7/pdb.py", line 251, in user_call
  File "/search/odin/jiangyongjun/anaconda3/lib/python3.7/pdb.py", line 351, in interaction
  File "/search/odin/jiangyongjun/anaconda3/lib/python3.7/pdb.py", line 1457, in print_stack_entry
  File "/search/odin/jiangyongjun/anaconda3/lib/python3.7/bdb.py", line 541, in format_stack_entry
TypeError: 'NoneType' object is not callable
Segmentation fault
I see that this type of error occurs with PyTorch 1.5 but not 1.6. Would it be possible for you to upgrade to PyTorch 1.6?
I am able to run without segfaults on PyTorch 1.4 and 1.6; it seems to be very specific to PyTorch 1.5. Specifically, the fault is raised by the call to pkwrap.kaldi.InstantiateKaldiCuda().
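If upgrading is not immediately possible, a defensive version check at startup at least turns the silent crash into a readable error. A sketch follows; the wrapper name is hypothetical, and pkwrap.kaldi.InstantiateKaldiCuda() is the call mentioned above.

import torch

def instantiate_kaldi_cuda_checked():
    # Hypothetical wrapper: refuse to initialise Kaldi's CUDA backend on
    # PyTorch 1.5.x, where the exit-time segfault discussed above occurs.
    major, minor = (int(p) for p in torch.__version__.split(".")[:2])
    if (major, minor) == (1, 5):
        raise RuntimeError(
            "PyTorch 1.5.x segfaults at interpreter exit with this recipe; "
            "please use 1.4 or >= 1.6."
        )
    import pkwrap
    pkwrap.kaldi.InstantiateKaldiCuda()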
I've upgraded PyTorch to 1.6 and now everything works fine! Thank you.
Hello, when I run the mini_librispeech recipe, an error occurs during training; it looks like the backward pass fails. Below is the training log. Can you tell what the problem is?
# local/chain/tuning/model/1d.py --dir exp/chain/tdnn_sp --mode training --lr 0.002 --frame-shift 0 --egs ark:exp/chain/tdnn_sp/egs/cegs.1.ark --l2-regularize-factor 0.5 --minibatch-size 128,64 --new-model exp/chain/tdnn_sp/0.1.pt exp/chain/tdnn_sp/0.pt
# Started at Thu Nov 26 19:32:21 CST 2020
#
LOG ([5.5]:SelectGpuId():cu-device.cc:223) CUDA setup operating under Compute Exclusive Mode.
LOG ([5.5]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [1]: TITAN V free:11517M, used:518M, total:12036M, free/total:0.956892 version 7.0
LOG ([5.5]:PrintSpecificStats():nnet-example-utils.cc:1159) Merged specific eg types as follows [format:={->,->.../d=},={...},... (note,egs-size == number of input frames including context).
LOG ([5.5]:PrintSpecificStats():nnet-example-utils.cc:1189) 169={128->60,d=46}
LOG ([5.5]:PrintAggregateStats():nnet-example-utils.cc:1155) Processed 7726 egs of avg. size 169 into 60 minibatches, discarding 0.5954% of egs. Avg minibatch size was 128, #distinct types of egs/minibatches was 1/1
WARNING ([5.5]:Initialize():cu-device.cc:104) For multi-threaded code that might use GPU, you should call CuDevice::Instantiate().AllowMultithreading() at the start of the program.
Loaded base model from exp/chain/tdnn_sp/0.pt
objf=-1.0645081996917725, l2=-0.0, xent_objf=-7.749322414398193
Traceback (most recent call last):
  File "local/chain/tuning/model/1d.py", line 191, in <module>
    frame_shift=args.frame_shift)
  File "local/chain/tuning/model/1d.py", line 88, in train_lfmmi_one_iter
    deriv.backward()
  File "/search/odin/jiangyongjun/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/search/odin/jiangyongjun/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
I launch the script with: python3 local/chain/tuning/run_tdnn.py --model-file local/chain/tuning/model/1d.py
My system info: PyTorch 1.5 + CUDA 9.2, NVIDIA TITAN V, gcc 5.3.1.
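For context on the RuntimeError at the end of the log: PyTorch raises exactly this message whenever .view() is asked to reinterpret a tensor whose strides make a copy-free view impossible. A minimal standalone reproduction, unrelated to pkwrap itself:

import torch

x = torch.randn(4, 6)
t = x.t()                    # transpose: same storage, non-contiguous strides

print(t.reshape(-1).shape)   # fine: reshape copies when a plain view is impossible
try:
    t.view(-1)               # raises "view size is not compatible with input tensor's size and stride ..."
except RuntimeError as e:
    print(e)

# The usual fixes are t.contiguous().view(-1) or t.reshape(-1); which one is
# appropriate here depends on where the offending .view() lives in the recipe code.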