k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

LF-MMI GPU OOM #196

Open wwxm0523 opened 2 years ago

wwxm0523 commented 2 years ago

I hit a GPU OOM error when training with LF-MMI. My token set size is about 1300. How can I avoid this problem?

csukuangfj commented 2 years ago

What's your training command? What's the value of --max-duration?

danpovey commented 2 years ago

It would be helpful to see the traceback from when it dies.

wwxm0523 commented 2 years ago

This is the error log (when the number of phones is 220, training runs normally):

```
2022-01-30 05:34:59,582 INFO Loading L.fst
INFO from MMI module:
device: cuda
use pruned_intersect: True
use segment info: True
self.lo Sequential(
  (0): Dropout(p=0.1, inplace=False)
  (1): Linear(in_features=256, out_features=1253, bias=True)
)
number of phones 1252
2022-01-30 05:35:05,540 INFO Epoch 0 TRAIN info lr 4e-08
2022-01-30 05:35:05,542 INFO using accumulate grad, new batch size is 4 timeslarger than before
2022-01-30 05:35:06,842 DEBUG TRAIN Batch 0/15013 loss 247.649350 loss_att 77.322586 loss_mmi 110.531494 lr 0.00000004 rank 0
2022-01-30 05:36:13,933 DEBUG TRAIN Batch 100/15013 loss 338.543274 loss_att 106.091759 loss_mmi 123.042969 lr 0.00000104 rank 0
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
  what():  CUDA out of memory. Tried to allocate 1.73 GiB (GPU 0; 23.70 GiB total capacity; 19.65 GiB already allocated; 1.06 GiB free; 21.29 GiB reserved in total by PyTorch)
Exception raised from malloc at /opt/conda/conda-bld/pytorch_1616554793803/work/c10/cuda/CUDACachingAllocator.cpp:288 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f71382b72f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1bc21 (0x7f7138516c21 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1c944 (0x7f7138517944 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1cf63 (0x7f7138517f63 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::Allocator::raw_allocate(unsigned long) + 0x2f (0x7f709044afaf in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #5: k2::PytorchCudaContext::Allocate(unsigned long, void**) + 0x5f (0x7f709044b65f in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #6: k2::NewRegion(std::shared_ptr<k2::Context>, unsigned long) + 0x175 (0x7f709016b015 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #7: k2::Renumbering::ComputeOld2New() + 0x96 (0x7f70901288f6 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #8: k2::Renumbering::Old2New(bool) + 0xc8 (0x7f70902b5b78 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #9: k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int) + 0x907 (0x7f70902c7547 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #10: std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 0x26e (0x7f70902ca58e in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #11: k2::ThreadPool::ProcessTasks() + 0x16d (0x7f709041027d in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #12: + 0xc9039 (0x7f719dc49039 in /opt/conda/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #13: + 0x76db (0x7f71c00216db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #14: clone + 0x3f (0x7f71bfd4a71f in /lib/x86_64-linux-gnu/libc.so.6)

Killing subprocess 3803024
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
```

danpovey commented 2 years ago

Hm, there should be a max_arcs option to MultiGraphDenseIntersectPruned() [I forget the python-level wrapper, probably intersect_dense_pruned()]. Setting that to, e.g. 1000, may resolve the issue. Early in training you can get too many arcs active, and if you are using the "normal" topology (not modified topology), the LF-MMI denominator graph size is quadratic in the number of symbols.
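
Below is a minimal, hedged sketch (not the icefall training code itself) of where these knobs live. The variable names, beam values, and the fake network output are illustrative assumptions; the arc-level limit (`max_arcs`/`max_active_arcs`) may only exist in newer k2 releases, so the call here uses the long-standing `max_active_states` argument instead.

```python
import torch
import k2

# A small vocabulary keeps the demo fast; the questioner's has ~1252 tokens.
num_tokens = 100

# The "normal" CTC/HMM topology has O(num_tokens^2) arcs, while the modified
# topology grows roughly linearly -- this is the quadratic blow-up mentioned
# in the comment above.
normal_topo = k2.ctc_topo(num_tokens, modified=False)
modified_topo = k2.ctc_topo(num_tokens, modified=True)
print("normal topo arcs:  ", normal_topo.num_arcs)
print("modified topo arcs:", modified_topo.num_arcs)

# Fake network output just to make the example self-contained:
# shape (batch=1, frames=50, classes=num_tokens + 1 for blank).
log_probs = torch.randn(1, 50, num_tokens + 1).log_softmax(dim=-1)
supervision_segments = torch.tensor([[0, 0, 50]], dtype=torch.int32)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)

# Pruned intersection, as used for the MMI denominator.  Tightening
# search_beam and max_active_states bounds how large the lattice (and hence
# GPU memory) can grow; pass an arc limit here instead if your k2 exposes it.
lats = k2.intersect_dense_pruned(
    k2.arc_sort(normal_topo),
    dense_fsa_vec,
    search_beam=20.0,
    output_beam=8.0,
    min_active_states=30,
    max_active_states=10000,
)
print("pruned lattice arcs:", lats.num_arcs)
```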

wwxm0523 commented 2 years ago

Thanks. Is it max_active_states? Will lowering this parameter lead to poor training accuracy?

danpovey commented 2 years ago

It's better to set max_active_arcs. It may only be present in newer versions of k2. max_active_states is a bit less precise because some states can have many arcs leaving them.
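
Since the availability of an arc-level limit depends on the installed k2 version, a recipe can probe for it at run time. The following is my own hedged sketch, not icefall code; the parameter names `max_active_arcs`/`max_arcs` are taken from the discussion above and may differ from, or be absent in, your k2 build, and the numeric limits are placeholders.

```python
import inspect
import k2

# Which pruning arguments does this k2 build's Python wrapper accept?
params = inspect.signature(k2.intersect_dense_pruned).parameters
arc_limit = next(
    (name for name in ("max_active_arcs", "max_arcs") if name in params), None
)

pruning_kwargs = dict(
    search_beam=20.0,
    output_beam=8.0,
    min_active_states=30,
    max_active_states=10000,
)
if arc_limit is not None:
    # Cap the number of active arcs per frame to bound GPU memory directly.
    pruning_kwargs[arc_limit] = 100_000
else:
    # Older k2: only state-level pruning is available, so fall back to a
    # tighter max_active_states (less precise, since a single state can
    # have many outgoing arcs).
    pruning_kwargs["max_active_states"] = 5_000

print("pruning kwargs:", pruning_kwargs)
# lats = k2.intersect_dense_pruned(den_graph, dense_fsa_vec, **pruning_kwargs)
```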
