CUDA out of memory #216

k2-fsa / snowfall

Moved to https://github.com/k2-fsa/icefall

Apache License 2.0



You can mess with the minibatch size, which might help. But finding the source is a good idea too. Are you using an alignment model? (If not, the posteriors at the start can be very flat, which can cause too many states to stay within the pruning beam). What is the size of the phone set?
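To make the point about flat posteriors concrete, here is a toy sketch of per-frame beam pruning. It is not k2's actual pruning code; `count_within_beam`, the label count of 278 (the line count of phones.txt reported further down this thread) and the beam of 5.0 are all assumptions chosen for illustration.

```python
import math

def count_within_beam(log_probs, beam):
    """Count labels whose log-probability is within `beam` of the best one."""
    best = max(log_probs)
    return sum(1 for lp in log_probs if lp >= best - beam)

NUM_LABELS = 278  # assumption: taken from the phones.txt line count in this thread
BEAM = 5.0        # assumption: an arbitrary beam width in log-probability units

# Untrained model: every label is equally likely on this frame.
flat = [math.log(1.0 / NUM_LABELS)] * NUM_LABELS

# Better-trained model: one label takes 0.9, the rest share the remaining 0.1.
peaked = [math.log(0.9)] + [math.log(0.1 / (NUM_LABELS - 1))] * (NUM_LABELS - 1)

flat_active = count_within_beam(flat, BEAM)
peaked_active = count_within_beam(peaked, BEAM)
print(flat_active, peaked_active)  # prints: 278 1
```

With a flat distribution nothing is pruned on any frame, so the number of surviving states (and the memory for the lattice) can grow much faster than it does once the model has learned sharper posteriors.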

On Mon, Jun 21, 2021 at 11:39 AM shanguanma wrote:

I am trying to reproduce the results of your Librispeech recipe with the new snowfall and k2-fsa (0.3.5). I used the script below:

$cuda_cmd log/stage6_train.log \
  CUDA_VISIBLE_DEVICES="4" python3 ./mmi_att_transformer_train.py \
    --world-size 1 \
    --full-libri false \
    --use-ali-model false \
    --num-workers-train 1 \
    --num-workers-valid 1

$decode_cmd log/stage7_decode.log \
  CUDA_VISIBLE_DEVICES="4" python3 ./mmi_att_transformer_decode.py

The results are as follows:

2021-06-15 11:40:13,293 INFO [common.py:398] [test-clean] %WER 5.78% [3037 / 52576, 571 ins, 181 del, 2285 sub ]
2021-06-15 11:49:09,503 INFO [common.py:398] [test-other] %WER 15.14% [7925 / 52343, 1258 ins, 542 del, 6125 sub ]

The environment is summarized as follows:

[md510@node02 simple_v1]$ python3 -m k2.version
Collecting environment information...
k2 version: 0.3.5
Build type: Release
Git SHA1: 81ad3a580361e20b828d5eb1120999ecd0d7c675
Git date: Sat Jun 5 11:36:50 2021
Cuda used to build k2: 10.2
cuDNN used to build k2: 8.0.2
Python version used to build k2: 3.8
OS used to build k2: Ubuntu 16.04.7 LTS
CMake version: 3.18.4
GCC version: 5.5.0
CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.8.1
PyTorch is using Cuda: 10.2
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False

Now I am using another corpus (SEAME). When training the acoustic model, the program keeps failing with "CUDA out of memory". Note: the GPUs are RTX 8000 (48 GB per GPU). The command I run is as follows:

$cuda_cmd log/stage5_train.log \
  CUDA_VISIBLE_DEVICES="2,3,4" python3 ./mmi_att_transformer_train_seame.py \
    --world-size 3 \
    --use-ali-model false \
    --num-workers-train 1 \
    --num-workers-valid 1

The error log is as follows:

# CUDA_VISIBLE_DEVICES=2,3,4 python3 ./mmi_att_transformer_train_seame.py --world-size 3 --use-ali-model false --num-workers-train 1 --num-workers-valid 1
# Invoked at Mon Jun 21 11:13:10 SGT 2021 from node03
#
# Started at Mon Jun 21 11:14:08 +08 2021 on node02
Traceback (most recent call last):
  File "./mmi_att_transformer_train_seame.py", line 724, in <module>
    main()
  File "./mmi_att_transformer_train_seame.py", line 717, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home3/md510/w2020/k2_fsa_2021/snowfall/egs/seame/asr/simple_v1/mmi_att_transformer_train_seame.py", line 630, in run
    objf, valid_objf, global_batch_idx_train = train_one_epoch(
  File "/home3/md510/w2020/k2_fsa_2021/snowfall/egs/seame/asr/simple_v1/mmi_att_transformer_train_seame.py", line 257, in train_one_epoch
    curr_batch_objf, curr_batch_frames, curr_batch_all_frames = get_objf(
  File "/home3/md510/w2020/k2_fsa_2021/snowfall/egs/seame/asr/simple_v1/mmi_att_transformer_train_seame.py", line 113, in get_objf
    mmi_loss, tot_frames, all_frames = loss_fn(nnet_output, texts, supervision_segments)
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home3/md510/w2020/k2_fsa_2021/snowfall/snowfall/objectives/mmi.py", line 222, in forward
    return func(nnet_output=nnet_output,
  File "/home3/md510/w2020/k2_fsa_2021/snowfall/snowfall/objectives/mmi.py", line 97, in _compute_mmi_loss_exact_optimized
    num_den_tot_scores = num_den_lats.get_tot_scores(log_semiring=True,
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/k2/fsa.py", line 644, in get_tot_scores
    tot_scores = k2.autograd._GetTotScoresFunction.apply(
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/k2/autograd.py", line 49, in forward
    tot_scores = fsas._get_tot_scores(use_double_scores=use_double_scores,
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/k2/fsa.py", line 623, in _get_tot_scores
    forward_scores = self._get_forward_scores(use_double_scores,
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/k2/fsa.py", line 573, in _get_forward_scores
    entering_arc_batches=self._get_entering_arc_batches(),
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/k2/fsa.py", line 513, in _get_entering_arc_batches
    incoming_arcs=self._get_incoming_arcs(),
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/k2/fsa.py", line 499, in _get_incoming_arcs
    cache[name] = _k2.get_incoming_arcs(self.arcs,
RuntimeError: CUDA out of memory. Tried to allocate 17179869182.18 GiB (GPU 0; 44.49 GiB total capacity; 31.00 GiB already allocated; 7.62 GiB free; 35.77 GiB reserved in total by PyTorch)

Exception raised from malloc at /opt/conda/conda-bld/pytorch_1616554788289/work/c10/cuda/CUDACachingAllocator.cpp:288 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2aab147e12f2 in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1bc21 (0x2aab1457dc21 in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1c944 (0x2aab1457e944 in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1cf63 (0x2aab1457ef63 in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: k2::PytorchCudaContext::Allocate(unsigned long, void**) + 0x5e (0x2aab2fe7aade in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/libk2context.so)
frame #5: k2::NewRegion(std::shared_ptr<k2::Context>, unsigned long) + 0x11e (0x2aab2fbd876e in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/libk2context.so)
frame #6: <unknown function> + 0x23a61d (0x2aab2fd4661d in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/libk2context.so)
frame #7: k2::GetTransposeReordering(k2::Ragged<int>&, int) + 0x2ff (0x2aab2fd641ff in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/libk2context.so)
frame #8: k2::GetIncomingArcs(k2::Ragged<k2::Arc>&, k2::Array1<int> const&) + 0x11a (0x2aab2fc4407a in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/libk2context.so)
frame #9: <unknown function> + 0x444ed (0x2aab2eb634ed in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x1bd5f (0x2aab2eb3ad5f in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)

frame #11: PyCFunction_Call + 0x54 (0x55555567fdf4 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #12: _PyObject_MakeTpCall + 0x31e (0x55555568ef2e in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #13: _PyEval_EvalFrameDefault + 0x534b (0x555555728f6b in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #14: <unknown function> + 0x1b1e86 (0x555555705e86 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #15: _PyEval_EvalFrameDefault + 0x4ca3 (0x5555557288c3 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #16: <unknown function> + 0x1b1e86 (0x555555705e86 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x4ca3 (0x5555557288c3 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #18: <unknown function> + 0x1b1e86 (0x555555705e86 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x4ca3 (0x5555557288c3 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #20: _PyEval_EvalCodeWithName + 0x2c3 (0x555555704503 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #21: <unknown function> + 0x1b2007 (0x555555706007 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x1782 (0x5555557253a2 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #23: _PyFunction_Vectorcall + 0x1a6 (0x555555705706 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #24: PyObject_CallObject + 0x53 (0x55555570dd93 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #25: THPFunction_apply(_object*, _object*) + 0x8fd (0x2aaac76a83fd in /home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #26: PyCFunction_Call + 0xf9 (0x55555567fe99 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #27: _PyObject_MakeTpCall + 0x31e (0x55555568ef2e in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x534b (0x555555728f6b in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #29: _PyEval_EvalCodeWithName + 0x2c3 (0x555555704503 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #30: <unknown function> + 0x1b2007 (0x555555706007 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #31: _PyEval_EvalFrameDefault + 0x1782 (0x5555557253a2 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #32: _PyEval_EvalCodeWithName + 0x2c3 (0x555555704503 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #33: _PyFunction_Vectorcall + 0x378 (0x5555557058d8 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #34: _PyEval_EvalFrameDefault + 0x1782 (0x5555557253a2 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #35: _PyFunction_Vectorcall + 0x1a6 (0x555555705706 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #36: <unknown function> + 0x1b1f91 (0x555555705f91 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #37: PyObject_Call + 0x5e (0x5555556790be in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #38: _PyEval_EvalFrameDefault + 0x21c1 (0x555555725de1 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #39: _PyEval_EvalCodeWithName + 0x2c3 (0x555555704503 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #40: _PyObject_FastCallDict + 0x2c1 (0x555555673df1 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #41: _PyObject_Call_Prepend + 0x63 (0x55555567e983 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #42: <unknown function> + 0x181b99 (0x5555556d5b99 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #43: _PyObject_MakeTpCall + 0x31e (0x55555568ef2e in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #44: _PyEval_EvalFrameDefault + 0x4f2e (0x555555728b4e in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #45: _PyEval_EvalCodeWithName + 0x2c3 (0x555555704503 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #46: _PyFunction_Vectorcall + 0x378 (0x5555557058d8 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #47: _PyEval_EvalFrameDefault + 0x1782 (0x5555557253a2 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #48: _PyEval_EvalCodeWithName + 0x2c3 (0x555555704503 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #49: _PyFunction_Vectorcall + 0x378 (0x5555557058d8 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #50: _PyEval_EvalFrameDefault + 0x1782 (0x5555557253a2 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #51: _PyFunction_Vectorcall + 0x1a6 (0x555555705706 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #52: PyObject_Call + 0x5e (0x5555556790be in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #53: _PyEval_EvalFrameDefault + 0x21c1 (0x555555725de1 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #54: _PyFunction_Vectorcall + 0x1a6 (0x555555705706 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #55: PyObject_Call + 0x5e (0x5555556790be in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #56: _PyEval_EvalFrameDefault + 0x21c1 (0x555555725de1 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #57: _PyFunction_Vectorcall + 0x1a6 (0x555555705706 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #58: _PyEval_EvalFrameDefault + 0xa4b (0x55555572466b in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #59: _PyEval_EvalCodeWithName + 0x2c3 (0x555555704503 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #60: _PyFunction_Vectorcall + 0x378 (0x5555557058d8 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #61: _PyEval_EvalFrameDefault + 0xa4b (0x55555572466b in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #62: _PyFunction_Vectorcall + 0x1a6 (0x555555705706 in /home3/md510/anaconda3/envs/foo_k2/bin/python3)
frame #63: _PyEval_EvalFrameDefault + 0x92f (0x55555572454f in /home3/md510/anaconda3/envs/foo_k2/bin/python3)

# Ended (code 256) at Mon Jun 21 11:19:00 SGT 2021, elapsed time 350 seconds

I don't know what is going wrong. Thanks a lot.
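A side note on the number in the error message: 17179869182.18 GiB is almost exactly 2^64 bytes (17179869184 GiB), which suggests the requested size was a small negative number reinterpreted as an unsigned 64-bit value rather than a genuine request. The sketch below works through that arithmetic. The last step, treating the value as a 32-bit signed integer that wrapped once, is only a hypothesis; nothing in the log confirms which variable (if any) overflowed.

```python
from fractions import Fraction

GIB = 2**30

# Size printed in the error message, kept exact with Fraction.
reported_gib = Fraction("17179869182.18")
reported_bytes = reported_gib * GIB

# 2**64 bytes is exactly 2**34 GiB = 17179869184 GiB.
shortfall_gib = (2**64 - reported_bytes) / GIB

# Read as a signed 64-bit integer, the request would be about -1.82 GiB.
signed_bytes = reported_bytes - 2**64

# Hypothesis only: a 32-bit signed size that wrapped around once.
unwrapped_gib = (signed_bytes + 2**32) / GIB

print(float(shortfall_gib))                     # prints: 1.82
print(float(unwrapped_gib))                     # prints: 2.18
print(unwrapped_gib * GIB > 2**31 - 1)          # prints: True
```

Under that hypothesis the real request would have been about 2.18 GiB, just over what a signed 32-bit integer can hold, which fits a single very large batch rather than the GPU being genuinely full.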


[md510@node02 simple_v1]$ ls data/lang_nosp/L.fst.txt -larth
-rw-r--r-- 1 md510 users 5.9M Jun 15 18:10 data/lang_nosp/L.fst.txt
[md510@node02 simple_v1]$ wc -l data/lang_nosp/phones.txt
278 data/lang_nosp/phones.txt

# Started at Mon Jun 21 13:53:50 +08 2021 on node02
feature shape is torch.Size([34, 1974, 80])
feature shape is torch.Size([37, 1819, 80])
feature shape is torch.Size([14, 4630, 80])
feature shape is torch.Size([38, 1710, 80])
feature shape is torch.Size([18, 3467, 80])
feature shape is torch.Size([32, 2076, 80])
feature shape is torch.Size([22, 2972, 80])
feature shape is torch.Size([33, 2036, 80])
feature shape is torch.Size([35, 1882, 80])
feature shape is torch.Size([43, 1571, 80])
feature shape is torch.Size([32, 2044, 80])
feature shape is torch.Size([28, 2346, 80])
feature shape is torch.Size([38, 1714, 80])
feature shape is torch.Size([25, 2617, 80])
feature shape is torch.Size([21, 3069, 80])
feature shape is torch.Size([17, 3971, 80])
feature shape is torch.Size([29, 2269, 80])
feature shape is torch.Size([31, 2123, 80])
feature shape is torch.Size([29, 2213, 80])
feature shape is torch.Size([36, 1844, 80])
feature shape is torch.Size([36, 1816, 80])
feature shape is torch.Size([32, 2086, 80])
feature shape is torch.Size([33, 1948, 80])
feature shape is torch.Size([33, 1975, 80])
feature shape is torch.Size([20, 3234, 80])
feature shape is torch.Size([41, 1628, 80])
feature shape is torch.Size([25, 2655, 80])
feature shape is torch.Size([40, 1636, 80])
feature shape is torch.Size([22, 2877, 80])
feature shape is torch.Size([26, 2491, 80])
feature shape is torch.Size([40, 1658, 80])
feature shape is torch.Size([26, 2504, 80])
feature shape is torch.Size([24, 2664, 80])
feature shape is torch.Size([43, 1552, 80])
feature shape is torch.Size([29, 2275, 80])
feature shape is torch.Size([24, 2755, 80])
feature shape is torch.Size([39, 1644, 80])
feature shape is torch.Size([21, 3057, 80])
feature shape is torch.Size([31, 2100, 80])
feature shape is torch.Size([40, 1645, 80])
feature shape is torch.Size([30, 2255, 80])
feature shape is torch.Size([37, 1780, 80])
feature shape is torch.Size([22, 2921, 80])
feature shape is torch.Size([39, 1701, 80])
feature shape is torch.Size([33, 1969, 80])
feature shape is torch.Size([33, 1988, 80])
feature shape is torch.Size([32, 2089, 80])
feature shape is torch.Size([53, 1260, 80])
feature shape is torch.Size([32, 2084, 80])
feature shape is torch.Size([38, 1712, 80])
feature shape is torch.Size([28, 2370, 80])
feature shape is torch.Size([23, 2870, 80])
feature shape is torch.Size([50, 1377, 80])
feature shape is torch.Size([31, 2108, 80])
feature shape is torch.Size([25, 2652, 80])
feature shape is torch.Size([50, 1294, 80])
feature shape is torch.Size([48, 1414, 80])
feature shape is torch.Size([28, 2331, 80])
feature shape is torch.Size([38, 1817, 80])
feature shape is torch.Size([23, 2784, 80])
feature shape is torch.Size([40, 1621, 80])
feature shape is torch.Size([40, 1695, 80])
feature shape is torch.Size([36, 1914, 80])
feature shape is torch.Size([39, 1649, 80])
feature shape is torch.Size([39, 1671, 80])
feature shape is torch.Size([39, 1741, 80])
feature shape is torch.Size([35, 1895, 80])
feature shape is torch.Size([40, 1591, 80])
feature shape is torch.Size([39, 1661, 80])
feature shape is torch.Size([34, 1859, 80])
feature shape is torch.Size([34, 1960, 80])
feature shape is torch.Size([41, 1599, 80])
feature shape is torch.Size([37, 1875, 80])
feature shape is torch.Size([40, 1659, 80])
feature shape is torch.Size([34, 1925, 80])
feature shape is torch.Size([43, 1533, 80])
feature shape is torch.Size([37, 1831, 80])
feature shape is torch.Size([27, 2454, 80])

in this librispeech , feature shape is torch.Size([34, 1771, 80])
in this librispeech , feature shape is torch.Size([35, 1779, 80])
in this librispeech , feature shape is torch.Size([35, 1698, 80])
in this librispeech , feature shape is torch.Size([34, 1736, 80])
in this librispeech , feature shape is torch.Size([33, 1807, 80])
in this librispeech , feature shape is torch.Size([35, 1754, 80])
in this librispeech , feature shape is torch.Size([34, 1798, 80])
in this librispeech , feature shape is torch.Size([34, 1791, 80])
in this librispeech , feature shape is torch.Size([36, 1748, 80])
in this librispeech , feature shape is torch.Size([33, 1771, 80])
in this librispeech , feature shape is torch.Size([35, 1757, 80])
in this librispeech , feature shape is torch.Size([35, 1771, 80])
in this librispeech , feature shape is torch.Size([32, 1828, 80])
in this librispeech , feature shape is torch.Size([34, 1723, 80])
in this librispeech , feature shape is torch.Size([35, 1744, 80])
in this librispeech , feature shape is torch.Size([33, 1847, 80])
in this librispeech , feature shape is torch.Size([35, 1847, 80])
in this librispeech , feature shape is torch.Size([34, 1722, 80])
in this librispeech , feature shape is torch.Size([33, 1866, 80])
in this librispeech , feature shape is torch.Size([35, 1672, 80])
in this librispeech , feature shape is torch.Size([33, 1808, 80])
in this librispeech , feature shape is torch.Size([33, 1805, 80])
in this librispeech , feature shape is torch.Size([33, 1817, 80])
in this librispeech , feature shape is torch.Size([36, 1662, 80])
in this librispeech , feature shape is torch.Size([35, 1724, 80])
in this librispeech , feature shape is torch.Size([33, 1727, 80])
in this librispeech , feature shape is torch.Size([35, 1797, 80])
in this librispeech , feature shape is torch.Size([32, 1876, 80])
in this librispeech , feature shape is torch.Size([34, 1731, 80])
in this librispeech , feature shape is torch.Size([34, 1839, 80])
in this librispeech , feature shape is torch.Size([31, 1812, 80])
in this librispeech , feature shape is torch.Size([21, 2857, 80])
in this librispeech , feature shape is torch.Size([19, 3265, 80])
in this librispeech , feature shape is torch.Size([28, 2162, 80])
in this librispeech , feature shape is torch.Size([22, 2712, 80])
in this librispeech , feature shape is torch.Size([22, 2950, 80])
in this librispeech , feature shape is torch.Size([30, 2082, 80])
in this librispeech , feature shape is torch.Size([31, 2022, 80])
in this librispeech , feature shape is torch.Size([24, 2462, 80])
in this librispeech , feature shape is torch.Size([28, 2088, 80])
in this librispeech , feature shape is torch.Size([37, 1770, 80])
in this librispeech , feature shape is torch.Size([27, 2299, 80])
in this librispeech , feature shape is torch.Size([29, 2118, 80])
in this librispeech , feature shape is torch.Size([18, 1450, 80])
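The two shape dumps are easier to compare once reduced to a couple of numbers. The sketch below is one possible way to do that: it parses `torch.Size([batch, frames, dim])` lines and reports the longest padded length and the largest batch × frames product. The helper name `summarize` is hypothetical, and the sample strings are only a three-line excerpt from each log above (the first log is presumably the SEAME run), so paste in the full logs for real figures.

```python
import re

# Matches the shape printed by the training script, e.g. torch.Size([34, 1974, 80]).
SHAPE = re.compile(r"torch\.Size\(\[(\d+), (\d+), (\d+)\]\)")

def summarize(log_text):
    """Return batch count, longest padded length, and largest batch*frames."""
    shapes = [tuple(int(g) for g in m.groups()) for m in SHAPE.finditer(log_text)]
    return {
        "batches": len(shapes),
        "max_frames": max(frames for _, frames, _ in shapes),
        "max_batch_x_frames": max(batch * frames for batch, frames, _ in shapes),
    }

# Excerpts copied from the logs above (not the full lists).
seame_excerpt = """
feature shape is torch.Size([34, 1974, 80])
feature shape is torch.Size([14, 4630, 80])
feature shape is torch.Size([53, 1260, 80])
"""
libri_excerpt = """
in this librispeech , feature shape is torch.Size([34, 1771, 80])
in this librispeech , feature shape is torch.Size([19, 3265, 80])
in this librispeech , feature shape is torch.Size([18, 1450, 80])
"""

seame_stats = summarize(seame_excerpt)
libri_stats = summarize(libri_excerpt)
print(seame_stats)
print(libri_stats)
```

Even on these excerpts the first log shows longer padded sequences (4630 frames against 3265) and a larger batch × frames product (67116 against 62035), which points at long utterances as one thing to check when tuning the minibatch size.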
