k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

egs/librispeech Conformer MMI training error: MemoryError: std::bad_alloc #960

Open · naxingyu opened this issue 1 year ago

naxingyu commented 1 year ago

Environment:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   44C    P0    45W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

k2 version: 1.23.4
Build type: Release
Git SHA1: 0d7ef1a7867f70354ab5c59f2feb98c45558dcc7
Git date: Sat Mar 18 12:59:04 2023
Cuda used to build k2: 11.7
cuDNN used to build k2: 8.2.0
Python version used to build k2: 3.7
OS used to build k2: Ubuntu 18.04.6 LTS
CMake version: 3.25.2
GCC version: 7.5.0
CMAKE_CUDA_FLAGS:  -Wno-deprecated-gpu-targets   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_35,code=sm_35  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_50,code=sm_50  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_60,code=sm_60  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_61,code=sm_61  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_70,code=sm_70  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_75,code=sm_75  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_80,code=sm_80  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 1.13.1+cu117
PyTorch is using Cuda: 11.7
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /miniconda/lib/python3.7/site-packages/k2/version/version.py
_k2.__file__: /miniconda/lib/python3.7/site-packages/_k2.cpython-37m-x86_64-linux-gnu.so

Command:

./conformer_mmi/train.py

Error:

2023-03-22 16:54:07,128 INFO [train.py:584] Epoch 0, batch 10400, batch avg mmi loss 0.3586, batch avg att loss 0.0000, batch avg loss 0.3586, total avg mmiloss: 0.3798, total avg att loss: 0.0000, total avg loss: 0.3798, batch size: 16
[I] /home/runner/work/k2/k2/k2/csrc/intersect_dense.cu:314:k2::FsaVec k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*) Num-arcs 2275827199 exceeds limit 2147483600, decreasing beam from 6.000000 to 4.500000
Traceback (most recent call last):
  File "./conformer_mmi/train.py", line 861, in <module>
    main()
  File "./conformer_mmi/train.py", line 854, in main
    run(rank=0, world_size=1, args=args)
  File "./conformer_mmi/train.py", line 826, in run
    world_size=world_size,
  File "./conformer_mmi/train.py", line 555, in train_one_epoch
    ali=train_ali,
  File "./conformer_mmi/train.py", line 408, in compute_loss
    mmi_loss = loss_fn(dense_fsa_vec=dense_fsa_vec, texts=texts)
  File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/task_runtime/icefall/icefall/mmi.py", line 220, in forward
    beam_size=self.beam_size,
  File "/mnt/task_runtime/icefall/icefall/mmi.py", line 119, in _compute_mmi_loss_exact_non_optimized
    den_graphs, dense_fsa_vec, output_beam=beam_size, max_arcs=2147483600
  File "/miniconda/lib/python3.7/site-packages/k2/autograd.py", line 808, in intersect_dense
    seqframe_idx_name, frame_idx_name)
  File "/miniconda/lib/python3.7/site-packages/k2/autograd.py", line 568, in forward
    max_arcs=max_arcs)
MemoryError: std::bad_alloc
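
For context, the call that fails (icefall/mmi.py line 119 in the traceback) is the exact, non-pruned intersection of the denominator graphs with the network output. max_arcs=2147483600 caps the lattice just below INT32_MAX (2147483647), since k2 indexes arcs with int32; when the cap is exceeded, k2 lowers the output beam and retries, as the "Num-arcs ... exceeds limit" line shows, and the std::bad_alloc presumably comes from an allocation for the still very large lattice. A minimal, purely illustrative sketch of that call with a toy graph (shapes and graph are not the recipe's):

import torch
import k2

# Toy denominator graph: a CTC topology over a 3-symbol vocab (blank + 2 tokens).
den = k2.arc_sort(k2.ctc_topo(max_token=2))
den_graphs = k2.create_fsa_vec([den])

# Fake network output: 1 utterance, 5 frames, 3 output symbols.
log_probs = torch.randn(1, 5, 3).log_softmax(dim=-1)
supervision_segments = torch.tensor([[0, 0, 5]], dtype=torch.int32)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)

# Exact intersection, as in _compute_mmi_loss_exact_non_optimized.
lats = k2.intersect_dense(den_graphs, dense_fsa_vec, output_beam=6.0,
                          max_arcs=2147483600)
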
csukuangfj commented 1 year ago

It is probably caused by a very large vocab size, e.g., several thousand.

You may need to set https://github.com/k2-fsa/icefall/blob/d74822d07b803f552602e727ebf099f406c74786/egs/aishell/ASR/conformer_mmi/train.py#L220 to true.
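
Roughly, that option switches the denominator computation from k2.intersect_dense to k2.intersect_dense_pruned, which bounds the number of active states per frame so the lattice cannot grow past the 32-bit arc limit. A minimal sketch with the same toy setup as above; the beam and active-state values are illustrative, not the recipe's defaults:

import torch
import k2

# Same toy denominator graph and fake network output as in the sketch above.
den = k2.arc_sort(k2.ctc_topo(max_token=2))
den_graphs = k2.create_fsa_vec([den])
log_probs = torch.randn(1, 5, 3).log_softmax(dim=-1)
supervision_segments = torch.tensor([[0, 0, 5]], dtype=torch.int32)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)

# Pruned intersection keeps at most max_active_states states per frame.
lats = k2.intersect_dense_pruned(den_graphs, dense_fsa_vec,
                                 search_beam=20.0, output_beam=6.0,
                                 min_active_states=30,
                                 max_active_states=10000)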


@yaozengwei may have more to say about our latest MMI training recipe https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer_mmi

naxingyu commented 1 year ago

The vocab size is actually not large.

# wc -l data/lang_bpe_500/tokens.txt 
502 data/lang_bpe_500/tokens.txt

But I'll try the option you suggest. @csukuangfj

naxingyu commented 1 year ago

After running ./conformer_mmi/train.py --use-pruned-intersect True, the training time doubled, and it crashed after only 2 log intervals:

2023-03-23 06:13:03,565 INFO [train.py:584] Epoch 0, batch 0, batch avg mmi loss 1.1368, batch avg att loss 0.0000, batch avg loss 1.1368, total avg mmiloss: 1.1368, total avg att loss: 0.0000, total avg loss: 1.1368, batch size: 11
2023-03-23 06:14:48,609 INFO [train.py:584] Epoch 0, batch 50, batch avg mmi loss 1.0247, batch avg att loss 0.0000, batch avg loss 1.0247, total avg mmiloss: 1.1330, total avg att loss: 0.0000, total avg loss: 1.1330, batch size: 13
terminate called after throwing an instance of 'c10::OutOfMemoryError'
  what():  CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 31.75 GiB total capacity; 24.68 GiB already allocated; 1.47 GiB free; 29.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:681 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f313dbbc457 in /miniconda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4442d (0x7f316a12742d in /miniconda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x45158 (0x7f316a128158 in /miniconda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x453c2 (0x7f316a1283c2 in /miniconda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: k2::PytorchCudaContext::Allocate(unsigned long, void**) + 0x62 (0x7f308efd8572 in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #5: k2::NewRegion(std::shared_ptr<k2::Context>, unsigned long) + 0x175 (0x7f308eca4aa5 in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #6: k2::Renumbering::ComputeOld2New() + 0xac (0x7f308ec559ac in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #7: k2::Renumbering::Old2New(bool) + 0xc8 (0x7f308eded448 in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #8: k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int) + 0x8ed (0x7f308ee143bd in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #9: std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect(std::shared_ptr<k2::DenseFsaVec>&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 0x26e (0x7f308ee16d4e in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #10: k2::ThreadPool::ProcessTasks() + 0x16d (0x7f308ef9b2dd in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #11: <unknown function> + 0xc819d (0x7f319829619d in /miniconda/lib/libstdc++.so.6)
frame #12: <unknown function> + 0x76db (0x7f31b763a6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #13: clone + 0x3f (0x7f31b7363a3f in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
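
For what it's worth, this second failure is an ordinary CUDA OOM inside the pruned intersection rather than the earlier 32-bit arc overflow. Reducing the batch size (the recipe's max-duration setting, if it is exposed as in most icefall recipes) is the usual first step; the fragmentation hint in the error message itself can also be tried by configuring the allocator before CUDA is first used, e.g. near the top of train.py. The value below is illustrative only:

import os

# Illustrative value; PyTorch reads this when its CUDA caching allocator is
# initialized, so set it before any CUDA tensor is allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"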