naxingyu opened this issue 1 year ago
It is probably caused by a very large vocab size, e.g., several thousand.
You may need to set use_pruned_intersect (defined at https://github.com/k2-fsa/icefall/blob/d74822d07b803f552602e727ebf099f406c74786/egs/aishell/ASR/conformer_mmi/train.py#L220) to True.
@yaozengwei may have more to say about our latest MMI training recipe https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer_mmi
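For reference, if I remember correctly that option switches the denominator-lattice computation from the exact k2.intersect_dense to the pruned k2.intersect_dense_pruned. A minimal sketch of the distinction is below; the function name, beam, and active-state values are illustrative, not the recipe's actual defaults.

```python
import k2


def compute_lattices(num_graphs: k2.Fsa,
                     den_graph: k2.Fsa,
                     dense_fsa_vec: k2.DenseFsaVec,
                     use_pruned_intersect: bool = False):
    """Illustrative only: how the flag plausibly changes the denominator
    intersection. Beam/active-state values are made-up examples."""
    if use_pruned_intersect:
        # Pruned intersection keeps only states inside the search beam,
        # trading exactness for (usually) lower memory use.
        den_lats = k2.intersect_dense_pruned(
            den_graph,
            dense_fsa_vec,
            search_beam=20.0,
            output_beam=8.0,
            min_active_states=30,
            max_active_states=10000,
        )
    else:
        # Exact dense intersection; memory grows with vocab size and
        # utterance length.
        den_lats = k2.intersect_dense(den_graph, dense_fsa_vec, output_beam=10.0)

    num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)
    return num_lats, den_lats
```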
The vocab size is actually not large.
# wc -l data/lang_bpe_500/tokens.txt
502 data/lang_bpe_500/tokens.txt
But I'll try the option you suggested. @csukuangfj
After running ./conformer_mmi/train.py --use-pruned-intersect True, the training time doubled, and it crashed after only 2 log intervals:
2023-03-23 06:13:03,565 INFO [train.py:584] Epoch 0, batch 0, batch avg mmi loss 1.1368, batch avg att loss 0.0000, batch avg loss 1.1368, total avg mmiloss: 1.1368, total avg att loss: 0.0000, total avg loss: 1.1368, batch size: 11
2023-03-23 06:14:48,609 INFO [train.py:584] Epoch 0, batch 50, batch avg mmi loss 1.0247, batch avg att loss 0.0000, batch avg loss 1.0247, total avg mmiloss: 1.1330, total avg att loss: 0.0000, total avg loss: 1.1330, batch size: 13
terminate called after throwing an instance of 'c10::OutOfMemoryError'
what(): CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 31.75 GiB total capacity; 24.68 GiB already allocated; 1.47 GiB free; 29.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:681 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f313dbbc457 in /miniconda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4442d (0x7f316a12742d in /miniconda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x45158 (0x7f316a128158 in /miniconda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x453c2 (0x7f316a1283c2 in /miniconda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: k2::PytorchCudaContext::Allocate(unsigned long, void**) + 0x62 (0x7f308efd8572 in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #5: k2::NewRegion(std::shared_ptr<k2::Context>, unsigned long) + 0x175 (0x7f308eca4aa5 in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #6: k2::Renumbering::ComputeOld2New() + 0xac (0x7f308ec559ac in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #7: k2::Renumbering::Old2New(bool) + 0xc8 (0x7f308eded448 in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #8: k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int) + 0x8ed (0x7f308ee143bd in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #9: std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect(std::shared_ptr<k2::DenseFsaVec>&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 0x26e (0x7f308ee16d4e in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #10: k2::ThreadPool::ProcessTasks() + 0x16d (0x7f308ef9b2dd in /miniconda/lib/python3.7/site-packages/k2/lib/libk2context.so)
frame #11: <unknown function> + 0xc819d (0x7f319829619d in /miniconda/lib/libstdc++.so.6)
frame #12: <unknown function> + 0x76db (0x7f31b763a6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #13: clone + 0x3f (0x7f31b7363a3f in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
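The error message itself suggests tuning max_split_size_mb to reduce fragmentation. If anyone wants to try that, a minimal sketch is below; the allocator config has to be set before the first CUDA allocation, and 512 is only an example value, not a recommendation.

```python
import os

# Must be set before torch makes its first CUDA allocation; the value
# here is an example only (see the PyTorch memory-management docs).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

import torch  # noqa: E402  (imported after setting the env var on purpose)
```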