k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

conformer_ctc/decode.py Failed to decode using a GPU with 11GB memory #293

Closed · ahban closed this issue 7 months ago

ahban commented 2 years ago

Egs: AIshell

After several tries, I still fail to run conformer_ctc/decode.py with the following command:

conformer_ctc/decode.py --exp-dir ./val-ban-1 --epoch 89 --avg 10 --method attention-decoder --num-paths 100  --max-duration 5
2022-04-06 02:20:44,369 INFO [decode.py:473] Decoding started
2022-04-06 02:20:44,370 INFO [decode.py:474] {'subsampling_factor': 4, 'feature_dim': 80, 'nhead': 4, 'attention_dim': 512, 'num_encoder_layers': 12, 'num_decoder_layers': 6, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'search_beam': 20, 'output_beam': 7, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'env_info': {'k2-version': '1.8', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '646704e142438bcd1aaf4a6e32d95e5ccd93a174', 'k2-git-date': 'Thu Sep 16 13:05:12 2021', 'lhotse-version': '1.0.0.dev+git.bc74329.clean', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.6', 'icefall-git-branch': 'master', 'icefall-git-sha1': '395a3f9-dirty', 'icefall-git-date': 'Wed Mar 23 11:11:34 2022', 'icefall-path': '/home/data/aban/devel/docker-k2/icefall', 'k2-path': '/usr/local/lib/python3.6/dist-packages/k2/__init__.py', 'lhotse-path': '/usr/local/lib/python3.6/dist-packages/lhotse/__init__.py', 'hostname': '5961fd6be3f9', 'IP address': '192.168.0.5'}, 'epoch': 89, 'avg': 10, 'method': 'attention-decoder', 'num_paths': 100, 'nbest_scale': 0.5, 'exp_dir': PosixPath('val-ban-1'), 'lang_dir': PosixPath('data/lang_char'), 'lm_dir': PosixPath('data/lm'), 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 5, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True}
2022-04-06 02:20:44,820 INFO [lexicon.py:176] Loading pre-compiled data/lang_char/Linv.pt
2022-04-06 02:20:44,846 INFO [decode.py:484] device: cuda:0
2022-04-06 02:21:27,296 INFO [decode.py:533] averaging ['val-ban-1/epoch-80.pt', 'val-ban-1/epoch-81.pt', 'val-ban-1/epoch-82.pt', 'val-ban-1/epoch-83.pt', 'val-ban-1/epoch-84.pt', 'val-ban-1/epoch-85.pt', 'val-ban-1/epoch-86.pt', 'val-ban-1/epoch-87.pt', 'val-ban-1/epoch-88.pt', 'val-ban-1/epoch-89.pt']
2022-04-06 02:21:36,660 INFO [decode.py:540] Number of model parameters: 115125888
2022-04-06 02:21:36,660 INFO [asr_datamodule.py:366] About to get test cuts
/usr/local/lib/python3.6/dist-packages/lhotse/dataset/sampling/simple.py:238: UserWarning: The first cut drawn in batch collection violates the max_frames, max_cuts, or max_duration constraints - we'll return it anyway. Consider increasing max_frames/max_cuts/max_duration.
  "The first cut drawn in batch collection violates "
2022-04-06 02:21:38,918 INFO [decode.py:405] batch 0/?, cuts processed until now is 1
Traceback (most recent call last):
  File "conformer_ctc/decode.py", line 572, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "conformer_ctc/decode.py", line 558, in main
    eos_id=eos_id,
  File "conformer_ctc/decode.py", line 387, in decode_dataset
    eos_id=eos_id,
  File "conformer_ctc/decode.py", line 319, in decode_one_batch
    nbest_scale=params.nbest_scale,
  File "/home/aban/devel/docker-k2/icefall/icefall/decode.py", line 835, in rescore_with_attention_decoder
    nbest = nbest.intersect(lattice)
  File "/home/aban/devel/docker-k2/icefall/icefall/decode.py", line 330, in intersect
    sorted_match_a=True,
  File "/home/aban/devel/docker-k2/icefall/icefall/decode.py", line 60, in _intersect_device
    a_fsas, fsas, b_to_a_map=b_to_a, sorted_match_a=sorted_match_a
  File "/usr/local/lib/python3.6/dist-packages/k2/fsa_algo.py", line 195, in intersect_device
    b_to_a_map, need_arc_map, sorted_match_a)
RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 10.92 GiB total capacity; 6.45 GiB already allocated; 3.31 GiB free; 6.92 GiB reserved in total by PyTorch)

It seems that there is not enough GPU memory, even though I have set --max-duration to a very small value. I suspect the attention-decoder rescoring consumes a lot of GPU memory. Would moving the rescoring step to the CPU solve this? If so, how?
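For reference, a minimal sketch of what "moving the rescoring to the CPU" could look like: a generic wrapper that retries a rescoring call on the CPU when the GPU runs out of memory. Here `rescore_fn` stands in for a function like icefall's rescore_with_attention_decoder; the wrapper and its argument handling are illustrative, not icefall's actual API, and assume the lattice object supports .to(device) the way a k2.Fsa does.

import torch

# Hedged sketch only: retry a rescoring call on the CPU after a CUDA OOM.
def rescore_with_cpu_fallback(rescore_fn, lattice, model, **kwargs):
    try:
        return rescore_fn(lattice=lattice, model=model, **kwargs)
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()
        cpu = torch.device("cpu")
        # Move the lattice, the model, and any tensor arguments
        # (e.g. the encoder memory) to the CPU and retry there.
        cpu_kwargs = {
            k: (v.to(cpu) if isinstance(v, torch.Tensor) else v)
            for k, v in kwargs.items()
        }
        return rescore_fn(lattice=lattice.to(cpu), model=model.to(cpu), **cpu_kwargs)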

pkufool commented 2 years ago

Could you please try 1best decoding first to see if the model is converging well? According to your logs, decoding did not reach the attention decoder at all. I suspect your lattice is too large; try using a smaller output_beam and max_active_states.

ahban commented 2 years ago

1best works well, but 98% of the GPU memory is allocated. I am trying to shrink max_active_states.

pkufool commented 2 years ago

1best works well, but 98% of the GPU memory is allocated. I am trying to shrink max_active_states.

I meant the CER: is it as good as expected? 1best decoding should not consume so much memory. Anyway, try decreasing search_beam, output_beam, and max_active_states.
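For reference, a rough sketch of where those values could be tightened, assuming they are set in the params dictionary used by conformer_ctc/decode.py (the decode.py:474 log line above shows search_beam: 20, output_beam: 7, max_active_states: 10000); the concrete numbers below are only examples, not recommended settings.

from icefall.utils import AttributeDict  # assumed import path, as used by icefall recipes

# Illustrative values only: smaller beams and fewer active states shrink the
# lattice (and its GPU memory use) at the cost of possibly pruning good paths.
params = AttributeDict(
    {
        "search_beam": 15,          # the log above shows the default 20
        "output_beam": 5,           # the log above shows the default 7
        "min_active_states": 30,
        "max_active_states": 7000,  # the log above shows the default 10000
    }
)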

danpovey commented 2 years ago

What are the final loss values, e.g. the final line in your log file that says "Epoch xx, batch xxx"?

ahban commented 2 years ago

CER for 1best:

2022-04-06 03:22:15,814 INFO [utils.py:407] [test-no_rescore] %WER 6.74% [7060 / 104765, 112 ins, 1473 del, 5475 sub ]

final loss values:

2022-04-05 20:55:37,598 INFO [train.py:512] (1/2) Epoch 89, batch 12160, loss[ctc_loss=0.05017, att_loss=0.1513, loss=0.1209, over 1739.00 frames.], tot_loss[ctc_loss=0.07525, att_loss=0.2027, loss=0.1644, over 334052.84 frames.], batch size: 8
2022-04-05 20:55:37,600 INFO [train.py:512] (0/2) Epoch 89, batch 12160, loss[ctc_loss=0.06977, att_loss=0.1519, loss=0.1273, over 1600.00 frames.], tot_loss[ctc_loss=0.07497, att_loss=0.2029, loss=0.1645, over 334191.86 frames.], batch size: 7
pkufool commented 2 years ago

The loss values seem normal, but the CER for 1best decoding is a little worse than expected; it should look like this:

%WER = 4.99
Errors: 53 insertions, 350 deletions, 4825 substitutions, over 104765 reference words (99590 correct)
Search below for sections starting with PER-UTT DETAILS:, SUBSTITUTIONS:, DELETIONS:, INSERTIONS:, PER-WORD STATS:

Did you try smaller beams and fewer active states?

ahban commented 2 years ago

Smaller beams and fewer active states give worse CER results.

@pkufool Was your 'WER = 4.99' obtained using only the Aishell training data?

My model was trained only on the Aishell dataset.

pkufool commented 2 years ago

Yes, only the Aishell dataset. Someone has reproduced the result with our recipe; see https://github.com/k2-fsa/icefall/issues/112#issuecomment-975146755.

ahban commented 2 years ago

@pkufool

I have reproduced the results with the following setup:

lhotse=1.0.0
k2 version: 1.13
Build type: Release
Git SHA1: 47c4b754bb418b2a40c3ee0f24ca5ed12b08997f
Git date: Sat Jan 29 09:39:32 2022
Cuda used to build k2: 11.1
cuDNN used to build k2: 8.0.4
Python version used to build k2: 3.8
OS used to build k2: Ubuntu 18.04.6 LTS
CMake version: 3.18.4
GCC version: 7.5.0
CMAKE_CUDA_FLAGS:  --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 --expt-extended-lambda -gencode arch=compute_80,code=sm_80 --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.8.1
PyTorch is using Cuda: 11.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
GPU: RTX 3090 24G
OS: Ubuntu 18.04
conda install -c k2-fsa -c pytorch -c conda-forge k2=1.13.dev20220129 python=3.8 cudatoolkit=11.1 pytorch=1.8.1 torchaudio=0.8.1
python3 conformer_ctc/train.py --bucketing-sampler True \
                              --max-duration 200 \
                              --start-epoch 0 \
                              --num-epochs 90 \
                              --world-size 4 > train.log

python3 conformer_ctc/decode.py --nbest-scale 0.5 \
                               --epoch 84 \
                               --avg 25 \
                               --method attention-decoder \
                               --max-duration 20 \
                               --num-paths 100

# best CER = 4.26%


Thanks!!!