k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

Error during run time in 1st epoch using conformer_ctc2 OTC #1288

Closed kerolos closed 1 month ago

kerolos commented 1 month ago

While training a conformer_ctc2 model, the run crashed in the first epoch with the error below. The data is in Kaldi format (segments of at most 20 seconds each), converted to Lhotse. I tried different features (SSL and fbank) and unfortunately got the same error.

Run script:
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2024-06-03 10:18:40,801 INFO [train_phone_kaldi.py:1040] Training started
2024-06-03 10:18:40,801 INFO [train_phone_kaldi.py:1041] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 1, 'reset_interval': 200, 'valid_interval': 1600, 'alignment_interval': 100, 'subsampling_factor': 2, 'cnn_module_kernel': 31, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.24.0.dev+git.4f014b1.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/mnt/srv/data/train_am/analysisTD/icefall_kaldi/icefall', 'k2-path': '/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'Hyrican-3', 'IP address': '127.0.1.1'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 20, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('/mnt/srv/data/train_am/analysisTD/icefall_kaldi/_am_cleanup_otc_alignments/uk/2024_05_29-20sec-v10_after-GMM-cleaning/exp/models/_model_try1'), 'lang_dir': '/mnt/srv/data/train_am/analysisTD/icefall_kaldi/_am_cleanup_otc_alignments/uk/2024_05_29-20sec-v10_after-GMM-cleaning/exp//lang', 'feature_dim': 768, 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'att_rate': 0.0, 'num_decoder_layers': 0, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 8000, 'keep_last_k': 10, 'average_period': 100, 'use_fp16': True, 'otc_token': '', 'allow_bypass_arc': True, 'allow_self_loop_arc': False, 'initial_bypass_weight': -19.0, 'initial_self_loop_weight': 3.75, 'bypass_weight_decay': 0.975, 'self_loop_weight_decay': 0.999, 'show_alignment': True, 'manifest_dir': PosixPath('/mnt/srv/data/train_am/analysisTD/icefall_kaldi/_am_cleanup_otc_alignments/uk/2024_05_29-20sec-v10_after-GMM-cleaning/data/ssl'), 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'input_strategy': 'PrecomputedFeatures', 'train_manifest': 'uk_cuts_train.jsonl.gz', 'dev_manifest': 'uk_cuts_train.jsonl.gz', 'test_manifest': 'kaldi_cuts_test.jsonl.gz'}
2024-06-03 10:18:40,993 INFO [lexicon.py:168] Loading pre-compiled /mnt/srv/data/train_am/analysisTD/icefall_kaldi/_am_cleanup_otc_alignments/uk/2024_05_29-20sec-v10_after-GMM-cleaning/exp/lang/Linv.pt

Error:
2024-05-31 15:42:22,402 INFO [train_phone_kaldi.py:979] Epoch 1, batch 7, loss[otc_loss=2.869, loss=2.869, over 3893.00 frames. utt_duration=599.6 frames, utt_pad_proportion=0.2326, over 26.00 utterances.], tot_loss[otc_loss=2.866, loss=2.866, over 33329.54 frames. utt_duration=609.2 frames, utt_pad_proportion=0.1548, over 219.07 utterances.], batch size: 27, lr: 3.00e-03
2024-05-31 15:42:23,249 INFO [train_phone_kaldi.py:979] Epoch 1, batch 8, loss[otc_loss=2.837, loss=2.837, over 4308.00 frames. utt_duration=718.2 frames, utt_pad_proportion=0.17, over 24.00 utterances.], tot_loss[otc_loss=2.863, loss=2.863, over 37470.89 frames. utt_duration=620 frames, utt_pad_proportion=0.1563, over 241.98 utterances.], batch size: 25, lr: 3.00e-03
[F] /var/www/k2/csrc/intersect_dense.cu:174:k2::MultiGraphDenseIntersect::MultiGraphDenseIntersect(k2::FsaVec&, k2::DenseFsaVec&, const k2::Array1&, float, int32_t, int32_t) Check failed: is_decreasing Sequences (DenseFsaVec) must be in sorted order from greatest to least length.
Current seq_len is: [ 182 182 182 182 182 182 182 182 182 181 180 180 180 180 180 180 180 178 175 176 169 152 120 103 79 54 40 40 ]

[ Stack-Trace: ]
/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/lib64/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x7f0d4bc2fdb4]
/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/lib64/libk2context.so(k2::internal::Logger::~Logger()+0x2a) [0x7f0cb26f700a]
/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/lib64/libk2context.so(k2::MultiGraphDenseIntersect::MultiGraphDenseIntersect(k2::Raggedk2::Arc&, k2::DenseFsaVec&, k2::Array1 const&, float, int, int)+0x1021) [0x7f0cb2895861]
/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/lib64/libk2context.so(k2::IntersectDense(k2::Raggedk2::Arc&, k2::DenseFsaVec&, k2::Array1 const, float, int, int, k2::Raggedk2::Arc, k2::Array1, k2::Array1)+0x1be) [0x7f0cb287eede]
/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x8968f) [0x7f0cb851068f]
/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x3cc5e) [0x7f0cb84c3c5e]
python(PyCFunction_Call+0x52) [0x4f5412]
python(_PyObject_MakeTpCall+0x3bb) [0x4e0cab]
python(_PyEval_EvalFrameDefault+0x4fb6) [0x4dcf36]
python(_PyEval_EvalCodeWithName+0x2f1) [0x4d6f41]
python(_PyFunction_Vectorcall+0x19c) [0x4e80cc]
python(PyObject_Call+0x24a) [0x4f746a]
/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object, _object)+0x60e) [0x7f0d50fa8f1e]
python(PyCFunction_Call+0xe5) [0x4f54a5]
python(_PyObject_MakeTpCall+0x3bb) [0x4e0cab]
python(_PyEval_EvalFrameDefault+0x4e04) [0x4dcd84]
python(_PyEval_EvalCodeWithName+0x2f1) [0x4d6f41]
python(_PyFunction_Vectorcall+0x19c) [0x4e80cc]
python(_PyEval_EvalFrameDefault+0x115a) [0x4d90da]
python(_PyEval_EvalCodeWithName+0x2f1) [0x4d6f41]
python(_PyFunction_Vectorcall+0x19c) [0x4e80cc]
python() [0x4f4ff4]
python(PyObject_Call+0x34e) [0x4f756e]
python(_PyEval_EvalFrameDefault+0x207d) [0x4d9ffd]
python(_PyEval_EvalCodeWithName+0x2f1) [0x4d6f41]
python(_PyFunction_Vectorcall+0x19c) [0x4e80cc]
python(_PyObject_FastCallDict+0x282) [0x4e0462]
python(_PyObject_Call_Prepend+0x60) [0x4f1d70]
python() [0x5abdf7]
python(_PyObject_MakeTpCall+0x3bb) [0x4e0cab]
python(_PyEval_EvalFrameDefault+0x4fb6) [0x4dcf36]

Traceback (most recent call last):
  File "./conformer_ctc2/train_phone_kaldi.py", line 1273, in <module>
    main()
  File "./conformer_ctc2/train_phone_kaldi.py", line 1266, in main
    run(rank=0, world_size=1, args=args)
  File "./conformer_ctc2/train_phone_kaldi.py", line 1174, in run
    train_one_epoch(
  File "./conformer_ctc2/train_phone_kaldi.py", line 909, in train_one_epoch
    loss, loss_info = compute_loss(
  File "./conformer_ctc2/train_phone_kaldi.py", line 674, in compute_loss
    otc_loss = k2.ctc_loss(
  File "/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/ctc_loss.py", line 203, in ctc_loss
    return m(
  File "/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/ctc_loss.py", line 92, in forward
    lattice = intersect_dense(
  File "/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/autograd.py", line 805, in intersect_dense
    _IntersectDenseFunction.apply(a_fsas, b_fsas, out_fsa, output_beam,
  File "/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/autograd.py", line 562, in forward
    ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense(
RuntimeError: Some bad things happened. Please read the above error messages and stack trace. If you are using Python, the following command may be helpful:

gdb --args python /path/to/your/code.py

(You can use gdb to debug the code. Please consider compiling a debug version of k2.).
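The failed check says the sequence lengths passed to the dense intersection must be non-increasing. The sketch below only illustrates that condition using the `seq_len` values printed in the log above. It is plain Python and does not call k2. The helper names `first_order_violation` and `descending_order` are hypothetical, introduced here for illustration.

```python
from typing import List, Optional

# Sequence lengths copied from the "Current seq_len is:" line in the log above.
seq_lens = [182] * 9 + [181] + [180] * 7 + [178, 175, 176, 169, 152,
                                            120, 103, 79, 54, 40, 40]


def first_order_violation(lens: List[int]) -> Optional[int]:
    """Return the first index i where lens[i] < lens[i + 1], or None if the
    list is already sorted from greatest to least length."""
    for i in range(len(lens) - 1):
        if lens[i] < lens[i + 1]:
            return i
    return None


def descending_order(lens: List[int]) -> List[int]:
    """Return indices that reorder `lens` from greatest to least length.
    Python's sort is stable, so equal lengths keep their original order."""
    return sorted(range(len(lens)), key=lambda i: lens[i], reverse=True)


bad = first_order_violation(seq_lens)
print("first violation at index:", bad)        # 175 is followed by 176
order = descending_order(seq_lens)
sorted_lens = [seq_lens[i] for i in order]
print("violation after reordering:", first_order_violation(sorted_lens))
```

In this batch the only out-of-order pair is 175 followed by 176, at positions 18 and 19 (0-based). A check like this can be run on the supervision lengths just before the loss is computed, to find which batch and which utterances trigger the failure.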

kerolos commented 1 month ago
This was solved by adding a call to cut_set.trim_to_supervisions after cut_set.compute_and_store_features in ./local/compute_fbank.py:
        cut_set = cut_set.trim_to_supervisions(
            keep_overlapping=False, min_duration=None
        )
        cut_set.to_file(output_dir / cuts_filename)
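As a rough illustration of what trimming does, the toy model below splits each cut into one cut per supervision, so that every resulting cut spans exactly one supervision. This is not Lhotse's implementation. The `Cut` and `Supervision` classes and the `trim_to_supervisions` function here are simplified stand-ins written for this sketch, and they ignore overlapping supervisions, features, and `min_duration`. Why this removes the sorting error is not stated in the thread. One plausible reading is that, before trimming, the cut durations used to order the batch did not match the supervision lengths.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Supervision:
    start: float     # seconds, relative to the start of the cut
    duration: float  # seconds
    text: str


@dataclass
class Cut:
    id: str
    duration: float  # seconds
    supervisions: List[Supervision]


def trim_to_supervisions(cuts: List[Cut]) -> List[Cut]:
    """Toy stand-in: emit one cut per supervision, covering only that
    supervision, so cut duration equals supervision duration."""
    trimmed = []
    for cut in cuts:
        for k, sup in enumerate(cut.supervisions):
            trimmed.append(
                Cut(
                    id=f"{cut.id}-{k}",
                    duration=sup.duration,
                    # The supervision now starts at the beginning of its cut.
                    supervisions=[Supervision(0.0, sup.duration, sup.text)],
                )
            )
    return trimmed


# A hypothetical 30-second cut holding two segments of different lengths.
cuts = [Cut("rec1", 30.0, [Supervision(0.5, 12.0, "a"),
                           Supervision(14.0, 15.5, "b")])]
trimmed = trim_to_supervisions(cuts)
for c in trimmed:
    print(c.id, c.duration, c.supervisions[0].text)
```

After trimming, ordering cuts by duration and ordering supervisions by length give the same result in this toy model, which is the property the failed check requires.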