k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
Apache License 2.0

zipformer error #1255

Closed AlexandderGorodetski closed 8 months ago

AlexandderGorodetski commented 8 months ago

Hello guys,

I am using the tedlium recipe to train a k2 model on my in-house data.

I get the following error:

Do you have any recommendations?

2023-10-23 17:08:05,502 INFO [train.py:1064] Training started
2023-10-23 17:08:05,506 INFO [train.py:1074] Device: cuda:0
2023-10-23 17:08:05,509 INFO [train.py:1083] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.17.0.dev+git.a4701868.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '52c24df-dirty', 'icefall-git-date': 'Wed Oct 18 12:36:14 2023', 'icefall-path': '/workspace/inputs/alexg/asr/src/models/k2/icefall', 'k2-path': '/opt/conda/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/opt/conda/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'gpu-alex-speech-base-0-0', 'IP address': ''}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'rnnt_type': 'regular', 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 1, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 500}
2023-10-23 17:08:05,509 INFO [train.py:1085] About to create model
2023-10-23 17:08:05,943 INFO [train.py:1089] Number of model parameters: 65549011
2023-10-23 17:08:09,270 INFO [asr_datamodule.py:353] About to get train cuts
/opt/conda/lib/python3.8/site-packages/smart_open/smart_open_lib.py:250: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  warnings.warn(
/opt/conda/lib/python3.8/site-packages/lhotse/lazy.py:398: UserWarning: A lambda was passed to LazyFilter: it may prevent you from forking this process. If you experience issues with num_workers > 0 in torch.utils.data.DataLoader, try passing a regular function instead.
  warnings.warn(
2023-10-23 17:08:09,547 INFO [asr_datamodule.py:185] Enable SpecAugment
2023-10-23 17:08:09,547 INFO [asr_datamodule.py:186] Time warp factor: 80
2023-10-23 17:08:09,547 INFO [asr_datamodule.py:202] About to get Musan cuts
2023-10-23 17:08:09,547 INFO [asr_datamodule.py:205] Enable MUSAN
2023-10-23 17:08:11,741 INFO [asr_datamodule.py:227] About to create train dataset
2023-10-23 17:08:11,741 INFO [asr_datamodule.py:253] Using DynamicBucketingSampler.
2023-10-23 17:08:14,342 INFO [asr_datamodule.py:273] About to create train dataloader
2023-10-23 17:08:14,343 INFO [asr_datamodule.py:360] About to get dev cuts
2023-10-23 17:08:14,344 INFO [asr_datamodule.py:293] About to create dev dataset
2023-10-23 17:08:14,349 INFO [asr_datamodule.py:312] About to create dev dataloader
2023-10-23 17:08:14,350 INFO [train.py:1257] Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2023-10-23 17:09:32,540 INFO [train.py:1235] Saving batch to zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt
2023-10-23 17:09:32,614 INFO [train.py:1241] features shape: torch.Size([50, 1982, 80])
2023-10-23 17:09:32,615 INFO [train.py:1245] num tokens: 1976
Traceback (most recent call last):
  File "./zipformer/train.py", line 1308, in <module>
    main()
  File "./zipformer/train.py", line 1301, in main
    run(rank=0, world_size=1, args=args)
  File "./zipformer/train.py", line 1156, in run
    scan_pessimistic_batches_for_oom(
  File "./zipformer/train.py", line 1265, in scan_pessimistic_batches_for_oom
    loss, _ = compute_loss(
  File "./zipformer/train.py", line 766, in compute_loss
    simple_loss, pruned_loss = model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/inputs/alexg/asr/src/models/k2/icefall/egs/tedlium3/ASR/zipformer/model.py", line 128, in forward
    assert x.size(0) == x_lens.size(0) == y.dim0
AssertionError

csukuangfj commented 8 months ago

Please print the shape of x and y and the value of x_lens on assertion failure.

AlexandderGorodetski commented 8 months ago

I made the change that you requested. I am also afraid the problem may be that I use only 200 hours of data for training (this is just for debugging; I will increase the number of hours later). In any case, below is the output after reducing the number of model parameters. Unfortunately, I see the same error.

2023-10-23 17:54:46,703 INFO [train.py:1064] Training started
2023-10-23 17:54:46,721 INFO [train.py:1074] Device: cuda:0
2023-10-23 17:54:46,724 INFO [train.py:1083] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.17.0.dev+git.a4701868.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '52c24df-dirty', 'icefall-git-date': 'Wed Oct 18 12:36:14 2023', 'icefall-path': '/workspace/inputs/alexg/asr/src/models/k2/icefall', 'k2-path': '/opt/conda/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/opt/conda/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'gpu-alex-speech-base-0-0', 'IP address': ''}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'rnnt_type': 'regular', 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 1, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '1,1,1,1,1,1', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,128,128,128,128,128', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 500}
2023-10-23 17:54:46,724 INFO [train.py:1085] About to create model
2023-10-23 17:54:46,910 INFO [train.py:1089] Number of model parameters: 7672553
2023-10-23 17:54:48,466 INFO [asr_datamodule.py:353] About to get train cuts
/opt/conda/lib/python3.8/site-packages/smart_open/smart_open_lib.py:250: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  warnings.warn(
/opt/conda/lib/python3.8/site-packages/lhotse/lazy.py:398: UserWarning: A lambda was passed to LazyFilter: it may prevent you from forking this process. If you experience issues with num_workers > 0 in torch.utils.data.DataLoader, try passing a regular function instead.
  warnings.warn(
2023-10-23 17:54:48,661 INFO [asr_datamodule.py:185] Enable SpecAugment
2023-10-23 17:54:48,661 INFO [asr_datamodule.py:186] Time warp factor: 80
2023-10-23 17:54:48,661 INFO [asr_datamodule.py:202] About to get Musan cuts
2023-10-23 17:54:48,661 INFO [asr_datamodule.py:205] Enable MUSAN
2023-10-23 17:54:50,787 INFO [asr_datamodule.py:227] About to create train dataset
2023-10-23 17:54:50,788 INFO [asr_datamodule.py:253] Using DynamicBucketingSampler.
2023-10-23 17:54:53,354 INFO [asr_datamodule.py:273] About to create train dataloader 2023-10-23 17:54:53,355 INFO [asr_datamodule.py:360] About to get dev cuts 2023-10-23 17:54:53,356 INFO [asr_datamodule.py:293] About to create dev dataset 2023-10-23 17:54:53,361 INFO [asr_datamodule.py:312] About to create dev dataloader 2023-10-23 17:54:53,361 INFO [train.py:1257] Sanity check -- see if any of the batches in epoch 1 would cause OOM. x shape: torch.Size([50, 1982, 80]), y shape: [ [ x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] 
[ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x 
x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] ], x_lens: tensor([1982, 1154, 1120, 1091, 1083, 1052, 1043, 1034, 1024, 1023, 990, 962, 951, 951, 950, 949, 945, 936, 934, 922, 918, 914, 1143, 1122, 1090, 1074, 1063, 1046, 1038, 1028, 1017, 1017, 1017, 1016, 998, 984, 976, 974, 963, 962, 941, 939, 936, 933, 931, 928, 924, 924, 923], device='cuda:0', dtype=torch.int32)
2023-10-23 17:56:12,175 INFO [train.py:1235] Saving batch to zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt
2023-10-23 17:56:12,243 INFO [train.py:1241] features shape: torch.Size([50, 1982, 80])
2023-10-23 17:56:12,245 INFO [train.py:1245] num tokens: 1976
Traceback (most recent call last):
  File "./zipformer/train.py", line 1308, in <module>
    main()
  File "./zipformer/train.py", line 1301, in main
    run(rank=0, world_size=1, args=args)
  File "./zipformer/train.py", line 1156, in run
    scan_pessimistic_batches_for_oom(
  File "./zipformer/train.py", line 1265, in scan_pessimistic_batches_for_oom
    loss, _ = compute_loss(
  File "./zipformer/train.py", line 766, in compute_loss
    simple_loss, pruned_loss = model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/inputs/alexg/asr/src/models/k2/icefall/egs/tedlium3/ASR/zipformer/model.py", line 129, in forward
    assert x.size(0) == x_lens.size(0) == y.dim0
AssertionError

csukuangfj commented 8 months ago

What are the values of y.dim0, x.size(0), and x_lens.size(0)?

AlexandderGorodetski commented 8 months ago

2023-10-23 18:12:25,863 INFO [train.py:1257] Sanity check -- see if any of the batches in epoch 1 would cause OOM. x shape: torch.Size([50, 1982, 80]), y shape: [ [ x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 
x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x 
x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] ], x_lens: tensor([1982, 1154, 1120, 1091, 1083, 1052, 1043, 1034, 1024, 1023, 990, 962, 951, 951, 950, 949, 945, 936, 934, 922, 918, 914, 1143, 1122, 1090, 1074, 1063, 1046, 1038, 1028, 1017, 1017, 1017, 1016, 998, 984, 976, 974, 963, 962, 941, 939, 936, 933, 931, 928, 924, 924, 923], device='cuda:0', dtype=torch.int32), y.dim0: 49, x.size(0): 50, x_lens.size(0): 49
2023-10-23 18:13:44,954 INFO [train.py:1235] Saving batch to zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt
2023-10-23 18:13:45,023 INFO [train.py:1241] features shape: torch.Size([50, 1982, 80])
2023-10-23 18:13:45,025 INFO [train.py:1245] num tokens: 1976
Traceback (most recent call last):
  File "./zipformer/train.py", line 1308, in <module>
    main()
  File "./zipformer/train.py", line 1301, in main
    run(rank=0, world_size=1, args=args)
  File "./zipformer/train.py", line 1156, in run
    scan_pessimistic_batches_for_oom(
  File "./zipformer/train.py", line 1265, in scan_pessimistic_batches_for_oom
    loss, _ = compute_loss(
  File "./zipformer/train.py", line 766, in compute_loss
    simple_loss, pruned_loss = model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/inputs/alexg/asr/src/models/k2/icefall/egs/tedlium3/ASR/zipformer/model.py", line 129, in forward
    assert x.size(0) == x_lens.size(0) == y.dim0
AssertionError

AlexandderGorodetski commented 8 months ago

I have a question. Could it be because I have 's in my transcriptions?

csukuangfj commented 8 months ago

x contains an extra item. Please recheck your data.

AlexandderGorodetski commented 8 months ago

Could you please recommend what exactly I should check? Should I verify that all words are defined in the dictionary? Should I check that I do not have empty utterances? Do you have any intuition about what the problem could be?

desh2608 commented 8 months ago

You have 50 items in the batch for x but only 49 for x_lens and y. You need to check and make sure you are preparing the batches correctly.
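A minimal sketch of such a check, assuming the batch is a dictionary shaped like the one the recipe saves to disk (an `inputs` entry plus a `supervisions` entry holding a `text` list). The helper names `batch_sizes` and `is_consistent` are hypothetical, and `len()` is used so the same code works on tensors and on plain lists:

```python
def batch_sizes(batch):
    """Return (number of feature rows, number of transcripts) for one batch."""
    num_inputs = len(batch["inputs"])  # rows of x; len() of a tensor is size(0)
    num_texts = len(batch["supervisions"]["text"])  # one transcript per supervision
    return num_inputs, num_texts


def is_consistent(batch):
    """True when the batch has as many transcripts as feature rows."""
    num_inputs, num_texts = batch_sizes(batch)
    return num_inputs == num_texts


# Toy batch with the same kind of mismatch as in the log (3 rows, 2 transcripts).
toy_batch = {
    "inputs": [[0.0], [0.0], [0.0]],
    "supervisions": {"text": ["hello", "world"]},
}
print(batch_sizes(toy_batch), is_consistent(toy_batch))
```

Running a check like this over every batch from the dataloader would show how often the mismatch occurs before the model is ever called.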

AlexandderGorodetski commented 8 months ago

Frankly, I use the Tedlium recipe to build my ASR system. I prepared the manifest files in the same manner as the Tedlium recipe does, and I expected the training to run properly. Could you tell me which lines I should debug? I will do it.

desh2608 commented 8 months ago

You should put a breakpoint here and inspect the tensors feature, feature_lens, and y. You may also have to go into the K2SpeechRecognitionDataset in Lhotse to see where the issue is.

AlexandderGorodetski commented 8 months ago

I see that for most batches the values of x.size(0), x_lens.size(0), and y.dim0 are exactly the same. Maybe there are one or several problematic batches. I would like to skip them. Could you please recommend how I can change the code in order to skip the problematic batches?

desh2608 commented 8 months ago

I would suggest that you try to identify the cause of the issue instead of just ignoring the problematic batches.
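One way to look for the cause is to scan the cut manifest for cuts that do not carry exactly one supervision, since a cut without a supervision would contribute a feature row but no transcript. This is a sketch under that assumption, not a confirmed diagnosis. It relies only on each cut exposing `id` and `supervisions` attributes (as Lhotse cuts do), and the helper name `cuts_without_single_supervision` is made up for illustration:

```python
from types import SimpleNamespace


def cuts_without_single_supervision(cuts):
    """Return (cut id, supervision count) for every cut whose count is not 1."""
    return [
        (cut.id, len(cut.supervisions))
        for cut in cuts
        if len(cut.supervisions) != 1
    ]


# Stand-in objects so the sketch runs without Lhotse installed.
toy_cuts = [
    SimpleNamespace(id="cut-a", supervisions=["sup-1"]),
    SimpleNamespace(id="cut-b", supervisions=[]),
    SimpleNamespace(id="cut-c", supervisions=["sup-2", "sup-3"]),
]
print(cuts_without_single_supervision(toy_cuts))
```

With a real manifest, the same function could be applied to the training cuts before they are handed to the dataloader.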

AlexandderGorodetski commented 8 months ago

Guys, for faster debugging, I would like to reduce my training set from 200 hours to, let's say, 1 hour (of course, I will make sure that the problem does not disappear). Is this OK with you?

AlexandderGorodetski commented 8 months ago

Guys, is there someone here who implemented the following code?

train_dl = tedlium.train_dataloaders(
    train_cuts, sampler_state_dict=sampler_state_dict
)

It seems that this code (rarely) creates batches with different x and y sizes.

Could you recommend how I can debug it? Can I do it on a small part of the data, in order to reduce debugging time?

csukuangfj commented 8 months ago

2023-10-23 17:56:12,175 INFO [train.py:1235] Saving batch to zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt

From your error log, you can use

batch = torch.load("zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt")

and inspect the above batch to find out why it causes the error.

There is no need to re-run it from scratch.

AlexandderGorodetski commented 8 months ago

I slightly changed the database, and I still have the same error. I tried to load the batch as you recommended:

len(batch["supervisions"]["text"])
159
batch["inputs"].size()
torch.Size([162, 617, 80])

It seems that in this special case I am missing 3 text examples.
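To see which rows of `batch["inputs"]` lack a transcript, one option is to compare the row indices against `batch["supervisions"]["sequence_idx"]`. This assumes the batch comes from Lhotse's `K2SpeechRecognitionDataset`, whose documentation lists `sequence_idx` among the supervision fields. The helper name `rows_without_supervision` is hypothetical:

```python
def rows_without_supervision(num_inputs, sequence_idx):
    """Return the sorted row indices of the inputs that no supervision points to."""
    covered = {int(i) for i in sequence_idx}
    return [row for row in range(num_inputs) if row not in covered]


# Toy values: 5 feature rows, supervisions only for rows 0, 1 and 3.
print(rows_without_supervision(5, [0, 1, 3]))  # → [2, 4]
```

For the saved batch, this could be called as `rows_without_supervision(batch["inputs"].size(0), batch["supervisions"]["sequence_idx"].tolist())`.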

Could you recommend next steps?

AlexandderGorodetski commented 8 months ago

Can we verify that filter bank extraction was performed properly?

Could you please recommend a sanity check for the filter bank extractor?

csukuangfj commented 8 months ago

We have detailed tests for the filterbank extractor and it is very unlikely there are errors with it. If you think there are issues with the filterbank extractor, please provide more information about it.

AlexandderGorodetski commented 8 months ago

My suspicion is only about this command

cut_set = cut_set.trim_to_supervisions(keep_overlapping=False)

This command fails with the following error:

Traceback (most recent call last):
  File "local/compute_fbank_tedlium_1.py", line 117, in <module>
    compute_fbank_tedlium()
  File "local/compute_fbank_tedlium_1.py", line 103, in compute_fbank_tedlium
    cut_set.to_file(output_dir / f"{prefix}cuts{partition}.{suffix}")
  File "/opt/conda/lib/python3.8/site-packages/lhotse/serialization.py", line 532, in to_file
    store_manifest(self, path)
  File "/opt/conda/lib/python3.8/site-packages/lhotse/serialization.py", line 517, in store_manifest
    manifest.to_jsonl(path)
  File "/opt/conda/lib/python3.8/site-packages/lhotse/serialization.py", line 300, in to_jsonl
    save_to_jsonl(self.to_dicts(), path)
  File "/opt/conda/lib/python3.8/site-packages/lhotse/serialization.py", line 125, in save_to_jsonl
    for item in data:
  File "/opt/conda/lib/python3.8/site-packages/lhotse/cut/set.py", line 662, in <genexpr>
    return (cut.to_dict() for cut in self)
  File "/opt/conda/lib/python3.8/site-packages/lhotse/lazy.py", line 165, in values
    yield from self
  File "/opt/conda/lib/python3.8/site-packages/lhotse/lazy.py", line 465, in __iter__
    for cuts in self.iterator:
  File "/opt/conda/lib/python3.8/site-packages/lhotse/cut/set.py", line 3263, in _trim_to_supervisions_single
    return cuts.trim_to_supervisions(
  File "/opt/conda/lib/python3.8/site-packages/lhotse/cut/base.py", line 492, in trim_to_supervisions
    assert (
AssertionError: Trimmed cut has supervisions with different channels. Either set keep_all_channels=True to keep original channels or keep_overlapping=False to retain only 1 supervision per trimmed cut.

The error goes away once I use the following command:

cut_set = cut_set.trim_to_supervisions(keep_overlapping=False, keep_all_channels=True)

But this is strange, because all my wave files are mono (single channel)...

Could you explain what the option keep_all_channels=True means?

desh2608 commented 8 months ago

It may be possible that your original recordings and supervisions have some issues. I would suggest:

from lhotse.utils import fix_manifests

recordings, supervisions = fix_manifests(recordings, supervisions)
cut_set = CutSet.from_manifests(recordings, supervisions)

Can you do this before trim_to_supervisions(keep_overlapping=False) and see if the AssertionError still happens?
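As a rough illustration of the kind of clean-up this step is meant to perform, the sketch below drops supervisions that refer to unknown recordings and recordings that have no supervisions. It is plain Python on hypothetical `Rec` and `Sup` records, not Lhotse's implementation, and the real function may do more or differ in details:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rec:
    """Hypothetical stand-in for a recording manifest entry."""
    id: str


@dataclass(frozen=True)
class Sup:
    """Hypothetical stand-in for a supervision manifest entry."""
    id: str
    recording_id: str


def drop_unmatched(recordings, supervisions):
    """Keep only recordings and supervisions that reference each other."""
    rec_ids = {r.id for r in recordings}
    kept_sups = [s for s in supervisions if s.recording_id in rec_ids]
    used_ids = {s.recording_id for s in kept_sups}
    kept_recs = [r for r in recordings if r.id in used_ids]
    return kept_recs, kept_sups


recs = [Rec("r1"), Rec("r2")]
sups = [Sup("s1", "r1"), Sup("s2", "r3")]  # s2 points at a missing recording
kept_recs, kept_sups = drop_unmatched(recs, sups)
print([r.id for r in kept_recs], [s.id for s in kept_sups])
```

In the toy data, `r2` has no supervision and `s2` has no recording, so both are dropped and only the matching pair survives.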

AlexandderGorodetski commented 8 months ago

lhotse.utils does not include fix_manifests.

Did you mean to use the following?

from lhotse.qa import fix_manifests


AlexandderGorodetski commented 8 months ago

Or maybe I should use a newer version of lhotse?

desh2608 commented 8 months ago

Sorry. I meant lhotse.qa

AlexandderGorodetski commented 8 months ago

Guys, could you please approve the following change?

The following code:

        cut_set = CutSet.from_manifests(

I will replace it with the following:

        recordings, supervisions = fix_manifests(m["recordings"], m["supervisions"])
        cut_set = CutSet.from_manifests(

Could you please confirm that this is OK with you?

AlexandderGorodetski commented 8 months ago

Guys, thank you so much for your support. It seems that the fix_manifests function solved all my issues. I guess it is something similar to fix_data_dir in Kaldi. I recommend adding this change to the tedlium recipe. I would be happy to do it, but I am not sure if I have permission to create a pull request.

desh2608 commented 8 months ago

All Lhotse recipes already have it: https://github.com/lhotse-speech/lhotse/pull/1128. This means that any icefall ASR recipe which calls Lhotse based manifest preparation would have it by default.