Closed AlexandderGorodetski closed 8 months ago
Please print the shape of x and y and the value of x_lens on assertion failure.
I made the change that you requested. I am also afraid the problem may be that I use only 200 hours of data for training (but that is just for debugging; I will increase the number of hours later). In any case, the following is the output after reducing the number of model parameters. Unfortunately, I see the same error.
2023-10-23 17:54:46,703 INFO [train.py:1064] Training started
2023-10-23 17:54:46,721 INFO [train.py:1074] Device: cuda:0
2023-10-23 17:54:46,724 INFO [train.py:1083] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.17.0.dev+git.a4701868.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '52c24df-dirty', 'icefall-git-date': 'Wed Oct 18 12:36:14 2023', 'icefall-path': '/workspace/inputs/alexg/asr/src/models/k2/icefall', 'k2-path': '/opt/conda/lib/python3.8/site-packages/k2/init.py', 'lhotse-path': '/opt/conda/lib/python3.8/site-packages/lhotse/init.py', 'hostname': 'gpu-alex-speech-base-0-0', 'IP address': '10.244.2.49'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'rnnt_type': 'regular', 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 1, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '1,1,1,1,1,1', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,128,128,128,128,128', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': 
False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 500}
2023-10-23 17:54:46,724 INFO [train.py:1085] About to create model
2023-10-23 17:54:46,910 INFO [train.py:1089] Number of model parameters: 7672553
2023-10-23 17:54:48,466 INFO [asr_datamodule.py:353] About to get train cuts
/opt/conda/lib/python3.8/site-packages/smart_open/smart_open_lib.py:250: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
warnings.warn(
/opt/conda/lib/python3.8/site-packages/lhotse/lazy.py:398: UserWarning: A lambda was passed to LazyFilter: it may prevent you from forking this process. If you experience issues with num_workers > 0 in torch.utils.data.DataLoader, try passing a regular function instead.
warnings.warn(
2023-10-23 17:54:48,661 INFO [asr_datamodule.py:185] Enable SpecAugment
2023-10-23 17:54:48,661 INFO [asr_datamodule.py:186] Time warp factor: 80
2023-10-23 17:54:48,661 INFO [asr_datamodule.py:202] About to get Musan cuts
2023-10-23 17:54:48,661 INFO [asr_datamodule.py:205] Enable MUSAN
2023-10-23 17:54:50,787 INFO [asr_datamodule.py:227] About to create train dataset
2023-10-23 17:54:50,788 INFO [asr_datamodule.py:253] Using DynamicBucketingSampler.
2023-10-23 17:54:53,354 INFO [asr_datamodule.py:273] About to create train dataloader
2023-10-23 17:54:53,355 INFO [asr_datamodule.py:360] About to get dev cuts
2023-10-23 17:54:53,356 INFO [asr_datamodule.py:293] About to create dev dataset
2023-10-23 17:54:53,361 INFO [asr_datamodule.py:312] About to create dev dataloader
2023-10-23 17:54:53,361 INFO [train.py:1257] Sanity check -- see if any of the batches in epoch 1 would cause OOM.
x shape: torch.Size([50, 1982, 80]), y shape: [ [ x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x 
x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 
x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] ], x_lens: tensor([1982, 1154, 1120, 1091, 1083, 1052, 1043, 1034, 1024, 1023, 990, 962,
951, 951, 950, 949, 945, 936, 934, 922, 918, 914, 1143, 1122,
1090, 1074, 1063, 1046, 1038, 1028, 1017, 1017, 1017, 1016, 998, 984,
976, 974, 963, 962, 941, 939, 936, 933, 931, 928, 924, 924,
923], device='cuda:0', dtype=torch.int32)
2023-10-23 17:56:12,175 INFO [train.py:1235] Saving batch to zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt
2023-10-23 17:56:12,243 INFO [train.py:1241] features shape: torch.Size([50, 1982, 80])
2023-10-23 17:56:12,245 INFO [train.py:1245] num tokens: 1976
Traceback (most recent call last):
File "./zipformer/train.py", line 1308, in
What are the values of y.dim0, x.size(0), and x_lens.size(0)?
2023-10-23 18:12:25,863 INFO [train.py:1257] Sanity check -- see if any of the batches in epoch 1 would cause OOM.
x shape: torch.Size([50, 1982, 80]), y shape: [ [ x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x 
x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 
x x x x x x x x x x x x x x ] [ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ] ], x_lens: tensor([1982, 1154, 1120, 1091, 1083, 1052, 1043, 1034, 1024, 1023, 990, 962,
951, 951, 950, 949, 945, 936, 934, 922, 918, 914, 1143, 1122,
1090, 1074, 1063, 1046, 1038, 1028, 1017, 1017, 1017, 1016, 998, 984,
976, 974, 963, 962, 941, 939, 936, 933, 931, 928, 924, 924,
923], device='cuda:0', dtype=torch.int32), y.dim0: 49, x.size(0): 50, x_lens.size(0): 49
2023-10-23 18:13:44,954 INFO [train.py:1235] Saving batch to zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt
2023-10-23 18:13:45,023 INFO [train.py:1241] features shape: torch.Size([50, 1982, 80])
2023-10-23 18:13:45,025 INFO [train.py:1245] num tokens: 1976
Traceback (most recent call last):
File "./zipformer/train.py", line 1308, in
I have a question. Maybe it is because I have
x contains an extra item. Please recheck your data.
Could you please recommend what exactly I should check? Maybe I should verify that all words are defined in the dictionary? Maybe I should check that I do not have empty utterances? Do you have some intuition about what the problem could be?
You have 50 items in the batch for x but only 49 for x_lens and y. You need to check and make sure you are preparing the batches correctly.
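A minimal sketch of the comparison described above, for anyone who wants to run the same check outside the model. It assumes the batch is a dict laid out like the one saved by train.py, with the features under batch["inputs"] and the texts under batch["supervisions"]["text"]. The "num_frames" key for the feature lengths is an assumption about the batch layout, and the helper names are hypothetical. Only len() is used, so plain lists work as stand-ins for tensors.

```python
def batch_item_counts(batch):
    """Return the number of items along the batch axis for each field."""
    supervisions = batch["supervisions"]
    return {
        "inputs": len(batch["inputs"]),                 # corresponds to x.size(0)
        "num_frames": len(supervisions["num_frames"]),  # corresponds to x_lens.size(0) (assumed key)
        "text": len(supervisions["text"]),              # corresponds to y.dim0
    }


def is_consistent(batch):
    """True when every field reports the same number of items."""
    return len(set(batch_item_counts(batch).values())) == 1


# Toy stand-in that reproduces the sizes reported in the log above.
toy_batch = {
    "inputs": [[0.0]] * 50,
    "supervisions": {"num_frames": [1] * 49, "text": ["x"] * 49},
}
print(batch_item_counts(toy_batch))
print(is_consistent(toy_batch))
```

Calling is_consistent on each batch before the forward pass would point at the offending batch without waiting for the assertion inside the model.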
Frankly, I use the Tedlium recipe to build my ASR. I just prepared the manifest files in the same manner as the Tedlium recipe does, and I expected the training to run properly. Could you recommend which lines I should debug? I will do it.
You should put a breakpoint here and inspect the tensors feature, feature_lens, and y. You may also have to go into the K2SpeechRecognitionDataset in Lhotse to see where the issue is.
I see that for most batches the values of x.size(0), x_lens.size(0), and y.dim0 are exactly the same. Maybe there are one or several problematic batches. I would like to skip them. Could you please recommend how I can change the code in order to skip the problematic batches?
I would suggest that you try to identify the cause of the issue instead of just ignoring the problematic batches.
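One way to look for the cause, sketched under assumptions: since x counts cuts while x_lens and y count supervisions, a cut that carries zero (or several) supervisions would produce exactly this kind of mismatch. The helpers below are hypothetical and only rely on each cut exposing .id and .supervisions, as Lhotse cuts do. The toy objects are stand-ins, not real cuts.

```python
from collections import Counter
from types import SimpleNamespace


def supervision_count_histogram(cuts):
    """Count how many cuts carry 0, 1, 2, ... supervisions."""
    return Counter(len(cut.supervisions) for cut in cuts)


def cuts_without_single_supervision(cuts):
    """Return the ids of cuts that do not carry exactly one supervision."""
    return [cut.id for cut in cuts if len(cut.supervisions) != 1]


# Stand-in objects: one normal cut, one empty cut, one cut with two supervisions.
toy_cuts = [
    SimpleNamespace(id="utt-a", supervisions=["hello world"]),
    SimpleNamespace(id="utt-b", supervisions=[]),
    SimpleNamespace(id="utt-c", supervisions=["foo", "bar"]),
]
print(supervision_count_histogram(toy_cuts))
print(cuts_without_single_supervision(toy_cuts))
```

Running the same two functions over the training cuts should show whether any cut deviates from one supervision per cut, which is a starting point for tracing the problem back to the manifests.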
Guys, for faster debugging, I would like to decrease my training database from 200 hours to, let's say, 1 hour (of course I will make sure that the problem does not disappear). Is this OK with you?
Guys, is there someone here who implemented the following code?
train_dl = tedlium.train_dataloaders(
train_cuts, sampler_state_dict=sampler_state_dict
)
It seems that this code (rarely) creates batches with different x and y sizes.
Could you recommend how I can debug it? Can I do it on a small part of the data, in order to reduce debugging time?
2023-10-23 17:56:12,175 INFO [train.py:1235] Saving batch to zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt
From your error log, you can use
batch = torch.load("zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt")
and inspect the above batch to find out why it causes the error.
There is no need to re-run it from scratch.
I slightly changed the database, and I still have the same error. I tried to load the batch as you recommended:
len(batch["supervisions"]["text"])
159
batch["inputs"].size()
torch.Size([162, 617, 80])
It seems that in this particular case I am missing 3 text examples.
Could you recommend next steps?
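A possible next step, as a sketch: find which rows of batch["inputs"] have no supervision at all. This assumes the saved batch also contains batch["supervisions"]["sequence_idx"], which in Lhotse's K2SpeechRecognitionDataset maps each supervision to its row in the inputs. That key does not appear in the output above, so treat it as an assumption. The function names are hypothetical.

```python
from collections import Counter


def supervisions_per_row(num_rows, sequence_idx):
    """For each input row, count how many supervisions point at it."""
    counts = Counter(int(i) for i in sequence_idx)
    return [counts[row] for row in range(num_rows)]


def rows_without_supervision(num_rows, sequence_idx):
    """Return the indices of input rows that no supervision refers to."""
    per_row = supervisions_per_row(num_rows, sequence_idx)
    return [row for row, n in enumerate(per_row) if n == 0]


# Toy stand-in: 6 input rows, but supervisions only for rows 0, 1, 3, and 5.
toy_sequence_idx = [0, 1, 3, 5]
print(rows_without_supervision(6, toy_sequence_idx))
```

With the real batch this would be called as rows_without_supervision(batch["inputs"].size(0), batch["supervisions"]["sequence_idx"]). In the case above it should list three rows if the assumption holds.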
Can we verify that the filter bank extraction was performed properly?
Could you please recommend a sanity check for the filter bank extractor?
We have detailed tests for the filterbank extractor and it is very unlikely there are errors with it. If you think there are issues with the filterbank extractor, please provide more information about it.
My suspicion is only about this command:
cut_set = cut_set.trim_to_supervisions(keep_overlapping=False)
This command fails with the following error:
Traceback (most recent call last):
File "local/compute_fbank_tedlium_1.py", line 117, in keep_all_channels=True
to keep original channels or keep_overlapping=False
to retain only 1 supervision per trimmed cut.
This error is resolved once I use the following command:
cut_set = cut_set.trim_to_supervisions(keep_overlapping=False, keep_all_channels=True)
But it is strange, because all my waves are mono, with a single channel...
Could you explain the meaning of the option keep_all_channels=True?
It may be possible that your original recordings and supervisions have some issues. I would suggest:
from lhotse.utils import fix_manifests
recordings, supervisions = fix_manifests(recordings, supervisions)
cut_set = CutSet.from_manifests(recordings, supervisions)
Can you do this before trim_to_supervisions(keep_overlapping=False) and see if the AssertionError still happens?
lhotse.utils does not include fix_manifests.
Do you mean to use the following?
from lhotse.qa import fix_manifests
Or maybe I should use a newer version of lhotse?
Sorry. I meant lhotse.qa
Guys, could you please approve the following change?
The following code
cut_set = CutSet.from_manifests(
recordings=m["recordings"],
supervisions=m["supervisions"],
)
will be replaced with the following one:
recordings, supervisions = fix_manifests(m["recordings"], m["supervisions"])
cut_set = CutSet.from_manifests(
recordings=recordings,
supervisions=supervisions,
)
Could you please confirm that this is OK with you?
Guys, thank you so much for your support. It seems that the fix_manifests function solved all my issues. I guess it is something similar to fix_data_dir in Kaldi. I recommend adding this change to the Tedlium recipe. I will be happy to do it, but I am not sure whether I have permission to create a pull request.
All Lhotse recipes already have it: https://github.com/lhotse-speech/lhotse/pull/1128. This means that any icefall ASR recipe which calls Lhotse based manifest preparation would have it by default.
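For readers unfamiliar with this kind of cleanup, here is a simplified illustration of the idea, comparable to Kaldi's fix_data_dir. It is not Lhotse's implementation. It uses plain dicts instead of RecordingSet and SupervisionSet, the function name drop_unmatched is hypothetical, and it covers only one part of the cleanup: dropping recordings and supervisions that have no counterpart.

```python
def drop_unmatched(recordings, supervisions):
    """Keep only recordings and supervisions that refer to each other."""
    recording_ids = {rec["id"] for rec in recordings}
    referenced_ids = {sup["recording_id"] for sup in supervisions}
    # A recording without any supervision would later become a cut with no text.
    kept_recordings = [rec for rec in recordings if rec["id"] in referenced_ids]
    # A supervision whose recording is missing cannot be turned into a cut.
    kept_supervisions = [
        sup for sup in supervisions if sup["recording_id"] in recording_ids
    ]
    return kept_recordings, kept_supervisions


# Toy manifests: rec2 has no supervision, and s2 points at a missing recording.
toy_recordings = [{"id": "rec1"}, {"id": "rec2"}]
toy_supervisions = [
    {"id": "s1", "recording_id": "rec1"},
    {"id": "s2", "recording_id": "rec3"},
]
print(drop_unmatched(toy_recordings, toy_supervisions))
```

In the actual pipeline, fix_manifests from lhotse.qa is the function to call, as shown earlier in the thread.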
Hello guys,
I am using the Tedlium recipe to train a K2 model on my in-house data.
I get the following error:
Do you have any recommendations?
2023-10-23 17:08:05,502 INFO [train.py:1064] Training started
2023-10-23 17:08:05,506 INFO [train.py:1074] Device: cuda:0
2023-10-23 17:08:05,509 INFO [train.py:1083] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.17.0.dev+git.a4701868.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '52c24df-dirty', 'icefall-git-date': 'Wed Oct 18 12:36:14 2023', 'icefall-path': '/workspace/inputs/alexg/asr/src/models/k2/icefall', 'k2-path': '/opt/conda/lib/python3.8/site-packages/k2/init.py', 'lhotse-path': '/opt/conda/lib/python3.8/site-packages/lhotse/init.py', 'hostname': 'gpu-alex-speech-base-0-0', 'IP address': '10.244.2.49'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'rnnt_type': 'regular', 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 1, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': 
False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 500}
2023-10-23 17:08:05,509 INFO [train.py:1085] About to create model
2023-10-23 17:08:05,943 INFO [train.py:1089] Number of model parameters: 65549011
2023-10-23 17:08:09,270 INFO [asr_datamodule.py:353] About to get train cuts
/opt/conda/lib/python3.8/site-packages/smart_open/smart_open_lib.py:250: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
warnings.warn(
/opt/conda/lib/python3.8/site-packages/lhotse/lazy.py:398: UserWarning: A lambda was passed to LazyFilter: it may prevent you from forking this process. If you experience issues with num_workers > 0 in torch.utils.data.DataLoader, try passing a regular function instead.
warnings.warn(
2023-10-23 17:08:09,547 INFO [asr_datamodule.py:185] Enable SpecAugment
2023-10-23 17:08:09,547 INFO [asr_datamodule.py:186] Time warp factor: 80
2023-10-23 17:08:09,547 INFO [asr_datamodule.py:202] About to get Musan cuts
2023-10-23 17:08:09,547 INFO [asr_datamodule.py:205] Enable MUSAN
2023-10-23 17:08:11,741 INFO [asr_datamodule.py:227] About to create train dataset
2023-10-23 17:08:11,741 INFO [asr_datamodule.py:253] Using DynamicBucketingSampler.
2023-10-23 17:08:14,342 INFO [asr_datamodule.py:273] About to create train dataloader
2023-10-23 17:08:14,343 INFO [asr_datamodule.py:360] About to get dev cuts
2023-10-23 17:08:14,344 INFO [asr_datamodule.py:293] About to create dev dataset
2023-10-23 17:08:14,349 INFO [asr_datamodule.py:312] About to create dev dataloader
2023-10-23 17:08:14,350 INFO [train.py:1257] Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2023-10-23 17:09:32,540 INFO [train.py:1235] Saving batch to zipformer/exp/batch-b8db0672-f42d-47cc-00d4-af5974273ca3.pt
2023-10-23 17:09:32,614 INFO [train.py:1241] features shape: torch.Size([50, 1982, 80])
2023-10-23 17:09:32,615 INFO [train.py:1245] num tokens: 1976
Traceback (most recent call last):
File "./zipformer/train.py", line 1308, in
main()
File "./zipformer/train.py", line 1301, in main
run(rank=0, world_size=1, args=args)
File "./zipformer/train.py", line 1156, in run
scan_pessimistic_batches_for_oom(
File "./zipformer/train.py", line 1265, in scan_pessimistic_batches_for_oom
loss, _ = compute_loss(
File "./zipformer/train.py", line 766, in compute_loss
simple_loss, pruned_loss = model(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/inputs/alexg/asr/src/models/k2/icefall/egs/tedlium3/ASR/zipformer/model.py", line 128, in forward
assert x.size(0) == x_lens.size(0) == y.dim0
AssertionError