k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Training Zipformer Transducer on 8k sample rate dataset #1518

Closed bharathraj-v closed 3 months ago

bharathraj-v commented 4 months ago

Hi

I modified the librispeech zipformer example to train on an 8 kHz dataset of around 100 hours. I created a separate data module that returns the train and valid cuts and dataloaders. I opted for DynamicBucketingSampler and, for the K2SpeechRecognitionDataset, PrecomputedFeatures as the input strategy. At first I tried computing the features via compute_and_store_features() with the Lhotse Fbank extractor configured as FbankConfig(num_mel_bins=80, sampling_rate=8000, device="cuda"), and it kept returning the following error:

ValueError: lilcom: Length of string was too short
[extra info] When calling: MonoCut.load_features(args=(MonoCut(id='21496', start=0.0, duration=29.74, channel=0, supervisions=[SupervisionSegment(id='21496', recording_id='7728ed9d-28d3-46c2-9287-261f6eb49773', start=0.0, duration=29.74, channel=0, text='తాసిల్దార్ గా రామదాసు మీద ఉన్నది అట్లాగే రాముని ఆలయానికి ఒక గోపురం ప్రకారం మండపం కట్టిం ఆలయాన్ని బాగు చేయవలసిన అవసరం ర్పడిం దీని కోసం ఒక నాడు గోపన్న మా ఊర్లో', language=None, speaker=None, gender=None, custom=None, alignment=None)], features=Features(type='kaldi-fbank', num_frames=2974, num_features=80, frame_shift=0.01, sampling_rate=8000, start=0.0, duration=29.74, storage_type='lilcom_chunky', storage_path='/home/azureuser/users/bharath/datasets/telugu_data_100hrs/feats-2.lca', storage_key='35788916,45765,44290,44628,44178,44696,42032', recording_id='None', channels=0), recording=Recording(id='7728ed9d-28d3-46c2-9287-261f6eb49773', sources=[AudioSource(type='file', channels=[0], source='/home/azureuser/users/bharath/datasets/telugu_data_100hrs/wav_8k/7728ed9d-28d3-46c2-9287-261f6eb49773.wav')], sampling_rate=8000, num_samples=237920, duration=29.74, channel_ids=[0], transforms=None), custom={'dataloading_info': {'rank': 0, 'world_size': 1, 'worker_id': None}}),) kwargs={})
[extra info] When calling: MixedCut.load_features(args=(MixedCut(id='23b8c1e9-3924-56de-3eb1-3b9046685257', tracks=[MixTrack(cut=MonoCut(id='21496', start=0.0, duration=29.74, channel=0, supervisions=[SupervisionSegment(id='21496', recording_id='7728ed9d-28d3-46c2-9287-261f6eb49773', start=0.0, duration=29.74, channel=0, text='తాసిల్దార్ గా రామదాసు మీ
ద ఉన్నది అట్లాగే రాముని ఆలయానికి ఒక గోపురం ప్రకారం మండపం కట్టిం ఆలయాన్ని బాగు చేయవలసిన అవసరం ఏర్పడిం దీని కోసం ఒక నాడు గోపన్న మా ఊర్లో', language=None, speaker=None, gender=None, custom=None, alignment=None)], features=Features(type='kaldi-fbank', num_frames=2974, num_features=80, frame_shift=0.01, sampling_rate=8000, start=0.0, duration=29.74, storage_type='lilcom_chunky', storage_path='/home/azureuser/users/bharath/datasets/telugu_data_100hrs/feats-2.lca', storage_key='35788916,45765,44290,44628,44178,44696,42032', recording_id='None', channels=0), recording=Recording(id='7728ed9d-28d3-46c2-9287-261f6eb49773', sources=[AudioSource(type='file', channels=[0], source='/home/azureuser/users/bharath/datasets/telugu_data_100hrs/wav_8k/7728ed9d-28d3-46c2-9287-261f6eb49773.wav')], sampling_rate=8000, num_samples=237920, duration=29.74, channel_ids=[0], transforms=None), custom={'dataloading_info': {'rank': 0, 'world_size': 1, 'worker_id': None}}), type='MonoCut', offset=0.0, snr=None), MixTrack(cut=PaddingCut(id='bdd640fb-0667-1ad1-1c80-317fa3b1799d', duration=0.0, sampling_rate=8000, feat_value=-23.025850929940457, num_frames=0, num_features=80, frame_shift=0.01, num_samples=0, video=None, custom=None), type='PaddingCut', offset=29.74, snr=None)], transforms=None),) kwargs={})
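
For context, a minimal sketch of the 8 kHz precompute step described above that produced this error; the manifest preparation and the storage path are placeholders, not the exact script:

from lhotse import Fbank, FbankConfig

# cuts_train: an 8 kHz CutSet built from the dataset's recordings and supervisions (assumed)
extractor = Fbank(FbankConfig(num_mel_bins=80, sampling_rate=8000, device="cuda"))
cuts_train = cuts_train.compute_and_store_features(
    extractor=extractor,
    storage_path="path/to/feats_8k",  # placeholder path
    num_jobs=1,  # single process, since the extractor runs on the GPU
)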

When I used KaldifeatFbank as the extractor via extractor = KaldifeatFbank(KaldifeatFbankConfig(device='cuda')) and resampled the cutset to 16 kHz before calling compute_and_store_features(), there were no errors and training started. But after a few epochs, I hit a completely new error, "Too many grads were not finite", and when I passed --inf-check True, I got another error.

I'm having a hard time understanding what's going on. What do I need to be wary of when training Zipformer on an 8 kHz dataset? Am I missing something in my approach? Any guidance would be of great help.

Thanks!

JinZr commented 4 months ago

hi,

for the lilcom: Length of string was too short problem, you can find a previous issue in the original lilcom repo here: https://github.com/danpovey/lilcom/issues/47

for the Too many grads were not finite error, we might need more details about your training setup.

best jin

yaozengwei commented 4 months ago

Hi, could you also show the log when you run the training script with --inf-check=True?

bharathraj-v commented 4 months ago

Hi, could you also show the log when you run the training script with --inf-check=True?

Traceback (most recent call last):
  File "/home/azureuser/users/bharath/icefall/egs/librispeech/ASR/zipformer/train.py", line 1414, in <module>
    main()
  File "/home/azureuser/users/bharath/icefall/egs/librispeech/ASR/zipformer/train.py", line 1407, in main
    run(rank=0, world_size=1, args=args)
  File "/home/azureuser/users/bharath/icefall/egs/librispeech/ASR/zipformer/train.py", line 1285, in run
    train_one_epoch(
  File "/home/azureuser/users/bharath/icefall/egs/librispeech/ASR/zipformer/train.py", line 961, in train_one_epoch
    loss, loss_info = compute_loss(
                      ^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/egs/librispeech/ASR/zipformer/train.py", line 806, in compute_loss
    simple_loss, pruned_loss, ctc_loss = model(
                                         ^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/egs/librispeech/ASR/zipformer/model.py", line 326, in forward
    encoder_out, encoder_out_lens = self.forward_encoder(x, x_lens)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/egs/librispeech/ASR/zipformer/model.py", line 132, in forward_encoder
    x, x_lens = self.encoder_embed(x, x_lens)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/egs/librispeech/ASR/zipformer/subsampling.py", line 309, in forward
    x = self.conv(x)
        ^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1574, in _call_impl
    hook_result = hook(self, args, result)
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/icefall/hooks.py", line 41, in forward_hook
    raise ValueError(
ValueError: The sum of encoder_embed.conv.0.output is not finite: tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         ...,

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],

for the Too many grads were not finite error, we might need more details about your training setup.

I'm training on 100 hours of Telugu-language data. I haven't changed any defaults from the example other than the librispeech data module and the tokenizer, which I replaced with a 1024-vocab BPE tokenizer trained on the same data with character coverage 1. I manually filtered the data to keep utterances longer than 2 s and shorter than 30 s.

The 100-hour dataset I'm using is not as clean as LibriSpeech and might be slightly more challenging. I've discarded the audio-length vs. text-length outliers, but since the dataset is noisier and more challenging, is it a bad idea to use it with the librispeech/ASR/zipformer example? Also, what transcription (ground-truth) WER makes a dataset good enough for Zipformer training? <3%?
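
A rough sketch of training such a tokenizer with sentencepiece; the transcript path is hypothetical, and the settings follow the values mentioned above:

import sentencepiece as spm

# Hypothetical file with one normalized training transcript per line.
spm.SentencePieceTrainer.train(
    input="data/lang_bpe_1024/transcript_words.txt",
    model_prefix="data/lang_bpe_1024/bpe",
    model_type="bpe",
    vocab_size=1024,
    character_coverage=1.0,
)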

yaozengwei commented 4 months ago

From the log "encoder_embed.conv.0.output is not finite", I suspect there might be some inf values in the input features. Could you check that?

bharathraj-v commented 4 months ago

Hi, sorry for the late reply. Here's how I'm computing the input features:

from lhotse import CutSet, KaldifeatFbank, KaldifeatFbankConfig, RecordingSet, SupervisionSet

# Build the training CutSet from the prepared manifests.
train_set = {
    "recordings": RecordingSet.from_file("/home/azureuser/users/bharath/datasets/telugu_data_100hrs/recordings_train.jsonl.gz"),
    "supervisions": SupervisionSet.from_file("/home/azureuser/users/bharath/datasets/telugu_data_100hrs/supervisions_train.jsonl.gz"),
}
cuts_train = CutSet.from_manifests(**train_set).trim_to_supervisions()

# GPU-based fbank extraction; the cuts are resampled to 16 kHz before feature computation.
extractor = KaldifeatFbank(KaldifeatFbankConfig(device='cuda'))
cuts_train = cuts_train.resample(16000).compute_and_store_features_batch(
    extractor=extractor,
    storage_path="/home/azureuser/users/bharath/datasets/telugu_data_100hrs/",
    num_workers=12,
)

Here's how I checked whether there are any inf values in the input features:

import numpy as np

inf_check = []

for i in range(len(cuts_train)):
    d = cuts_train[i].load_features()
    d = d.flatten()
    # Check every value in the feature matrix, not just the first one.
    inf_check.append(np.isinf(d).any())

print(sum(inf_check))  # number of cuts containing inf

The output was 0. So, if this method is valid for finding inf values, I don't think there are any in the input features. Is it possible that the inf values are being created during the training process for some reason?

bharathraj-v commented 4 months ago

I was able to train the model on a 200-hour dataset, similar to the one I had previously been trying to train on, after removing outliers by the duration/characters ratio with the threshold 0.115455 < dur/char < 0.685000. I chose those thresholds after observing the dur/char distribution because, as far as I understand, the inf grads are created during training by utterances with bad lengths (from this comment).
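
A minimal sketch of that ratio filter, assuming a lhotse CutSet named cuts_train with one supervision per cut:

def keep_by_dur_char_ratio(cut) -> bool:
    # Keep cuts whose duration-per-character ratio lies inside the thresholds above.
    text = cut.supervisions[0].text
    if not text:
        return False
    ratio = cut.duration / len(text)
    return 0.115455 < ratio < 0.685000

cuts_train = cuts_train.filter(keep_by_dur_char_ratio)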

The training completed successfully; here's the tensorboard for that run: tb-200hr

But when I used the exact same process to train on a bigger 1k-hour dataset, very similar to the previous ones, I'm facing this error again (with --inf-check=True) after 1 epoch.

2024-03-15 21:11:12,982 WARNING [hooks.py:69] The sum of encoder_embed.convnext.depthwise_conv.grad[0] is not finite
2024-03-15 21:11:12,982 WARNING [hooks.py:69] The sum of encoder_embed.convnext.grad[0] is not finite
2024-03-15 21:11:12,982 WARNING [hooks.py:69] The sum of encoder_embed.conv.grad[0] is not finite
2024-03-15 21:11:12,982 WARNING [hooks.py:69] The sum of encoder_embed.conv.9.grad[0] is not finite
2024-03-15 21:11:12,983 WARNING [hooks.py:69] The sum of encoder_embed.conv.8.grad[0] is not finite
2024-03-15 21:11:12,984 WARNING [hooks.py:79] The sum of encoder_embed.conv.7.weight.param_grad is not finite
2024-03-15 21:11:12,985 WARNING [hooks.py:79] The sum of encoder_embed.conv.7.bias.param_grad is not finite
2024-03-15 21:11:12,985 WARNING [hooks.py:69] The sum of encoder_embed.conv.7.grad[0] is not finite
2024-03-15 21:11:12,985 WARNING [hooks.py:69] The sum of encoder_embed.conv.6.grad[0] is not finite
2024-03-15 21:11:12,985 WARNING [hooks.py:69] The sum of encoder_embed.conv.5.grad[0] is not finite
2024-03-15 21:11:12,987 WARNING [hooks.py:79] The sum of encoder_embed.conv.4.weight.param_grad is not finite
2024-03-15 21:11:12,988 WARNING [hooks.py:79] The sum of encoder_embed.conv.4.bias.param_grad is not finite
2024-03-15 21:11:12,988 WARNING [hooks.py:69] The sum of encoder_embed.conv.4.grad[0] is not finite
2024-03-15 21:11:12,988 WARNING [hooks.py:69] The sum of encoder_embed.conv.3.grad[0] is not finite
2024-03-15 21:11:12,988 WARNING [hooks.py:69] The sum of encoder_embed.conv.2.grad[0] is not finite
2024-03-15 21:11:12,988 WARNING [hooks.py:69] The sum of encoder_embed.conv.1.grad[0] is not finite
2024-03-15 21:11:12,989 WARNING [hooks.py:69] The sum of encoder_embed.conv.0.grad[0] is not finite
2024-03-15 21:11:12,990 WARNING [hooks.py:79] The sum of encoder_embed.conv.0.weight.param_grad is not finite
2024-03-15 21:11:12,990 WARNING [hooks.py:79] The sum of encoder_embed.conv.0.bias.param_grad is not finite
2024-03-15 21:11:13,051 INFO [checkpoint.py:75] Saving checkpoint to runs/1000hr_2nd_run/bad-model-0.pt
2024-03-15 21:11:14,005 INFO [train.py:1348] Saving batch to runs/1000hr_2nd_run/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
2024-03-15 21:11:14,008 INFO [train.py:1354] features shape: torch.Size([29, 334, 80])
2024-03-15 21:11:14,009 INFO [train.py:1358] num tokens: 189
Traceback (most recent call last):
  File "/home/azureuser/users/bharath/icefall/egs/og-librispeech/ASR/zipformer/train.py", line 1421, in <module>
    main()
  File "/home/azureuser/users/bharath/icefall/egs/og-librispeech/ASR/zipformer/train.py", line 1414, in main
    run(rank=0, world_size=1, args=args)
  File "/home/azureuser/users/bharath/icefall/egs/og-librispeech/ASR/zipformer/train.py", line 1292, in run
    train_one_epoch(
  File "/home/azureuser/users/bharath/icefall/egs/og-librispeech/ASR/zipformer/train.py", line 962, in train_one_epoch
    loss, loss_info = compute_loss(
                     ^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/egs/og-librispeech/ASR/zipformer/train.py", line 807, in compute_loss
    simple_loss, pruned_loss, ctc_loss = model(
                                         ^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/egs/og-librispeech/ASR/zipformer/model.py", line 326, in forward
    encoder_out, encoder_out_lens = self.forward_encoder(x, x_lens)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/egs/og-librispeech/ASR/zipformer/model.py", line 132, in forward_encoder
    x, x_lens = self.encoder_embed(x, x_lens)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/egs/og-librispeech/ASR/zipformer/subsampling.py", line 309, in forward
    x = self.conv(x)
        ^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/anaconda/envs/k2_icefall/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1574, in _call_impl
    hook_result = hook(self, args, result)
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/icefall/icefall/hooks.py", line 41, in forward_hook
    raise ValueError(
ValueError: The sum of encoder_embed.conv.0.output is not finite: tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         ...,

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,

The tensorboard for that is tb-1000hr

I do see that training on the 1k hours is less stable, and the loss does not get as low as the ~0.1 mentioned in comments on some other issues. The recipe and method are identical for both runs, so this error seems to be caused by dataset quality. The steps I've taken to filter out bad data are the dur/char filtering mentioned before and removing all utterances above 12 seconds, as they are in the 0.01 percentile for the data I'm using.

I'm just looking for a better understanding of what sort of bad data could lead to this error and what steps I could take to remove those instances. Any input or help regarding this would be really valuable!

bharathraj-v commented 4 months ago

@yaozengwei, @csukuangfj, @JinZr Any suggestions on what I could do to improve data quality would be really helpful. Should I try training CTC instead? Could that help avoid this error?

JinZr commented 4 months ago

please try feature extraction using first-generation Kaldi first, then use lhotse to import the Kaldi data dir into lhotse-compatible manifests, and see if the same issue happens with the original Kaldi feature extraction.

there's not much we can do about corrupted data other than just filtering it out.
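
A rough sketch of that import, assuming features were computed with first-generation Kaldi into a standard data dir (wav.scp / text / feats.scp); the helper's exact signature and return values can differ across lhotse versions, and lhotse kaldi import is the CLI equivalent:

from lhotse.kaldi import load_kaldi_data_dir

# Hypothetical Kaldi data dir produced by first-generation Kaldi feature extraction.
recordings, supervisions, features = load_kaldi_data_dir(
    "data/kaldi/train",
    sampling_rate=8000,
    frame_shift=0.01,  # needed to interpret feats.scp as lhotse Features
)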

bharathraj-v commented 4 months ago

Thanks for your prompt response @JinZr. I wanted to ask whether you think it's possible that the data is fine and this issue is due to some other reason (like the dataset being too challenging for RNN-T).

In that case, should I try the zipformer_ctc model?

JinZr commented 4 months ago

What's the average duration of the recordings, and are they all extremely noisy?

Best Regards Jin

bharathraj-v commented 4 months ago

What's the average duration of the recordings, and are they all extremely noisy?

The average duration is 3.49s. The distribution is

count    949908.000000
mean          3.498416
std           1.497223
min           0.780000
25%           2.560000
50%           3.140000
75%           4.000000
90%           5.140000
95%           6.160000
99%           9.080000
max          56.200000

(This is before I filtered out dur/char outliers and audios <2s and >12s)

The data is not necessarily super noisy, but it has real-world background acoustics; it's not clean audio, though I believe not all of it is extremely noisy. In this case, if I use similar data but 10x the size (say 10k hours), could that be better?

JinZr commented 4 months ago

You could try filtering out utterances with a duration smaller than 1 s or larger than 20 s first.

Training on utterances of 56 seconds would be harmful, especially at the beginning of the training process.
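
A one-line sketch of that filter, assuming a lhotse CutSet:

cuts = cuts.filter(lambda c: 1.0 <= c.duration <= 20.0)  # drop <1 s and >20 s utterances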

Best Regards Jin

bharathraj-v commented 4 months ago

Apologies, I mentioned this in an edit after responding. The values I shared are from before the filtering; I did filter out dur/char outliers and audios <2s and >12s before the last run, and it still crashed with the inf grad error.

JinZr commented 4 months ago

You can also just filter out all acoustic features containing NaN; that's the only conclusion I can draw from the information you've provided so far.

If the RNN-T model couldn't converge well on this kind of data, I don't see why training a CTC model would help.

Also try doing feature extraction with other toolkits like I mentioned earlier, so you can tell whether the problem is with the toolkit or with the data; maybe some of the wav files are broken, but I can't say for sure.
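
A rough sketch of the NaN filtering suggested above, assuming precomputed lhotse features; loading every feature matrix is slow, so this would be a one-off cleanup pass:

import numpy as np

def has_finite_features(cut) -> bool:
    feats = cut.load_features()
    return feats is not None and np.isfinite(feats).all()

cuts_train = cuts_train.filter(has_finite_features)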

Best Regards

Jin

bharathraj-v commented 3 months ago

@JinZr, Thank you for your guidance! When I removed audios with durations >9 s, which was the 99th percentile for the data I was using, the error stopped. I was also using max duration 100 s instead of 1000 s, and the low max duration could have played a part as well; I'm not sure, but the run with max duration 1000 s and audio filtered to 2-9 s no longer hits the error.
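
A minimal sketch of that final setup; the filter is applied to the lhotse CutSet, and --max-duration is the existing zipformer/train.py flag controlling the total seconds of audio per batch:

# Keep only 2-9 s utterances (9 s is roughly the 99th percentile of this dataset).
cuts_train = cuts_train.filter(lambda c: 2.0 <= c.duration <= 9.0)

# Then train with a larger batch budget, e.g.:
#   ./zipformer/train.py --max-duration 1000 ...   (instead of 100)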

image

If you have any insights or questions about our experiments, I'm happy to answer.

JinZr commented 3 months ago

cool, glad to hear everything works fine now. 🎉

best jin
