lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Some problems when loading the TedLium3 dataset for transducer-stateless training #548

Open luomingshuang opened 2 years ago

luomingshuang commented 2 years ago

Currently, I am trying to build a transducer-stateless recipe based on TedLium3 for icefall (this is the PR: https://github.com/k2-fsa/icefall/pull/183). The PR contains the code for processing and loading the TedLium dataset. In train.py, we also use the function remove_short_and_long_utt for filtering.
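For reference, a minimal sketch of the kind of duration filter remove_short_and_long_utt applies (the exact bounds below are an assumption, not necessarily the values used in the PR):

    def remove_short_and_long_utt(c):
        # Keep only cuts with a "reasonable" duration; 1-20 s is an assumed, typical range.
        return 1.0 <= c.duration <= 20.0

    train_cuts = train_cuts.filter(remove_short_and_long_utt)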

When I use the following code to load the data in train.py:

tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()

There is an error:

2022-01-26 17:06:37,134 INFO [train.py:577] About to create model
2022-01-26 17:06:37,804 INFO [train.py:581] Number of model parameters: 84007924
2022-01-26 17:06:41,550 INFO [asr_datamodule.py:341] About to get train cuts
2022-01-26 17:06:53,750 INFO [train.py:618] Before removing short and long utterances: 7053
2022-01-26 17:06:53,751 INFO [train.py:619] After removing short and long utterances: 0
2022-01-26 17:06:53,751 INFO [train.py:620] Removed 7053 utterances (100.00000%)
2022-01-26 17:06:53,751 INFO [asr_datamodule.py:176] About to get Musan cuts
2022-01-26 17:06:55,585 INFO [asr_datamodule.py:183] Enable MUSAN
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:208] Enable SpecAugment
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:209] Time warp factor: 80
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:224] About to create train dataset
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:252] Using BucketingSampler.
Traceback (most recent call last):
  File "transducer_stateless/train.py", line 733, in <module>
    main()
  File "transducer_stateless/train.py", line 726, in main
    run(rank=0, world_size=1, args=args)
  File "transducer_stateless/train.py", line 622, in run
    train_dl = tedlium.train_dataloaders(train_cuts)
  File "/ceph-meixu/luomingshuang/icefall/egs/tedlium3/ASR/transducer_stateless/asr_datamodule.py", line 253, in train_dataloaders
    train_sampler = BucketingSampler(
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/sampling/bucketing.py", line 108, in __init__
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/sampling/bucketing.py", line 392, in create_buckets_equal_duration
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/sampling/bucketing.py", line 430, in _create_buckets_equal_duration_single
IndexError: pop from empty list

So I opened the manifest with vim data/fbank/cuts_train.json.gz, and it looks as follows:

[screenshot of cuts_train.json.gz: each cut spans an entire recording, so its duration is very long]

As the screenshot above shows, each cut's duration is far too long, so all the samples are filtered out.
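A quick way to confirm this is to load the manifest and inspect the cut durations directly; a sketch:

    from lhotse import load_manifest

    cuts = load_manifest("data/fbank/cuts_train.json.gz")
    # The cuts span entire recordings, so their durations are far above the filter's upper limit.
    print(max(c.duration for c in cuts))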

To fix this issue, I tried to use the following code to load the data:

tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()
train_cuts = train_cuts.trim_to_supervisions()

There is an error:

2022-01-26 17:17:15,419 INFO [train.py:577] About to create model
2022-01-26 17:17:16,063 INFO [train.py:581] Number of model parameters: 84007924
2022-01-26 17:17:19,781 INFO [asr_datamodule.py:341] About to get train cuts
2022-01-26 17:18:35,665 INFO [train.py:618] Before removing short and long utterances: 804789
2022-01-26 17:18:35,665 INFO [train.py:619] After removing short and long utterances: 801989
2022-01-26 17:18:35,665 INFO [train.py:620] Removed 2800 utterances (0.34792%)
2022-01-26 17:18:35,665 INFO [asr_datamodule.py:176] About to get Musan cuts
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:183] Enable MUSAN
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:208] Enable SpecAugment
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:209] Time warp factor: 80
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:224] About to create train dataset
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:252] Using BucketingSampler.
2022-01-26 17:18:41,597 INFO [asr_datamodule.py:268] About to create train dataloader
2022-01-26 17:18:41,597 INFO [asr_datamodule.py:348] About to get dev cuts
2022-01-26 17:18:41,695 INFO [asr_datamodule.py:289] About to create dev dataset
2022-01-26 17:18:41,696 INFO [asr_datamodule.py:308] About to create dev dataloader
2022-01-26 17:18:41,697 INFO [train.py:685] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
Traceback (most recent call last):
  File "transducer_stateless/train.py", line 733, in <module>
    main()
  File "transducer_stateless/train.py", line 726, in main
    run(rank=0, world_size=1, args=args)
  File "transducer_stateless/train.py", line 628, in run
    scan_pessimistic_batches_for_oom(
  File "transducer_stateless/train.py", line 690, in scan_pessimistic_batches_for_oom
    batch = train_dl.dataset[cuts]
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/speech_recognition.py", line 99, in __getitem__
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/speech_recognition.py", line 206, in validate_for_asr
AssertionError: Supervisions starting before the cut are not supported for ASR (sup id: ClayShirky_2005G-126, cut id: 7840abc4-ea04-003f-4314-ee1381d764dd)

As the error above shows, some supervisions in the short cuts have a negative start time (< 0) relative to the cut.
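One hedged workaround sketch is to drop the offending cuts before training; the helper name below is made up:

    def has_valid_supervisions(cut):
        # Keep only cuts whose supervisions lie fully inside the cut's boundaries.
        return all(s.start >= 0 and s.end <= cut.duration for s in cut.supervisions)

    train_cuts = train_cuts.filter(has_valid_supervisions)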

To fix this issue, I tried to use the following code to load the data:

tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()
train_cuts = train_cuts.trim_to_supervisions(keep_overlapping=False)

The data loads normally with the above code, but it takes a long time to generate each batch on the GPU, and the volatile GPU-Util stays at 0 for long stretches.

I am trying to turn the long cuts into short cuts before computing the fbank features (in compute_fbank_tedlium.py):

    cut_set = CutSet.from_manifests(
        recordings=m["recordings"],
        supervisions=m["supervisions"],
    ).trim_to_supervisions(keep_overlapping=False)

Is there any advice on this issue? Thanks!

luomingshuang commented 2 years ago

Now, when I transform the long cuts into short cuts before computing fbank with the following code, the GPU utilization is higher and training proceeds normally:

    cut_set = CutSet.from_manifests(
        recordings=m["recordings"],
        supervisions=m["supervisions"],
    ).trim_to_supervisions(keep_overlapping=False)
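For context, in such a script the trimmed cut set is then typically passed to feature computation; a sketch (the extractor config, storage path, and number of jobs are assumptions, not the exact values in compute_fbank_tedlium.py):

    from lhotse import Fbank, FbankConfig

    cut_set = cut_set.compute_and_store_features(
        extractor=Fbank(FbankConfig(num_mel_bins=80)),  # assumed, typical 80-dim fbank config
        storage_path="data/fbank/feats_train",          # hypothetical output path
        num_jobs=15,                                    # assumed parallelism
    )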
pzelasko commented 2 years ago

You basically discovered my intended use of these APIs the hard way :) but you're not the first person to ask these questions, which makes me wonder how I could improve this to make it less confusing for Lhotse users.

danpovey commented 2 years ago

Can someone please explain to me:

pzelasko commented 2 years ago

Sure.

  • what was going on here initially (why the error " Supervisions starting before the cut are not supported for ASR" happened)

When he ran .trim_to_supervisions(), it created one cut per supervision present in the CutSet. By default, these cuts contain all supervisions that overlap with them by at least 1% of their duration. This is to make sure that users are aware there is possibly overlapping speech in the data; they may then either filter these cuts out, or use the flag keep_overlapping=False, in which case there will be only one supervision per cut. I opted not to make the latter the default, as it could be disastrous with corpora where overlap is common.
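A short illustration of the two options described above (cuts is any CutSet produced from full recordings):

    # Default: one cut per supervision, but overlapping supervisions are kept on the cut.
    per_sup = cuts.trim_to_supervisions()

    # Option A: keep the default, then drop cuts that ended up with overlapping speech.
    no_overlap = per_sup.filter(lambda c: len(c.supervisions) == 1)

    # Option B: ask for exactly one supervision per cut up front.
    single_sup = cuts.trim_to_supervisions(keep_overlapping=False)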

  • Why it was slow when he did:

tedlium = TedLiumAsrDataModule(args)

train_cuts = tedlium.train_cuts()

train_cuts = train_cuts.trim_to_supervisions(keep_overlapping=False)

This was slow because the result of tedlium.train_cuts() contains long cuts (30min?) with a lot of supervisions. The current implementation of trim_to_supervisions creates an interval tree of supervisions for each cut to "quickly" determine which ones are overlapping. Quite possibly it's not the fastest implementation we can get, but at least it's not quadratic. There might be some overhead from creating a lot of Python objects too, I'm not sure without a profile.
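To make the interval-tree idea concrete, here is a small sketch using the intervaltree package with made-up supervision times (this is not Lhotse's actual code):

    from intervaltree import IntervalTree

    supervisions = [(0.0, 4.2), (4.1, 9.8), (30.0, 35.5)]  # (start, end) in seconds, made up
    tree = IntervalTree()
    for i, (start, end) in enumerate(supervisions):
        tree.addi(start, end, i)

    # Supervisions overlapping a candidate cut [3.0, 10.0) are found without
    # scanning every supervision against every cut:
    print(sorted(iv.data for iv in tree.overlap(3.0, 10.0)))  # -> [0, 1]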

(and why it helped when he did trim_to_supervisions(keep_overlapping=False) before computing fbank),

It shifted the "cost" of trimming to supervisions to an earlier stage, so that when he runs the training scripts, he simply reads "precomputed trims" of cuts.

  • and why we need "keep_overlapping=False" in the Tedlium setup. I would have thought overlapping supervisions would be rare since it is only one speaker, and it would be harmless to keep small overlaps.

It's not so easy -- if an overlapping supervision goes "outside" of the cut, we are missing a part of the audio that may correspond to some text, so we'd be introducing bad training examples. This can be fixed by extending the cut to cover the full overlapping supervision (I don't think we have a method for this yet).

Unfortunately, all it takes is one bad cut to get into these issues, unless we check for these things explicitly in the data prep scripts rather than in K2SpeechRecognitionDataset.

danpovey commented 2 years ago

Why not just let the cuts themselves overlap and keep just one supervision per cut, with no overlap detection?


pzelasko commented 2 years ago

Think of diarization: if you want to detect overlapping turns, it makes more sense to me when you have that information in a single training example (cut) encoded as multiple (overlapping) supervisions.

danpovey commented 2 years ago

Mm yes, but surely it wouldn't hurt to add another mode intended for ASR...


pzelasko commented 2 years ago

That's what keep_overlapping=False option is intended for (or filtering out all cuts with overlapping speech as a separate step) -- unless there is something else needed that I'm missing?

danpovey commented 2 years ago

mm, I guess I would have expected the cuts to overlap, so we can train on entire utterances. In ASR, part of a sup. segment is not so useful.


pzelasko commented 2 years ago

I think you mean a setup where trim_to_supervisions() would change:

|------cut---------|
|--s1---|
      |----s2--|

into

|------cut-----|
|--s1---|
      |----s2--|

instead of:

|----cut---|
|--s1---|
      |-s2-|

      |---cut--|
      |-s1-|
      |----s2--|

as it currently does. Is that right? (view in github to make sure the verticals line up correctly)

danpovey commented 2 years ago

Oh I'm sorry, I see now that I misinterpreted a figure, and that the cuts do overlap after all.