luomingshuang opened this issue:
Now, when I transform the long cuts into short cuts before computing fbank with the following code, the GPU utilization gets higher and the training process runs normally:
cut_set = CutSet.from_manifests(
    recordings=m["recordings"],
    supervisions=m["supervisions"],
).trim_to_supervisions(keep_overlapping=False)
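For context, a minimal sketch of how this trimming step can feed directly into feature extraction during data prep (the Fbank settings, storage path, and job count below are illustrative assumptions, not values from the recipe):

```python
from lhotse import Fbank, FbankConfig

# Continuing from the trimmed `cut_set` above: compute fbank on the already
# short, single-supervision cuts and store the features on disk, so that the
# training script only reads precomputed data.
cut_set = cut_set.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),  # illustrative config
    storage_path="data/fbank/feats_train",          # illustrative path
    num_jobs=4,                                     # illustrative
)
cut_set.to_file("data/fbank/cuts_train.json.gz")
```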
You basically discovered my intended use of these APIs the hard way :) but you're not the first person to ask these questions, which makes me wonder how I could improve this to make it less confusing for Lhotse users.
Can someone please explain to me:
- what was going on here initially (why the error "Supervisions starting before the cut are not supported for ASR" happened),
- why it was slow when he did:
  tedlium = TedLiumAsrDataModule(args)
  train_cuts = tedlium.train_cuts()
  train_cuts = train_cuts.trim_to_supervisions(keep_overlapping=False)
  (and why it helped when he did trim_to_supervisions(keep_overlapping=False) before computing fbank),
- and why we need "keep_overlapping=False" in the Tedlium setup. I would have thought overlapping supervisions would be rare since it is only one speaker, and it would be harmless to keep small overlaps.
Sure.
- what was going on here initially (why the error "Supervisions starting before the cut are not supported for ASR" happened)

When he ran .trim_to_supervisions(), it creates one cut per supervision present in the CutSet. By default, these cuts will contain all supervisions that overlap them by at least 1% of their duration. This is to make sure that users are aware there is possibly overlapping speech in the data, and may either filter these cuts out or use the flag keep_overlapping=False, in which case there will be only one supervision per cut. I opted not to make the latter the default, as it could be disastrous with corpora where overlap is common.
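A small illustration of the difference between the two modes (cuts_long below is a placeholder for any CutSet of long, multi-supervision cuts, such as the output of tedlium.train_cuts()):

```python
# Default: each output cut covers one supervision, but also keeps any other
# supervision that overlaps it by at least 1% of its duration.
with_overlaps = cuts_long.trim_to_supervisions()

# keep_overlapping=False: exactly one supervision per output cut.
single_sup = cuts_long.trim_to_supervisions(keep_overlapping=False)

# Cuts that ended up with more than one supervision signal overlapping speech.
overlapped = [c for c in with_overlaps if len(c.supervisions) > 1]
print(f"{len(overlapped)} cuts contain overlapping supervisions")
```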
- Why it was slow when he did:
tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()
train_cuts = train_cuts.trim_to_supervisions(keep_overlapping=False)
This was slow because the result of tedlium.train_cuts() contains long cuts (30 min?) with a lot of supervisions. The current implementation of trim_to_supervisions creates an interval tree of supervisions for each cut to "quickly" determine which ones are overlapping. Quite possibly it's not the fastest implementation we can get, but at least it's not quadratic. There might be some overhead from creating a lot of Python objects too; I'm not sure without a profile.
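If someone wanted to confirm where the time goes, a rough profiling sketch could look like this (cuts is a placeholder for the long-cut CutSet; the explicit iteration forces the work in case the operation is lazy in the installed Lhotse version):

```python
import cProfile
import pstats

# `cuts` is a placeholder for the long, supervision-heavy CutSet being trimmed.
with cProfile.Profile() as pr:
    # Iterate the result to force the trimming work to actually happen.
    sum(1 for _ in cuts.trim_to_supervisions(keep_overlapping=False))

pstats.Stats(pr).sort_stats("cumulative").print_stats(20)
```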
- (and why it helped when he did trim_to_supervisions(keep_overlapping=False) before computing fbank)

It shifted the "cost" of trimming to supervisions to an earlier stage, so that when he runs the training script, it simply reads "precomputed trims" of cuts.
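In other words, the expensive step runs once during data preparation, and the training side only deserializes the result, roughly like this (the manifest path is the one mentioned later in this thread):

```python
from lhotse import load_manifest

# The cuts were already trimmed and had fbank computed during data prep,
# so loading them involves no overlap bookkeeping at training time.
train_cuts = load_manifest("data/fbank/cuts_train.json.gz")
```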
- and why we need "keep_overlapping=False" in the Tedlium setup. I would have thought overlapping supervisions would be rare since it is only one speaker, and it would be harmless to keep small overlaps.
It's not so easy -- if an overlapping supervision goes "outside" of the cut, we are missing a part of the audio that may correspond to some text, so we'd be introducing bad training examples. This could be fixed by extending the cut to cover the full overlapping supervision (I don't think we have a method for this yet).
Unfortunately, all it takes is one bad cut to get into these issues, unless we check for these things explicitly in the data prep scripts rather than in K2SpeechRecognitionDataset.
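One way to make such a check explicit during data prep (instead of letting K2SpeechRecognitionDataset raise the error) could be a filter along these lines; the function name is illustrative and cut_set stands for the trimmed CutSet:

```python
def supervisions_fit_inside_cut(cut) -> bool:
    # After trimming, supervision times are relative to the cut start, so a
    # negative start or an end past the cut's duration means the supervision
    # sticks out of the cut (in practice a small tolerance may be useful).
    return all(s.start >= 0 and s.end <= cut.duration for s in cut.supervisions)

clean_cuts = cut_set.filter(supervisions_fit_inside_cut)
```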
Why not just let the cuts themselves overlap and keep just one supervision per cut, with no overlap detection?
Think of diarization: if you want to detect overlapping turns, it makes more sense to me when you have that information in a single training example (cut) encoded as multiple (overlapping) supervisions.
Mm yes, but surely it wouldn't hurt to add another mode intended for ASR...
That's what the keep_overlapping=False option is intended for (or filtering out all cuts with overlapping speech as a separate step) -- unless there is something else needed that I'm missing?
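For example, the separate filtering step could be as simple as this sketch (assuming trim_to_supervisions was run with the default keep_overlapping=True, so overlap shows up as extra supervisions on a cut):

```python
# Keep only cuts that carry a single supervision, i.e. no overlapping speech
# was detected within the cut's span.
non_overlapping = cut_set.filter(lambda c: len(c.supervisions) == 1)
```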
Mm, I guess I would have expected the cuts to overlap, so we can train on entire utterances. In ASR, part of a supervision segment is not so useful.
I think you mean a setup where trim_to_supervisions() would change:

|------cut---------|
  |--s1---|
       |----s2--|

into

  |------cut-----|
  |--s1---|
       |----s2--|

instead of:

  |----cut---|
  |--s1---|
       |-s2-|

       |---cut--|
       |-s1-|
       |----s2--|

as it currently does. Is that right? (View on GitHub to make sure the verticals line up correctly.)
Oh I'm sorry, I see now that I misinterpreted a figure, and that the cuts do overlap after all.
Currently, I am trying to build a transducer-stateless recipe based on Tedlium3 for icefall. This is the PR: https://github.com/k2-fsa/icefall/pull/183. It shows the concrete code for processing and loading the Tedlium dataset. In train.py, we also use the function remove_short_and_long_utt for filtering. When I use my initial data-loading code in train.py, there is an error.
So I read cuts_train.json.gz with vim data/fbank/cuts_train.json.gz. As that file shows, each cut's duration is very long, so those samples get filtered out.
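For reference, remove_short_and_long_utt is roughly of this form (the duration bounds below are illustrative placeholders, not the exact values from the PR):

```python
def remove_short_and_long_utt(c) -> bool:
    # Keep only cuts whose duration falls in a range suitable for ASR
    # training; the bounds here are placeholders for illustration.
    return 1.0 <= c.duration <= 20.0

train_cuts = train_cuts.filter(remove_short_and_long_utt)
```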
To fix this issue, I changed the data-loading code, but there is another error. As the error shows, some short cuts' start time may be negative (< 0). To fix that, I changed the data-loading code once more.
With that version the data loads normally, but it takes a long time to generate a batch on the GPU, and the volatile GPU-Util stays at 0 for long stretches.
I am trying to change the long cuts into short cuts before computing the fbank features (in compute_fbank_tedlium.py).
Is there any advice on this issue? Thanks!