lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

The mismatch between the marked duration and the actual audio duration in WenetSpeech #644

Open luomingshuang opened 2 years ago

luomingshuang commented 2 years ago

I am using k2 and Lhotse for WenetSpeech ASR experiments, but I ran into an error, shown below:

[image]

I then checked the actual duration of this sample (its marked duration is 786.44s):

[image]

I find that the actual duration is 988.89s.

[image]

I also opened an issue, https://github.com/wenet-e2e/WenetSpeech/issues/33, with the WenetSpeech group. If they correct the marked duration for this sample, this case will be fine. But if the same error appears in other datasets, how should we deal with it? Could we add a function that filters out the bad sample (and emits a warning) in https://github1s.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L203-L204?
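A minimal sketch of the kind of filter being asked for, assuming the marked and actual durations are both already known for each sample. The `Sample` class, the `filter_mismatched` function, and the 0.025s tolerance are hypothetical names and values chosen for illustration; none of them are part of Lhotse.

```python
import logging
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Sample:
    # Hypothetical stand-in for a recording entry; not a Lhotse class.
    sample_id: str
    marked_duration: float  # seconds, as declared in the dataset metadata
    actual_duration: float  # seconds, as measured from the audio file


def filter_mismatched(
    samples: Iterable[Sample], tolerance: float = 0.025
) -> Iterator[Sample]:
    """Yield only the samples whose marked and actual durations agree.

    Samples that differ by more than `tolerance` seconds are skipped
    with a warning instead of raising an error.
    """
    for sample in samples:
        diff = abs(sample.marked_duration - sample.actual_duration)
        if diff > tolerance:
            logging.warning(
                "Skipping %s: marked duration %.2fs, actual duration %.2fs",
                sample.sample_id,
                sample.marked_duration,
                sample.actual_duration,
            )
            continue
        yield sample
```

With the durations reported above (786.44s marked, 988.89s actual), such a sample would be dropped and logged rather than stopping the whole run.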

pzelasko commented 2 years ago

Thanks for raising this issue. We should port the fault_tolerant argument to CutSet.compute_and_store_features to handle these cases properly. I'll try to do it, but I'm not sure when I'll find the time. The steps needed are:

  1. Convert this code snippet to a regular for loop: https://github.com/lhotse-speech/lhotse/blob/bc74329bde080773abffb9da911f10cdc67bc7bb/lhotse/cut.py#L4455-L4465

  2. Use the suppress_and_warn context manager to suppress audio loading related exceptions (example usage: https://github.com/lhotse-speech/lhotse/blob/bc74329bde080773abffb9da911f10cdc67bc7bb/lhotse/dataset/collation.py#L484-L490)

  3. Add a fault_tolerant: bool = False option to CutSet.compute_and_store_features
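The three steps above can be sketched roughly as follows. This is a self-contained illustration, not Lhotse's code: `suppress_and_warn_sketch` is a hypothetical stand-in for the real suppress_and_warn, `compute_all` and `extract` are made-up names, and the exception types are assumptions.

```python
import logging
from contextlib import contextmanager
from typing import Any, Callable, Iterable, List, Type


@contextmanager
def suppress_and_warn_sketch(*exceptions: Type[BaseException], enabled: bool = True):
    """Stand-in (not Lhotse's implementation): swallow the listed
    exceptions and log a warning; do nothing special when disabled."""
    if not enabled:
        yield
        return
    try:
        yield
    except exceptions as e:
        logging.warning("Skipping item after error: %s: %s", type(e).__name__, e)


def compute_all(
    items: Iterable[Any],
    extract: Callable[[Any], Any],
    fault_tolerant: bool = False,  # step 3: opt-in flag, off by default
) -> List[Any]:
    results = []
    for item in items:  # step 1: a plain for loop instead of a comprehension
        # step 2: a failure on one item is logged and skipped when enabled
        with suppress_and_warn_sketch(ValueError, OSError, enabled=fault_tolerant):
            results.append(extract(item))
    return results
```

Keeping the flag off by default means existing callers still see the original exception unless they explicitly ask for bad items to be skipped.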

luomingshuang commented 2 years ago

Cool! I think the above changes are useful for me.

luomingshuang commented 2 years ago

Hi @pzelasko, have you updated Lhotse to handle this issue?

luomingshuang commented 2 years ago

Or can we use try ... except to filter out the bad sample in local/compute_xxx_fbank.py?

pzelasko commented 2 years ago

@luomingshuang can you test this PR? It seems to work but I don't have the means to test more thoroughly at the moment https://github.com/lhotse-speech/lhotse/pull/683

luomingshuang commented 2 years ago

Of course. I will test it right now.

luomingshuang commented 2 years ago

I am using this file to compute features: egs/wenetspeech/ASR/local/compute_fbank_wenetspeech_splits.py (https://github.com/luomingshuang/icefall/blob/wenetspeech-pruned-transducer-stateless2/egs/wenetspeech/ASR/local/compute_fbank_wenetspeech_splits.py). I reinstalled Lhotse from PR #683, but another error occurs: [image]

Should I also make changes to compute_fbank_wenetspeech_splits.py?

luomingshuang commented 2 years ago

Oh, I think this is a different problem. The modified code does skip the sample with the duration mismatch.

luomingshuang commented 2 years ago

Or the reason may be that the index passed to __getitem__ exceeds self.num set in __init__.

I added a print statement in lhotse/cut.py (line 5024) as follows:

    def __getitem__(self, cut_id_or_index: Union[int, str]) -> "Cut":
        if isinstance(cut_id_or_index, str):
            return self.cuts[cut_id_or_index]
        # ~100x faster than list(dict.values())[index] for 100k elements
        print(len(self.cuts), cut_id_or_index)
        return next(
            val for idx, val in enumerate(self.cuts.values()) if idx == cut_id_or_index
        )

Here is some of the output: [image]

From the output above, the error occurs when len(self.cuts) == 0. I am not sure whether it comes from the data or from the modified code. But I remember that if I remove the mismatched sample, the original code runs normally.

csukuangfj commented 2 years ago

You have to check that

val for idx, val in enumerate(self.cuts.values()) if idx == cut_id_or_index

is not empty, I think.
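One way to make that check explicit is to give next() a default value, so that an empty or too-short collection produces a clear IndexError instead of a bare StopIteration. This is only a sketch on a plain dict; `get_by_id_or_index` is a hypothetical helper, not the method in lhotse/cut.py.

```python
from typing import Any, Dict, Union


def get_by_id_or_index(cuts: Dict[str, Any], cut_id_or_index: Union[int, str]) -> Any:
    """Look up an item by string ID or by integer position."""
    if isinstance(cut_id_or_index, str):
        return cuts[cut_id_or_index]
    # A unique sentinel distinguishes "nothing found" from a stored None.
    missing = object()
    found = next(
        (val for idx, val in enumerate(cuts.values()) if idx == cut_id_or_index),
        missing,  # returned instead of raising StopIteration
    )
    if found is missing:
        raise IndexError(
            f"index {cut_id_or_index} out of range for {len(cuts)} item(s)"
        )
    return found
```

The caller can then either catch IndexError or skip empty collections before indexing into them.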

luomingshuang commented 2 years ago

Following the suggestion from @csukuangfj, I added if len(cuts) == 0: continue at line 4651 of cut.py from #683. With this change, the StopIteration error mentioned above is fixed.

pzelasko commented 2 years ago

I updated https://github.com/lhotse-speech/lhotse/pull/683 to handle this case too now.