lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0
944 stars, 216 forks

Exception when creating a DataLoader for the WenetSpeech dataset #470

Closed: drawfish closed this issue 2 years ago

drawfish commented 2 years ago

I generated recordings.jsonl.gz and supervisions.jsonl.gz from a Kaldi data directory with the command: lhotse kaldi import <kaldi-data-dir> <samplerate> <lhotse-data-dir>. I then created a CutSet from these two manifests with:

from lhotse import CutSet, load_manifest

# Load the manifests produced by the Kaldi import.
manifest_dir = self.args.data_dir / "wenetspeech-manifests"
recordings = load_manifest(manifest_dir / "recordings.jsonl.gz")
supervisions = load_manifest(manifest_dir / "supervisions.jsonl.gz")
# Keep supervisions between 2 s and 30 s, then create one cut per supervision.
cuts_train = (
    CutSet.from_manifests(recordings=recordings, supervisions=supervisions)
    .filter_supervisions(lambda s: 2.0 <= s.duration <= 30.0)
    .trim_to_supervisions(num_jobs=64)
)

I then construct the sampler and the DataLoader with:

train_sampler = SingleCutSampler(
    cuts_train,
    max_duration=self.args.max_duration,
    shuffle=self.args.shuffle,
)
train = K2SpeechRecognitionDataset(
    cut_transforms=[],
    input_transforms=[],
    return_cuts=self.args.return_cuts,
)
train_dl = DataLoader(
    train,
    sampler=train_sampler,
    batch_size=None,
    num_workers=self.args.num_workers,
    persistent_workers=False,
)

However, when I iterate through train_dl, an AssertionError is raised:

Traceback (most recent call last):
  File "conformer/data/asr_datamodule.py", line 338, in <module>
    AsrDataModule.add_arguments(parser)
  File "/miniconda3/envs/k2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/miniconda3/envs/k2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/miniconda3/envs/k2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/miniconda3/envs/k2/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/miniconda3/envs/k2/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/miniconda3/envs/k2/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 46, in fetch
    data = self.dataset[possibly_batched_index]
  File "/k2/lhotse/lhotse/dataset/speech_recognition.py", line 93, in __getitem__
    validate_for_asr(cuts)
  File "/k2/lhotse/lhotse/dataset/speech_recognition.py", line 191, in validate_for_asr
    assert supervision.start >= -tol, (
AssertionError: Supervisions starting before the cut are not supported for ASR (sup id: Y0000008176_KtzfOHuuzd8_S00691, cut id: 04167f41-aea7-40a4-a285-4d381dc37e79)

I then exported the CutSet to a jsonl.gz file and picked out the cut that contains the offending supervision id:

{
"id": "04167f41-aea7-40a4-a285-4d381dc37e79", 
"start": 2056.32, 
"duration": 2.72, 
"channel": 0, 
"supervisions": 
[
  {
    "id": "Y0000008176_KtzfOHuuzd8_S00691", 
    "recording_id": "Y0000008176_KtzfOHuuzd8", 
    "start": -3.16, 
    "duration": 3.2, 
    "channel": 0, 
    "text": "\u7ed9 \u6211\u4eec \u51fa\u5177 \u7684 \u4ea7\u54c1\u8d28\u91cf \u4fdd\u9669"
  }, 
  {
    "id": "Y0000008176_KtzfOHuuzd8_S00692", 
    "recording_id": "Y0000008176_KtzfOHuuzd8", 
    "start": 0.0, 
    "duration": 2.72, 
    "channel": 0, 
    "text": "\u8fd9 \u662f \u91d1 \u5b87 \u6807 \u4eca\u5929 \u521a \u529e \u4e0b\u6765 \u7684"
  }
], 
"recording": {
  "id": "Y0000008176_KtzfOHuuzd8", 
  "sources": 
    [{"type": "command", 
    "channels": [0], 
    "source": "sox -V1 /data/WenetSpeech/data/audio/train/youtube/B00031/Y0000008176_KtzfOHuuzd8.opus -t wav -r 8000 -b 16 -c 1 -"
    }], 
  "sampling_rate": 8000, 
  "num_samples": 21799840, 
  "duration": 2724.98
}
, "type": "MonoCut"
}

From the entry for supervision id "Y0000008176_KtzfOHuuzd8_S00691", we can see that its start time is negative (-3.16), which triggered the exception. The relevant lines of the segments file in the Kaldi data directory:

.....
Y0000008176_KtzfOHuuzd8_S00690  Y0000008176_KtzfOHuuzd8 2051.12 2052.84
Y0000008176_KtzfOHuuzd8_S00691  Y0000008176_KtzfOHuuzd8 2053.16 2056.36
Y0000008176_KtzfOHuuzd8_S00692  Y0000008176_KtzfOHuuzd8 2056.32 2059.04
Y0000008176_KtzfOHuuzd8_S00693  Y0000008176_KtzfOHuuzd8 2059.92 2061.2
.....

The corresponding line of recordings.jsonl.gz in the Lhotse data directory:

{"id": "Y0000008176_KtzfOHuuzd8", "sources": [{"type": "command", "channels": [0], "source": "sox -V1 /WenetSpeech/data/audio/train/youtube/B00031 Y0000008176_KtzfOHuuzd8.opus -t wav -r 8000 -b 16 -c 1 -"}], "sampling_rate": 8000, "num_samples": 21799840, "duration": 2724.98}

The corresponding lines of supervisions.jsonl.gz in the Lhotse data directory:

...
{"id": "Y0000008176_KtzfOHuuzd8_S00690", "recording_id": "Y0000008176_KtzfOHuuzd8", "start": 2051.12, "duration": 1.72, "channel": 0, "text": "\u8fd9 \u662f \u4fdd\u9669\u516c\u53f8"}
{"id": "Y0000008176_KtzfOHuuzd8_S00691", "recording_id": "Y0000008176_KtzfOHuuzd8", "start": 2053.16, "duration": 3.2, "channel": 0, "text": "\u7ed9 \u6211\u4eec \u51fa\u5177 \u7684 \u4ea7\u54c1\u8d28\u91cf \u4fdd\u9669"}
{"id": "Y0000008176_KtzfOHuuzd8_S00692", "recording_id": "Y0000008176_KtzfOHuuzd8", "start": 2056.32, "duration": 2.72, "channel": 0, "text": "\u8fd9 \u662f \u91d1 \u5b87 \u6807 \u4eca\u5929 \u521a \u529e \u4e0b\u6765 \u7684"}
{"id": "Y0000008176_KtzfOHuuzd8_S00693", "recording_id": "Y0000008176_KtzfOHuuzd8", "start": 2059.92, "duration": 1.28, "channel": 0, "text": "\u4f60\u4eec \u603b\u7f16 \u8bf4 \u4e86"}
...

My question is: how was this negative start time created, and how should I configure trim_to_supervisions to correct it?

pzelasko commented 2 years ago

The reason is that you have overlapping segments: in your segments file, S00691 ends at 2056.36 while S00692 starts at 2056.32, so they overlap by 0.04 s. See the explanation here: https://lhotse.readthedocs.io/en/latest/api.html#lhotse.cut.CutSet.trim_to_supervisions You want to pass keep_overlapping=False to trim_to_supervisions.
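
A minimal sketch of why the overlap matters, written in plain Python rather than with the Lhotse API. The helper trim_to_segments is hypothetical. It only imitates the behaviour described in this thread: each cut spans one supervision, and when keep_overlapping is true the cut also keeps any other supervision that overlaps that span, with start times expressed relative to the start of the cut. The segment times are copied from the segments file quoted earlier in this thread, with the recording prefix dropped from the ids for brevity.

```python
from typing import Dict, List, Tuple

# (segment id, start, end) in seconds, copied from the Kaldi segments file.
SEGMENTS: List[Tuple[str, float, float]] = [
    ("S00690", 2051.12, 2052.84),
    ("S00691", 2053.16, 2056.36),
    ("S00692", 2056.32, 2059.04),
    ("S00693", 2059.92, 2061.20),
]


def trim_to_segments(
    segments: List[Tuple[str, float, float]], keep_overlapping: bool
) -> Dict[str, List[Tuple[str, float]]]:
    """Hypothetical stand-in for trimming: build one cut per segment.

    Returns {cut id: [(supervision id, start relative to the cut), ...]}.
    """
    cuts: Dict[str, List[Tuple[str, float]]] = {}
    for cut_id, cut_start, cut_end in segments:
        kept = []
        for sup_id, sup_start, sup_end in segments:
            is_own = sup_id == cut_id
            # Two intervals overlap if each one starts before the other ends.
            overlaps = sup_start < cut_end and sup_end > cut_start
            if is_own or (keep_overlapping and overlaps):
                kept.append((sup_id, round(sup_start - cut_start, 2)))
        cuts[cut_id] = kept
    return cuts


with_overlap = trim_to_segments(SEGMENTS, keep_overlapping=True)
without_overlap = trim_to_segments(SEGMENTS, keep_overlapping=False)

# The cut built around S00692 also receives S00691, at a negative offset.
print(with_overlap["S00692"])
# With keep_overlapping=False only the cut's own supervision remains.
print(without_overlap["S00692"])
```

In the pipeline from the question, the equivalent change would be calling .trim_to_supervisions(keep_overlapping=False, num_jobs=64).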

pzelasko commented 2 years ago

BTW the negative time is an indication that a segment started before the start of the cut. It is useful if you’re explicitly trying to model overlapped speech and do something about it.
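
As an illustration of that convention, the sketch below scans a cut (represented as a plain dict shaped like the JSON dump earlier in this thread) for supervisions that start before the cut, and converts their relative start back to a time in the recording. The function name supervisions_before_cut and the default tolerance of 2e-3 seconds are assumptions made for this example; they are not taken from the Lhotse source.

```python
from typing import Any, Dict, List, Tuple


def supervisions_before_cut(
    cut: Dict[str, Any], tol: float = 2e-3  # assumed tolerance, for illustration only
) -> List[Tuple[str, float]]:
    """Return (supervision id, absolute start in the recording) for every
    supervision whose start, relative to the cut, is earlier than -tol."""
    return [
        (sup["id"], round(cut["start"] + sup["start"], 2))
        for sup in cut["supervisions"]
        if sup["start"] < -tol
    ]


# Fields copied from the cut dumped in the question (other fields omitted).
cut = {
    "id": "04167f41-aea7-40a4-a285-4d381dc37e79",
    "start": 2056.32,
    "duration": 2.72,
    "supervisions": [
        {"id": "Y0000008176_KtzfOHuuzd8_S00691", "start": -3.16, "duration": 3.2},
        {"id": "Y0000008176_KtzfOHuuzd8_S00692", "start": 0.0, "duration": 2.72},
    ],
}

offending = supervisions_before_cut(cut)
print(offending)
```

The absolute start recovered for S00691 is 2053.16, which matches its entry in supervisions.jsonl.gz: the supervision itself is unchanged, only its offset is expressed relative to a cut that begins 3.16 s later.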

drawfish commented 2 years ago

The problem has been fixed. Thanks~