lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Duplicate manifest id after mixing #1265

MarcoMultichannel closed this issue 5 months ago

MarcoMultichannel commented 5 months ago

Hello. With lhotse 1.18 everything works fine, but since 1.19 I hit a problem while running the zipformer recipe in icefall, specifically during the mixing phase with Musan samples.

An assertion fails with the following output:

batch = train_dl.dataset[cuts]
  File "/home/marco/icefall_test/.venv/lib/python3.10/site-packages/lhotse/dataset/speech_recognition.py", line 109, in __getitem__
    cuts = tnfm(cuts)
  File "/home/marco/icefall_test/.venv/lib/python3.10/site-packages/lhotse/dataset/cut_transforms/mix.py", line 70, in __call__
    ).to_eager()
  File "/home/marco/icefall_test/.venv/lib/python3.10/site-packages/lhotse/serialization.py", line 380, in to_eager
    return cls.from_items(self)
  File "/home/marco/icefall_test/.venv/lib/python3.10/site-packages/lhotse/cut/set.py", line 310, in from_cuts
    return CutSet(cuts=index_by_id_and_check(cuts))
  File "/home/marco/icefall_test/.venv/lib/python3.10/site-packages/lhotse/utils.py", line 710, in index_by_id_and_check
    assert m.id not in id2man, f"Duplicated manifest ID: {m.id}"
AssertionError: Duplicated manifest ID: <MANIFEST_ID>

The problem isn't related to my manifests, since the same manifests work with lhotse 1.18.

This is the code where the transform is added:

cuts_musan = load_manifest(self.args.manifest_dir / "musan_cuts.jsonl.gz")
transforms.append(CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True))

Setting preserve_id to False also solves the issue.
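For readers following along, here is a minimal, lhotse-free sketch of the kind of check that fails in the traceback above. Only the assertion line is taken from the traceback; the rest of the function body is a reconstruction, and `Item` and `duplicate_ids` are hypothetical names introduced for illustration, not part of lhotse. It may help to see which IDs collide before the assertion is hit.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Item:
    """Hypothetical stand-in for any manifest object that has an `id`."""
    id: str


def index_by_id_and_check(manifests: Iterable[Item]) -> Dict[str, Item]:
    """Sketch of an ID-uniqueness check, reconstructed from the assertion
    shown in the traceback (an assumption, not lhotse's actual source)."""
    id2man: Dict[str, Item] = {}
    for m in manifests:
        assert m.id not in id2man, f"Duplicated manifest ID: {m.id}"
        id2man[m.id] = m
    return id2man


def duplicate_ids(manifests: Iterable[Item]) -> List[str]:
    """Hypothetical diagnostic: list every ID that occurs more than once."""
    counts = Counter(m.id for m in manifests)
    return [item_id for item_id, n in counts.items() if n > 1]


if __name__ == "__main__":
    batch = [Item("cut-1"), Item("cut-2"), Item("cut-1")]
    print(duplicate_ids(batch))  # IDs that would trip the assertion
```

Running a diagnostic like `duplicate_ids` on the cuts of a failing batch should show whether two cuts ended up with the same ID after mixing, which is consistent with the observation that `preserve_id=False` avoids the error.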

pzelasko commented 5 months ago

Thanks for reporting. This is the same issue as #1267, which is now resolved via #1268.

Mahaotian1 commented 5 months ago

I ran into this problem just now:

assert m.id not in id2man, f"Duplicated manifest ID: {m.id}"
AssertionError: Duplicated manifest ID: roots_29_morris_0109-8426

But I have a question: it happened while I was training the 8th epoch, and I have not changed anything in the cutset or the h5 file. Why did it happen suddenly? My lhotse version is "1.20.0.dev+git.b3373c0.clean".

pzelasko commented 5 months ago

I just merged it ~1h ago, so you’d need to pip uninstall lhotse and then pip install git+https://github.com/lhotse-speech/lhotse to get this fix. I intend to release a new version of lhotse to pip with the fix soon.
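After reinstalling, it can be useful to confirm which build Python actually picks up. The following is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, and the version string it prints is whatever the installed package reports, so it does not by itself prove that the fix is included.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # For a git install this is typically a dev version string that
    # embeds a commit hash, like the one quoted earlier in this thread.
    print(installed_version("lhotse"))
```

If the printed version still shows the old commit hash, the previous installation was probably not removed before reinstalling.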

Mahaotian1 commented 5 months ago

> I just merged it ~1h ago, so you’d need to pip uninstall lhotse and then pip install git+https://github.com/lhotse-speech/lhotse to get this fix. I intend to release a new version of lhotse to pip with the fix soon.

Is that the same issue as the one I described above?