lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0
936 stars, 214 forks

Problem with CutSet.from_manifests #1240

Open juliendespres opened 9 months ago

juliendespres commented 9 months ago

Hi, I'm having a problem with the from_manifests function in the CutSet class.

I've decomposed a CutSet manifest using the CutSet.decompose() function to obtain the 3 files "features", "recordings" and "supervisions", with the aim of modifying the "supervisions" file and then regenerating the CutSet file.

The problem occurs when I try to recompose these three files with the CutSet.from_manifests function. I get the following error:

    Traceback (most recent call last):
      File "local/recompose_manifest.py", line 97, in <module>
        main()
      File "local/recompose_manifest.py", line 86, in main
        cut_set = CutSet.from_manifests(recordings=recordings, supervisions=supervisions, features=features)
      File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/cut/set.py", line 352, in from_manifests
        return create_cut_set_eager(
      File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/cut/set.py", line 3003, in create_cut_set_eager
        recording=recordings[feats.recording_id] if rec_ok else None,
      File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/audio/recording_set.py", line 389, in __getitem__
        return next(
    StopIteration

This function works without a problem if I pass any two of the three files as parameters ("supervisions+features", "features+recordings", "supervisions+recordings").

Is it a bug, or is this function simply not designed for it?

If it is not designed for this, is there another way of regenerating the CutSet file without having to regenerate the features?

Thank you very much for your time.

pzelasko commented 9 months ago

I don't think decompose was ever tested in this way, although I would have expected it to work. I'm afraid I don't have enough time right now to look into it myself. Generally you should be able to create a CutSet from 2 components (e.g. features + supervisions) and then manually attach the third one (e.g. recordings) in a for loop. If you happen to find what the issue is, please share it with us.
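A minimal sketch of the "attach the third component in a for loop" idea, for illustration only. It is written against small stand-in dataclasses (FakeCut, FakeFeatures, FakeRecording are hypothetical names, not Lhotse classes) so it runs without Lhotse installed. It assumes each cut is a dataclass with a `recording` field and a `features.recording_id` attribute; whether this carries over unchanged to real Lhotse cuts has not been verified here.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Iterable, List, Optional


def attach_recordings(cuts: Iterable[Any], recordings_by_id: Dict[str, Any]) -> List[Any]:
    """Return copies of `cuts` with the matching recording attached.

    Assumption: every cut is a dataclass instance that exposes
    `features.recording_id` and has a `recording` field.
    """
    out = []
    for cut in cuts:
        rec_id = cut.features.recording_id
        rec = recordings_by_id.get(rec_id)
        if rec is None:
            # Fail with a readable message instead of a bare StopIteration.
            raise KeyError(f"No recording with id {rec_id!r} for cut {cut.id!r}")
        out.append(replace(cut, recording=rec))
    return out


# Hypothetical stand-ins, only so the sketch is runnable on its own.
@dataclass
class FakeFeatures:
    recording_id: str


@dataclass
class FakeRecording:
    id: str


@dataclass
class FakeCut:
    id: str
    features: FakeFeatures
    recording: Optional[FakeRecording] = None


recs = {"rec-1": FakeRecording("rec-1")}
cuts = [FakeCut("cut-1", FakeFeatures("rec-1"))]
fixed = attach_recordings(cuts, recs)

# With Lhotse installed, usage might look roughly like this (untested):
#   cuts = CutSet.from_manifests(features=features, supervisions=supervisions)
#   cuts = CutSet.from_cuts(attach_recordings(cuts, {r.id: r for r in recordings}))
```

The function copies each cut rather than mutating it, so the original two-component CutSet stays usable if something goes wrong part-way through.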

juliendespres commented 9 months ago

Thank you for your response. I'm not sufficiently proficient in Python to do this kind of trick, but I ended up easily replacing the content of the text tag in the jsonl manifest with a simple perl script.
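For readers who would rather stay in Python, the same workaround could be sketched with the standard library alone. This assumes the cut manifest is JSON Lines with one cut per line, and that each cut carries a "supervisions" list whose entries have a "text" field; the example record below is made up for illustration and is much smaller than a real manifest line.

```python
import json
from typing import Callable, Iterable, Iterator


def rewrite_supervision_text(lines: Iterable[str], fix: Callable[[str], str]) -> Iterator[str]:
    """Apply `fix` to every supervision text, leaving all other fields untouched."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        cut = json.loads(line)
        for sup in cut.get("supervisions", []):
            if sup.get("text") is not None:
                sup["text"] = fix(sup["text"])
        yield json.dumps(cut, ensure_ascii=False)


# Made-up, heavily shortened example record.
example = [
    '{"id": "cut-1", "supervisions": [{"id": "sup-1", "text": "hello , world"}], '
    '"features": {"recording_id": "rec-1"}}'
]
result = list(rewrite_supervision_text(example, lambda t: t.replace(" ,", ",")))
```

For a compressed manifest, the lines could be read with `gzip.open(path, "rt", encoding="utf-8")` and the results written back out the same way with mode "wt". The features entries are passed through as-is, so nothing needs to be recomputed.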

However, this feature seems to me to be essential to avoid having to regenerate features every time you change a comma in the supervision texts, and it would be interesting to be able to do this simply in future Lhotse developments.

pzelasko commented 9 months ago

Thanks, you're right. I'll keep the issue open for now.

RuABraun commented 8 months ago

I have the same issue. I'm doing this for the purpose of undoing trim_to_supervisions.

RuABraun commented 8 months ago

Seems to be because features doesn't have a recording_id (or anything else that knows what cut it was a part of).

pzelasko commented 8 months ago

Features does have a recording_id field. If you can provide some way to reproduce with a small dataset like yesno or mini Librispeech I can look into it.
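One check that might help narrow this down before building a reproduction: the traceback ends inside the recordings[feats.recording_id] lookup, so it may be worth confirming that every recording_id in the features manifest also appears as an id in the recordings manifest. The sketch below does this on raw JSON Lines text with the standard library only. It assumes the features manifest stores "recording_id" and the recordings manifest stores "id" at the top level of each line; the sample lines are invented for illustration.

```python
import json
from typing import Iterable, Set


def missing_recording_ids(feature_lines: Iterable[str], recording_lines: Iterable[str]) -> Set[str]:
    """Return recording ids referenced by features but absent from recordings."""
    known = {json.loads(line)["id"] for line in recording_lines if line.strip()}
    wanted = {json.loads(line)["recording_id"] for line in feature_lines if line.strip()}
    return wanted - known


# Invented sample data: "rec-2" has features but no recording entry.
feature_lines = [
    '{"recording_id": "rec-1", "start": 0.0, "duration": 1.5}',
    '{"recording_id": "rec-2", "start": 0.0, "duration": 2.0}',
]
recording_lines = ['{"id": "rec-1", "duration": 1.5}']
missing = missing_recording_ids(feature_lines, recording_lines)
```

An empty result would suggest the ids line up and the cause lies elsewhere; a non-empty one would point at the decomposed manifests themselves.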