lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

fix_manifests function cost much time #1281

Closed: xiangxyq closed this issue 7 months ago

xiangxyq commented 8 months ago

Hi, I prepared my own data with tag 1.19.0 and it worked fine.

After updating the code to tag 1.20.0, however, preparing the same data takes far too long. I checked the code and the slowdown is caused by the fix_manifests function, but I don't know how to fix it.

Part of my code:

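from concurrent.futures import ProcessPoolExecutor

from lhotse import RecordingSet, SupervisionSet
from lhotse.qa import fix_manifests, validate_recordings_and_supervisions
from tqdm import tqdm

# parse_utterance, raw_manifests, num_jobs, and manifests are defined
# elsewhere in the script (not shown here).

# Parse utterances in parallel, collecting one recording and one
# supervision segment per utterance.
with ProcessPoolExecutor(num_jobs) as ex:
    for (recording, segment) in tqdm(
        ex.map(
            parse_utterance,
            raw_manifests
        ),
        desc="Processing Corpus",
    ):
        manifests["recordings"].append(recording)
        manifests["supervisions"].append(segment)

# This is the step that became slow after updating to 1.20.0.
recordings, supervisions = fix_manifests(
    recordings=RecordingSet.from_recordings(manifests["recordings"]),
    supervisions=SupervisionSet.from_segments(manifests["supervisions"]),
)
validate_recordings_and_supervisions(
    recordings=recordings, supervisions=supervisions
)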
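
One way to confirm which step is slow is to time each call separately. The following is only a minimal sketch and not part of the original script: `timed` is a hypothetical helper, and the two `sum(...)` calls are stand-in workloads that would be replaced by the real `fix_manifests(...)` and `validate_recordings_and_supervisions(...)` calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workloads (assumption): in the real script these blocks would wrap
# fix_manifests(...) and validate_recordings_and_supervisions(...).
with timed("fix_manifests", timings):
    sum(range(100_000))
with timed("validate", timings):
    sum(range(1_000))

# Report how long each step took.
for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Running the same measurement under both 1.19.0 and 1.20.0 would show whether the difference really comes from the `fix_manifests` step.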

Thanks

Keith-Hon commented 8 months ago

same issue

pzelasko commented 7 months ago

Will try to look into it tomorrow

pzelasko commented 7 months ago

Please try again with PR https://github.com/lhotse-speech/lhotse/pull/1284