lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Allow duplicate cut IDs in a CutSet (CutSet is list-like instead of dict-like) #1279

Closed pzelasko closed 5 months ago

pzelasko commented 5 months ago

Despite the title, this is not a breaking API change.

CutSet (and other sets) have supported both int- and str-based indexing almost since the start, but int-based indexing was inefficient (it just iterated over the dict). Position-based indexing is now more efficient, since it's the main usage pattern I've observed.
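To illustrate the trade-off, here is a minimal, self-contained sketch that does not use lhotse itself. The `Cut`, `DictBackedSet`, and `ListBackedSet` classes are hypothetical stand-ins, not the library's actual implementation. In particular, resolving a str ID by linear scan in the list-backed variant is an assumption made for illustration.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Dict, Iterable, List, Union


@dataclass
class Cut:
    """Hypothetical stand-in for a cut; only the ID matters here."""
    id: str


class DictBackedSet:
    """Dict-like storage: fast ID lookup, slow positional lookup."""

    def __init__(self, cuts: Iterable[Cut]) -> None:
        # Keys must be unique, so a repeated ID overwrites the earlier entry.
        self.cuts: Dict[str, Cut] = {c.id: c for c in cuts}

    def __getitem__(self, key: Union[int, str]) -> Cut:
        if isinstance(key, str):
            return self.cuts[key]  # O(1) hash lookup
        # Positional access has to walk the dict values: O(n).
        return next(islice(self.cuts.values(), key, None))


class ListBackedSet:
    """List-like storage: fast positional lookup, duplicate IDs allowed."""

    def __init__(self, cuts: Iterable[Cut]) -> None:
        self.cuts: List[Cut] = list(cuts)

    def __getitem__(self, key: Union[int, str]) -> Cut:
        if isinstance(key, int):
            return self.cuts[key]  # O(1) positional access
        # Assumed behavior for this sketch: scan and return the first match.
        for cut in self.cuts:
            if cut.id == key:
                return cut
        raise KeyError(key)
```

With `cuts = [Cut("a"), Cut("b"), Cut("a")]`, the list-backed variant keeps all three items, while the dict-backed one silently collapses them to two unless it checks for duplicates up front.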

In addition, this change allows duplicate cut IDs in a CutSet. Ever since we introduced lazy manifests, duplicate IDs were actually implicitly allowed, but only as long as the manifest was lazy, since we only checked for them in eager manifests. I've now seen several use cases where duplicate IDs are either expected or at least not harmful; this pops up quite often in anything involving infinite cut sets. I'd rather have duplication checks be explicit in the cases where they are required.
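For code that still depends on unique IDs, an explicit check could look something like the sketch below. The helper names `find_duplicate_ids` and `assert_unique_ids` are hypothetical and not part of lhotse. The sketch only assumes that each item exposes an `id` attribute and that the input is finite (it would never return on an infinite cut set).

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable, List


def find_duplicate_ids(items: Iterable[Any]) -> List[str]:
    """Return the sorted IDs that occur more than once in ``items``."""
    counts = Counter(item.id for item in items)
    return sorted(id_ for id_, n in counts.items() if n > 1)


def assert_unique_ids(items: Iterable[Any]) -> None:
    """Raise ValueError if any ID is repeated; do nothing otherwise."""
    duplicates = find_duplicate_ids(items)
    if duplicates:
        raise ValueError(f"Duplicate IDs found: {duplicates}")


# Usage with stand-in objects; anything with an ``id`` attribute works.
example = [SimpleNamespace(id="a"), SimpleNamespace(id="b"), SimpleNamespace(id="a")]
duplicates = find_duplicate_ids(example)
```

Calling this only where uniqueness matters keeps the cost, a full pass over the data, out of the common path.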