lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

How to combine with huggingface audio datasets? #1366


yuekaizhang commented 3 days ago
from datasets import load_dataset
ds = load_dataset(
    "speechcolab/gigaspeech",
    "xl",
    split="train",
    trust_remote_code=True,
    streaming=True,
)

As shown in the code snippet above, we can use the GigaSpeech dataset without downloading it to the local machine by setting streaming=True. I am interested in combining the Hugging Face streaming datasets feature with Lhotse functionality such as the dynamic samplers.
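Concretely, this is the Lhotse side I would like to feed from a streaming HF dataset (a minimal sketch; DynamicCutSampler and K2SpeechRecognitionDataset are existing Lhotse classes, while cuts stands in for a CutSet that such an adapter would have to produce):

from torch.utils.data import DataLoader
from lhotse.dataset import DynamicCutSampler, K2SpeechRecognitionDataset

# `cuts` is assumed to be a lazily-iterated CutSet backed by the HF stream.
sampler = DynamicCutSampler(cuts, max_duration=100.0, shuffle=True)
dloader = DataLoader(K2SpeechRecognitionDataset(), sampler=sampler, batch_size=None)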

I noticed there are features in Lhotse like

    >>> cuts = LazySharIterator({
    ...     "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl.gz"],
    ...     "recording": ["pipe:curl https://my.page/recording.000000.tar"],
    ... })

However, the Hugging Face datasets are stored in a different format, e.g.: https://huggingface.co/datasets/speechcolab/gigaspeech/blob/main/data/audio/m_files_additional/m_chunks_0000.tar.gz

I am looking for a way to integrate these two approaches effectively. Let me know if you need any further details!

pzelasko commented 1 day ago

Hi Yuekai,

It would be nice to have an HF dataset adapter for Lhotse. We may call it HFDatasetIterator. Since HF datasets don't provide a common schema, we need to support a user-defined mapping from the items in an HF example to the fields of a Lhotse Cut.
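For GigaSpeech, that mapping might look roughly like this (the HF column names come from the dataset card; the target field names on the Lhotse side are hypothetical until the schema is settled):

field_map = {
    "audio": "recording",         # HF audio (array + sampling_rate) -> Recording
    "text": "supervisions.text",  # transcript -> SupervisionSegment.text
    "segment_id": "id",           # HF example id -> Cut.id
}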

pseudo code:


from typing import Dict

class HFDatasetIterator:
    def __init__(self, *hf_dataset_or_args, field_map: Dict[str, str], **hf_kwargs) -> None:
        # Accept either an already-constructed HF Dataset,
        # or positional/keyword args to be forwarded to load_dataset().
        self.dataset = hf_dataset_or_args
        self.hf_kwargs = hf_kwargs
        self.field_map = field_map

    def __iter__(self):
        from datasets import Dataset, load_dataset

        if len(self.dataset) == 1 and isinstance(self.dataset[0], Dataset):
            dataset = self.dataset[0]
        else:
            dataset = load_dataset(*self.dataset, **self.hf_kwargs)

        for example in dataset:
            # create_cut / update_field are placeholders for the actual
            # conversion logic (e.g., building a Recording + MonoCut).
            cut = create_cut(example)
            for field in example:
                if field not in self.field_map:
                    continue  # skip fields the user didn't map
                update_field(cut, self.field_map[field], example[field])
            yield cut

then expose a new CutSet constructor:

from datasets import load_dataset
ds = load_dataset(...)
cuts = CutSet.from_huggingface(ds, field_map=field_map)

# alternatively
cuts = CutSet.from_huggingface("speechcolab/gigaspeech", "xl", split="train", ..., field_map=field_map)
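End to end, that would let a streaming HF dataset flow into the dynamic samplers, e.g. (a sketch built on the constructor proposed above and a user-supplied field_map):

from lhotse import CutSet
from lhotse.dataset import DynamicCutSampler

# streaming=True would be forwarded to load_dataset() via **hf_kwargs.
cuts = CutSet.from_huggingface(
    "speechcolab/gigaspeech", "xl",
    split="train", streaming=True,
    field_map=field_map,
)
sampler = DynamicCutSampler(cuts, max_duration=100.0)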

It would be good to reuse as much of the existing HF dataset integration in NeMo as possible, to simplify the later integration: https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb

Let me know if you'd like to contribute that, otherwise I'll try to find some time later.