Open yuekaizhang opened 3 days ago
Hi Yuekai,
It would be nice to have an HF dataset adapter for Lhotse. We may call it HFDatasetIterator. Since HF datasets don't share a common schema, we need to support a user-defined mapping from the items in an HF example to the fields of a Lhotse Cut.
pseudo code:
from typing import Dict

class HFDatasetIterator:
    def __init__(self, *hf_dataset_or_args, field_map: Dict[str, str], **hf_kwargs) -> None:
        self.dataset = hf_dataset_or_args
        self.hf_kwargs = hf_kwargs
        self.field_map = field_map

    def __iter__(self):
        from datasets import Dataset, IterableDataset, load_dataset
        # Accept an already-loaded dataset; IterableDataset covers streaming=True.
        if len(self.dataset) == 1 and isinstance(self.dataset[0], (Dataset, IterableDataset)):
            dataset = self.dataset[0]
        else:
            dataset = load_dataset(*self.dataset, **self.hf_kwargs)
        for example in dataset:
            # create_cut and update_field are placeholder helpers, still to be written.
            cut = create_cut(example)
            for field, tgt_field in self.field_map.items():
                update_field(cut, tgt_field, example[field])
            yield cut
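To make the field-mapping step concrete, here is a minimal, dependency-free sketch. It assumes HF examples behave like plain dicts and uses a plain dict as a stand-in for a Lhotse Cut. The class name FieldMappedIterator, the example field names, and the target names in field_map are all hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Iterable, Iterator


class FieldMappedIterator:
    """Lazily renames fields of dict-like examples according to field_map.

    The yielded dicts are only a stand-in for Lhotse Cut objects.
    """

    def __init__(self, examples: Iterable[Dict[str, Any]], field_map: Dict[str, str]) -> None:
        self.examples = examples
        self.field_map = field_map

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        for example in self.examples:
            record: Dict[str, Any] = {}
            # Copy only the mapped fields; unmapped source fields are dropped.
            for src_field, tgt_field in self.field_map.items():
                if src_field in example:
                    record[tgt_field] = example[src_field]
            yield record


# Hypothetical example resembling one row of a speech dataset.
examples = [
    {"segment_id": "utt-0", "text": "hello world", "begin_time": 0.0, "end_time": 1.5, "extra": 1},
]
field_map = {"segment_id": "id", "text": "supervision_text", "begin_time": "start", "end_time": "end"}
mapped = list(FieldMappedIterator(examples, field_map))
```

Iterating over field_map rather than over the example means that fields the user did not map are ignored instead of raising a KeyError.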
Then expose a new CutSet constructor:
from datasets import load_dataset
ds = load_dataset(...)
cuts = CutSet.from_huggingface(ds, field_map=field_map)
# alternatively
cuts = CutSet.from_huggingface("speechcolab/gigaspeech", "xl", split="train", ..., field_map=field_map)
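The two call forms above (an already-loaded dataset versus arguments forwarded to a loader) could be told apart roughly as follows. This is only a sketch: resolve_dataset and fake_loader are hypothetical names, and fake_loader stands in for datasets.load_dataset so the example runs without network access.

```python
from typing import Any, Callable, Dict, Tuple


def resolve_dataset(args: Tuple[Any, ...], kwargs: Dict[str, Any], loader: Callable[..., Any]) -> Any:
    """Return args[0] if it is already a dataset object, else call the loader."""
    # Assumption: a single non-string positional argument is a loaded dataset;
    # anything else (e.g. a repo name plus a config name) goes to the loader.
    if len(args) == 1 and not isinstance(args[0], str):
        return args[0]
    return loader(*args, **kwargs)


def fake_loader(path: str, name: str = None, split: str = None) -> list:
    # Stand-in for a real loader; it just records what it was asked for.
    return [{"source": path, "config": name, "split": split}]


preloaded = [{"text": "hi"}]
from_object = resolve_dataset((preloaded,), {}, fake_loader)
from_args = resolve_dataset(("speechcolab/gigaspeech", "xl"), {"split": "train"}, fake_loader)
```

Checking for a non-string argument instead of a specific class is a simplification made here for the sketch; a real implementation would more likely check against the HF dataset types.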
It would be good to reuse as much of the existing HF dataset integration in NeMo as possible, to simplify integrating this later: https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb
Let me know if you'd like to contribute that, otherwise I'll try to find some time later.
As shown in the code snippet above, we can utilize the GigaSpeech dataset without downloading it to local machines by setting streaming=True. I am interested in combining the Hugging Face streaming datasets feature with Lhotse functionalities, such as the Dynamic Sampler.
I noticed there are features in Lhotse like
However, the Hugging Face datasets are formatted differently, such as: https://huggingface.co/datasets/speechcolab/gigaspeech/blob/main/data/audio/m_files_additional/m_chunks_0000.tar.gz
I am looking for a way to integrate these two approaches effectively. Let me know if you need any further adjustments!
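As a rough illustration of why a lazily streamed dataset is enough for duration-based dynamic batching, the sketch below groups items from a generator into batches capped by total duration. It is a simplified stand-in and not Lhotse's sampler implementation; batch_by_duration, fake_stream, and the "duration" key are assumptions made for this example.

```python
from typing import Any, Dict, Iterable, Iterator, List


def batch_by_duration(stream: Iterable[Dict[str, Any]], max_duration: float) -> Iterator[List[Dict[str, Any]]]:
    """Greedily group streamed items so each batch's total duration stays within max_duration.

    An item longer than max_duration still forms a batch of its own.
    """
    batch: List[Dict[str, Any]] = []
    total = 0.0
    for item in stream:
        duration = item["duration"]
        # Close the current batch if adding this item would exceed the budget.
        if batch and total + duration > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(item)
        total += duration
    if batch:
        yield batch


def fake_stream() -> Iterator[Dict[str, Any]]:
    # Stand-in for a streaming dataset: items are produced one at a time.
    for i, duration in enumerate([4.0, 3.0, 5.0, 2.0, 9.0]):
        yield {"id": f"utt-{i}", "duration": duration}


batches = [[item["id"] for item in b] for b in batch_by_duration(fake_stream(), max_duration=8.0)]
```

The batching loop only ever holds the current batch in memory, so it never needs the dataset length or random access, which is what makes it compatible with a streamed source.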