huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Mosaic Streaming (MDS) Support #6736

Open siddk opened 5 months ago

siddk commented 5 months ago

Feature request

I'm a huge fan of the current HF Datasets webdataset integration (especially the built-in streaming support). However, I'd love to upload some robotics and multimodal datasets I've processed for use with Mosaic Streaming, specifically their MDS Format.

Because the shard files have similar semantics to WebDataset, I'm hoping that adding such support won't be too much trouble?

Motivation

The main downsides of WebDataset are its lack of out-of-the-box determinism (which matters for large-scale training and reproducibility), the lack of easy job resumption, and no quick way to debug / visualize individual examples.

Mosaic Streaming provides a great interface for this out of the box, so I'd love to see it supported in HF Datasets.

Your contribution

Happy to help test things / provide example data. Can potentially submit a PR if maintainers could point me to the necessary WebDataset logic / steps for adding a new streaming format!

lhoestq commented 5 months ago

Hi ! that would be great :) Though note that datasets doesn't implement format-specific resuming when streaming, so in general I think it's better if users can use the mosaic-streaming library to read their MDS datasets. I wonder if they support hf:// paths though...

Anyway for those interested, the code for WebDataset is a single file here: https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/webdataset/webdataset.py.

It implements _split_generators, which downloads the files and returns the list of splits (train/validation/test), and _generate_examples, which generates examples (dicts) from the downloaded files. Streaming is automatically supported by making download steps lazy and by extending open() to work with remote URLs.
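
For anyone picking this up, here is a rough, dependency-free sketch of that two-step structure. It is not the real datasets builder API: ToyShardBuilder, its opener parameter, the stream helper, and the JSON Lines shard layout are all hypothetical and only illustrate how split resolution can stay lazy while example generation reads the shards. A real MDS loader would parse MDS shards instead of JSON Lines.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterator, List, Tuple


class ToyShardBuilder:
    """Hypothetical stand-in for a builder; NOT the real datasets API."""

    def __init__(self, data_files: Dict[str, List[str]], opener: Callable = open):
        self.data_files = data_files
        # Assumption: swapping in an opener that understands remote URLs
        # is what would make this "stream" instead of reading local files.
        self.opener = opener

    def _split_generators(self) -> Dict[str, List[str]]:
        # Step 1: map each split to its shard paths. No file is opened here,
        # so this step stays lazy.
        return {split: list(paths) for split, paths in self.data_files.items()}

    def _generate_examples(self, paths: List[str]) -> Iterator[Tuple[str, dict]]:
        # Step 2: open each shard only when iterated and yield (key, example).
        for shard_idx, path in enumerate(paths):
            with self.opener(path, "rb") as f:
                for line_idx, line in enumerate(f):
                    yield f"{shard_idx}_{line_idx}", json.loads(line)

    def stream(self, split: str) -> Iterator[Tuple[str, dict]]:
        yield from self._generate_examples(self._split_generators()[split])


def demo() -> List[Tuple[str, dict]]:
    # Write one tiny JSON Lines shard and read it back through the builder.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train-00000.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write(json.dumps({"text": "hello"}) + "\n")
            f.write(json.dumps({"text": "world"}) + "\n")
        builder = ToyShardBuilder({"train": [path]})
        return list(builder.stream("train"))
```

Calling demo() yields one (key, dict) pair per line of the shard. The point of the split is that only the second step is format-specific, so a new format mostly means replacing the body of _generate_examples.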