Open siddk opened 5 months ago
Hi! That would be great :) Though note that datasets doesn't implement format-specific resuming when streaming, so in general I think it's better if users can use the mosaic-streaming library to read their MDS datasets. I wonder if they support hf:// paths though...
Anyway for those interested, the code for WebDataset is a single file here: https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/webdataset/webdataset.py.
It implements _split_generators, which downloads files and returns the list of splits (train/validation/test), and _generate_examples, which generates examples (dicts) from the downloaded files. Streaming is automatically supported by making download steps lazy and by extending open() to work with remote URLs.
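For anyone who wants a feel for that two-step shape before opening the real file, here is a rough stdlib-only sketch. ToyShardBuilder and write_toy_shard are made-up names for illustration; this is not the actual datasets builder base class, and it only reads local tar shards (no lazy downloads, no remote open()). It assumes WebDataset-style member names of the form key.field, with all fields of a sample stored next to each other in the shard.

```python
import io
import tarfile
from typing import Dict, Iterator, List, Tuple


def write_toy_shard(path: str, samples: Dict[str, Dict[str, bytes]]) -> None:
    """Write a WebDataset-style tar shard with one member per '<key>.<field>'."""
    with tarfile.open(path, "w") as tar:
        for key, fields in samples.items():
            for field, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{field}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


class ToyShardBuilder:
    """Hypothetical stand-in for a packaged-module builder (illustration only)."""

    def __init__(self, data_files: Dict[str, List[str]]):
        # Maps a split name ("train", "validation", "test") to its shard paths.
        self.data_files = data_files

    def _split_generators(self) -> Dict[str, List[str]]:
        # Step 1: resolve the files for each split. A real builder would go
        # through a download manager here; this sketch just passes paths along.
        return {split: list(paths) for split, paths in self.data_files.items()}

    def _generate_examples(
        self, paths: List[str]
    ) -> Iterator[Tuple[str, Dict[str, bytes]]]:
        # Step 2: stream through each shard and group consecutive members
        # that share a key into one example dict.
        for path in paths:
            with tarfile.open(path, "r") as tar:
                current_key, example = None, {}
                for member in tar:
                    if not member.isfile():
                        continue
                    # Assumption: the key is everything before the first dot.
                    key, _, field = member.name.partition(".")
                    if current_key is not None and key != current_key:
                        yield current_key, example
                        example = {}
                    current_key = key
                    example[field] = tar.extractfile(member).read()
                if current_key is not None:
                    yield current_key, example
```

An MDS reader would presumably keep the same two methods and swap the tar-walking loop in step 2 for logic that decodes MDS shards, though that part is untested here.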
Feature request
I'm a huge fan of the current HF Datasets webdataset integration (especially the built-in streaming support). However, I'd love to upload some robotics and multimodal datasets I've processed for use with Mosaic Streaming, specifically their MDS Format. Because the shard files have similar semantics to WebDataset, I'm hoping that adding such support won't be too much trouble?
Motivation
The main downsides of WebDataset are its lack of out-of-the-box determinism (especially for large-scale training and reproducibility), of easy job resumption, and of a way to quickly debug / visualize individual examples.
Mosaic Streaming provides a great interface for this out of the box, so I'd love to see it supported in HF Datasets.
Your contribution
Happy to help test things / provide example data. Can potentially submit a PR if maintainers could point me to the necessary WebDataset logic / steps for adding a new streaming format!