mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Integrating MDS Streaming with HF Dataset Streaming #633

Open siddk opened 4 months ago

siddk commented 4 months ago

🚀 Feature Request

Hey folks - I've loved using streaming for some of my research in multimodal pretraining and robotics. One thing I'd love to support is first-class integration with HF Datasets (e.g., similar functionality to their WebDataset Streaming Integration).

I've created an issue on HF Datasets here, and @lhoestq seems receptive to the idea. At a low level, I'm not sure about the best way to implement this support. I'd love pointers, or to talk this through!

Motivation

Mosaic Streaming from MDS is fantastic for large-scale, reproducible pretraining! For some of my larger datasets, supporting the ability to stream MDS shards stored on HF Datasets while training would be fantastic.

Thanks!

snarayan21 commented 4 months ago

Hey, this would be great! What did you have in mind regarding the implementation -- what should be done on Streaming's side?

lhoestq commented 4 months ago

It would be nice to stream datasets from HF using Streaming, e.g. by supporting hf:// paths.

karan6181 commented 3 months ago

@lhoestq Would it be possible for the user to upload the MDS shard files to hf:// paths? Or is your ask to support the HF remote path with whatever underlying files it may contain, such as Parquet, JSONL, etc.?

lhoestq commented 3 months ago

At HF we want to make the Hub more open and support more data formats and libraries. We recently added support for WebDataset for example, and there are hundreds of datasets in WebDataset format on the HF Hub already.

Users can already upload data files in MDS format that they have locally using e.g. huggingface_hub. Maybe one day with the MDSWriter directly? That would be cool!

Anyway, what I think is most interesting is if Streaming could stream datasets in MDS format from HF (e.g. using hf:// paths). That would be useful to many researchers IMO.
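
For reference, the upload path mentioned above could look roughly like the sketch below. It assumes a local directory already written by MDSWriter, containing an `index.json` plus shard files whose names include `.mds`. The helper names `find_mds_files` and `upload_mds_dir` are hypothetical; `HfApi.create_repo` and `HfApi.upload_folder` are the huggingface_hub calls doing the work, and the upload requires being logged in to the Hub.

```python
import os
from typing import List


def find_mds_files(local_dir: str) -> List[str]:
    """Return index.json and shard files under local_dir as sorted relative paths.

    Assumption: shard files carry '.mds' in their name (optionally followed by
    a compression suffix); everything else is ignored.
    """
    found = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            if name == "index.json" or ".mds" in name:
                full = os.path.join(root, name)
                found.append(os.path.relpath(full, local_dir))
    return sorted(found)


def upload_mds_dir(local_dir: str, repo_id: str) -> None:
    """Upload a local MDS directory to a Hub dataset repo (hypothetical helper)."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import HfApi

    names = [os.path.basename(p) for p in find_mds_files(local_dir)]
    if "index.json" not in names:
        raise FileNotFoundError(f"No index.json found under {local_dir}")

    api = HfApi()
    # Create the dataset repo if it does not exist yet.
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    # Push the whole folder (index.json and shards) in one commit.
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="dataset")
```

Calling `upload_mds_dir("./my_mds_out", "my-username/my-mds-dataset")` (both values are placeholders) would push the directory as-is, so the repo layout mirrors the local layout.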

siddk commented 3 months ago

Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an fsspec API: https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system

From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding hf:// path as a drop-in replacement for an s3:// path?

Basically @karan6181 -- trying to figure out what "S3-compatible object store" really means under the hood vs. what the HF Hub natively supports.

lhoestq commented 3 months ago

Yes, that's correct!
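
To make the fsspec point concrete, here is a minimal sketch of reading an MDS `index.json` straight from a Hub dataset repo with `HfFileSystem`. The function names are hypothetical, and the index layout used here (a top-level `shards` list whose entries have `samples` and `raw_data.basename`) is an assumption about the MDS index format, so treat it as illustrative rather than authoritative.

```python
import json
from typing import Any, Dict, List


def shard_basenames(index: Dict[str, Any]) -> List[str]:
    """Return the raw shard file names listed in a parsed MDS index."""
    return [shard["raw_data"]["basename"] for shard in index["shards"]]


def total_samples(index: Dict[str, Any]) -> int:
    """Sum the per-shard sample counts recorded in a parsed MDS index."""
    return sum(shard["samples"] for shard in index["shards"])


def load_index_from_hub(repo_id: str, subdir: str = "") -> Dict[str, Any]:
    """Read index.json from a Hub dataset repo via fsspec (hypothetical helper)."""
    # Imported lazily so the pure helpers above have no third-party dependency.
    from huggingface_hub import HfFileSystem

    fs = HfFileSystem()
    # HfFileSystem addresses dataset repos as 'datasets/<user>/<repo>/<path>'.
    prefix = f"datasets/{repo_id}"
    if subdir.strip("/"):
        prefix = f"{prefix}/{subdir.strip('/')}"
    with fs.open(f"{prefix}/index.json", "r") as f:
        return json.load(f)
```

With a placeholder repo, `total_samples(load_index_from_hub("my-username/my-mds-dataset"))` would report the dataset size without downloading any shard.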

karan6181 commented 3 months ago

@siddk It appears that the HF hub functions primarily as a cloud storage solution, accessible via the hf:// prefix. Integrating HF hub support into the streaming dataset should be straightforward. Do you have the capacity to implement HF hub backend support in the streaming dataset? You can model your work on the structure outlined in the PRs at https://github.com/mosaicml/streaming/pull/311 and https://github.com/mosaicml/streaming/pull/256. Please let us know if you have any questions—we're here to assist you.
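
As a starting point for such a backend, a download function might look like the sketch below. This is not Streaming's actual API: the names `split_hf_uri` and `download_from_hf` and the `(remote, local)` signature are assumptions, and the sketch only handles `hf://datasets/<user>/<repo>/<path>` URIs without revisions. The only Hub call used is `HfFileSystem.get_file`, inherited from the fsspec filesystem interface.

```python
import os
from typing import Tuple

HF_DATASET_PREFIX = "hf://datasets/"


def split_hf_uri(remote: str) -> Tuple[str, str]:
    """Split 'hf://datasets/<user>/<repo>/<path>' into (repo_id, path_in_repo)."""
    if not remote.startswith(HF_DATASET_PREFIX):
        raise ValueError(f"Expected a URI starting with {HF_DATASET_PREFIX}: {remote}")
    parts = remote[len(HF_DATASET_PREFIX):].split("/", 2)
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"Expected hf://datasets/<user>/<repo>/<path>: {remote}")
    user, repo, path = parts
    return f"{user}/{repo}", path


def download_from_hf(remote: str, local: str) -> None:
    """Copy one file from a Hub dataset repo to a local path (hypothetical helper)."""
    # Imported lazily so URI parsing can be used without huggingface_hub installed.
    from huggingface_hub import HfFileSystem

    repo_id, path = split_hf_uri(remote)
    # Make sure the destination directory exists before writing.
    os.makedirs(os.path.dirname(local) or ".", exist_ok=True)
    fs = HfFileSystem()
    fs.get_file(f"datasets/{repo_id}/{path}", local)
```

Keeping the URI parsing separate from the network call makes the path handling easy to unit test, which is where most of the backend-specific edge cases tend to show up.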

siddk commented 3 months ago

Hey @karan6181 -- I'm a bit swamped with upcoming paper deadlines right now, but would love to see this supported. I can try carving out time to work on things in a few weeks, but wouldn't mind your expert take on this. I think the broader HF community would really appreciate it as well!