Open siddk opened 4 months ago
Hey, this would be great! What did you have in mind regarding the implementation -- what should be done on Streaming's side?
It would be nice to stream datasets from HF using Streaming, e.g. supporting hf:// paths
@lhoestq Would it be possible for the user to upload the MDS shard files in the hf:// paths? Or is your ask to support the HF remote path with whatever underlying files it can contain, such as Parquet, JSONL, etc?
At HF we want to make the Hub more open and support more data formats and libraries. We recently added support for WebDataset for example, and there are hundreds of datasets in WebDataset format on the HF Hub already.
Users can already upload data files in MDS format that they have locally using e.g. huggingface_hub. Maybe one day with the MDSWriter directly? That would be cool!
Anyway, what I think is the most interesting is if Streaming could stream datasets in MDS format from HF (e.g. using hf:// paths). That would be useful to many researchers IMO.
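For reference, a minimal sketch of the "upload what you have locally" route with huggingface_hub. The repo name in the usage note is hypothetical, and the `index.json` check assumes the directory was produced by an MDS writer that has finished writing. This is only one way to do it, not an official Streaming or Hub workflow.

```python
from pathlib import Path


def check_mds_dir(local_dir: str) -> Path:
    """Return the directory as a Path if it looks like a finished MDS dataset."""
    root = Path(local_dir)
    # Assumption: a complete MDS directory has an index.json next to its shards.
    if not (root / "index.json").is_file():
        raise FileNotFoundError(f"no index.json under {root}")
    return root


def upload_mds_dir(local_dir: str, repo_id: str) -> None:
    """Upload a local MDS directory to a dataset repo on the HF Hub."""
    # Imported lazily: needs `pip install huggingface_hub` and a logged-in token.
    from huggingface_hub import HfApi

    root = check_mds_dir(local_dir)
    api = HfApi()
    # Create the dataset repo if it does not exist yet.
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    # Push index.json and all shard files in one commit.
    api.upload_folder(folder_path=str(root), repo_id=repo_id, repo_type="dataset")
```

Usage would be something like `upload_mds_dir("./my_mds_out", "my-user/my-mds-dataset")`, where both names are placeholders.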
Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an fsspec API: https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding hf:// path as a drop-in replacement for an s3:// path?
Basically @karan6181 -- trying to figure out what "S3-compatible object store" really means under the hood vs. what the HF Hub natively supports.
Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an fsspec API: https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding hf:// path as a drop-in replacement for an s3:// path?
Yes, that's correct!
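A minimal sketch of what reading MDS metadata through the Hub's fsspec API could look like. It assumes the shards sit under a dataset repo, and the `index.json` layout used here (`shards[i].raw_data.basename`, optional `zip_data`) is my understanding of the MDS index format, so please verify it against your own index. It only shows access through `HfFileSystem`; it does not show native hf:// support inside Streaming.

```python
import json
import posixpath


def strip_hf_prefix(path: str) -> str:
    """Turn 'hf://datasets/user/repo/dir' into 'datasets/user/repo/dir'."""
    prefix = "hf://"
    if not path.startswith(prefix):
        raise ValueError(f"expected an hf:// path, got {path!r}")
    return path[len(prefix):].rstrip("/")


def shard_paths(index: dict, remote_dir: str) -> list[str]:
    """List the remote shard files named in an MDS index (assumed layout)."""
    paths = []
    for shard in index["shards"]:
        # Prefer the compressed file when the shard has one.
        info = shard.get("zip_data") or shard["raw_data"]
        paths.append(posixpath.join(remote_dir, info["basename"]))
    return paths


def list_remote_shards(hf_path: str) -> list[str]:
    """Read index.json over the Hub's fsspec API and list shard paths."""
    # Imported lazily: needs `pip install huggingface_hub` and network access.
    from huggingface_hub import HfFileSystem

    remote_dir = strip_hf_prefix(hf_path)
    fs = HfFileSystem()
    with fs.open(posixpath.join(remote_dir, "index.json"), "r") as f:
        index = json.load(f)
    return shard_paths(index, remote_dir)
```

For example, `list_remote_shards("hf://datasets/my-user/my-mds-dataset/train")` would list the shard files for a hypothetical repo and split.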
@siddk It appears that the HF hub functions primarily as a cloud storage solution, accessible via the hf:// prefix. Integrating HF hub support into the streaming dataset should be straightforward. Do you have the capacity to implement HF hub backend support in the streaming dataset? You can model your work on the structure outlined in the PRs at https://github.com/mosaicml/streaming/pull/311 and https://github.com/mosaicml/streaming/pull/256. Please let us know if you have any questions; we're here to assist you.
Hey @karan6181 -- I'm a bit swamped with upcoming paper deadlines right now, but would love to see this supported. I can try carving out time to work on things in a few weeks, but wouldn't mind your expert take on this. I think the broader HF community would really appreciate it as well!
🚀 Feature Request
Hey folks - I've loved using streaming for some of my research in multimodal pretraining and robotics. One thing I'd love to support is first-class integration with HF Datasets (e.g., similar functionality to their WebDataset Streaming Integration). I've created an issue on HF Datasets here, and @lhoestq seems receptive to the idea. At a low level, I'm not sure about the best way to implement this support. Would love pointers or to talk this through!
Motivation
Mosaic Streaming from MDS is fantastic for large-scale, reproducible pretraining! For some of my larger datasets, supporting the ability to stream MDS shards stored on HF Datasets while training would be fantastic.
Thanks!