mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0
1.09k stars · 136 forks

Guide on using with HuggingFace accelerate/Trainer #789

Open shimizust opened 5 days ago

shimizust commented 5 days ago

🚀 Feature Request

Provide a guide on using StreamingDataset with HuggingFace accelerate and transformers.Trainer, if they are supported.

Motivation

First, thanks for the great work on this! I attended your session at the PyTorch Conference. I wanted to try this out, but I'm having trouble figuring out whether it is compatible with the HuggingFace ecosystem (e.g., accelerate for distributed training and the transformers Trainer), which is used pretty widely.

My understanding for HF-based training jobs is that a torch Dataset or IterableDataset is passed to the Trainer. If accelerate is available, it is used to prepare the dataloader for distributed training. In the IterableDataset case, data loading occurs only on process 0, which fetches the batches for all processes and broadcasts them to each one.

I'm not sure if this is compatible with how StreamingDataset works.
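To make the question concrete, here is a toy simulation of the dispatch behavior I described above. This is just my mental model of what accelerate does with an IterableDataset, not accelerate's actual code, and `dispatch_batches` is a made-up helper:

```python
# Toy model: rank 0 iterates the IterableDataset, fetches one global
# batch per step, and each rank receives its contiguous slice.
# Illustration only -- not accelerate's real implementation.

def dispatch_batches(stream, world_size, batch_size):
    """Yield, per step, the list of per-rank batches rank 0 would send."""
    it = iter(stream)
    while True:
        global_batch = []
        try:
            # Rank 0 fetches enough samples for every rank.
            for _ in range(world_size * batch_size):
                global_batch.append(next(it))
        except StopIteration:
            if not global_batch:
                return
        # "Broadcast": slice the global batch into one shard per rank.
        yield [global_batch[r * batch_size:(r + 1) * batch_size]
               for r in range(world_size)]

samples = range(8)
steps = list(dispatch_batches(samples, world_size=2, batch_size=2))
```

If this is roughly what happens, I'm unsure how it interacts with StreamingDataset, which (as I understand it) already shards samples per rank on its own.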

XiaohanZhangCMU commented 4 days ago

@shimizust thanks for putting up the request. I quickly read through accelerate's prepare_dataloader function, and it seems to support both iterable-style and map-style datasets, so StreamingDataset should work with accelerate.

If you use `accelerate launch` or `torchrun`, they will set the required environment variables for you. Have you tried that? What error do you see?
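For reference, these are the standard torch.distributed environment variables those launchers populate. You can also set them by hand for a quick single-process sanity check. Sketch only; the exact variable list is an assumption based on torch's usual `env://` conventions:

```python
# Manually set the env vars a launcher like torchrun would provide,
# useful for debugging a single process without a launcher.
# Assumption: these are the env:// variables the stack reads.
import os

single_process_env = {
    "RANK": "0",              # global rank of this process
    "WORLD_SIZE": "1",        # total number of processes
    "LOCAL_RANK": "0",        # rank of this process on its node
    "LOCAL_WORLD_SIZE": "1",  # number of processes on this node
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
}
os.environ.update(single_process_env)
```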

Feel free to post your findings or the error message here; happy to help with troubleshooting.