huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[Feature Request] Support Data Streaming for faster training of large models #45

Open chagri opened 8 months ago

chagri commented 8 months ago

Training large models with tens or hundreds of billions of parameters requires several optimizations. One of the key aspects is to tokenize the data in advance and then stream it during training, similar to what MosaicML does with their Streaming and Composer libraries.

Request: either integrate with MosaicML Streaming (open source) or with HF data streaming, so that data can be tokenized in advance and then streamed for training LLMs. This minimizes GPU time spent waiting on data preparation at training time and reduces potential failures.
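For illustration, a rough sketch of that tokenize-once-then-stream pattern with the open-source `streaming` package (the bucket path, tokenizer, and documents below are placeholders, not an actual nanotron integration):

```python
import numpy as np
from streaming import MDSWriter, StreamingDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

# 1) Tokenize once, offline, and write MDS shards to object storage.
with MDSWriter(out="s3://my-bucket/tokenized-corpus",  # placeholder path
               columns={"tokens": "bytes"},
               compression="zstd") as writer:
    for text in ["example document one", "example document two"]:
        ids = np.array(tokenizer(text)["input_ids"], dtype=np.int32)
        writer.write({"tokens": ids.tobytes()})

# 2) During training, stream the pre-tokenized shards; only a small local
#    cache is kept on disk and samples are fetched as they are consumed.
train_set = StreamingDataset(remote="s3://my-bucket/tokenized-corpus",
                             local="/tmp/streaming-cache",
                             shuffle=True)
for sample in train_set:
    ids = np.frombuffer(sample["tokens"], dtype=np.int32)
```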

NouamaneTazi commented 8 months ago

Thank you @chagri for the detailed issue! But I don't see how tokenizing data would be the bottleneck during training?

I agree though that it'd be great to support data streaming from the cloud to avoid storing huge datasets locally. Should be supported soon :)
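For reference, the generic pattern with 🤗 datasets streaming looks roughly like this (the dataset name is just a placeholder, not what nanotron will ship with):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: shards are fetched from the Hub
# (or cloud storage) lazily, so the full dataset never lands on local disk.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

for sample in ds.take(2):  # only the first shard(s) are actually downloaded
    print(sample["text"][:80])
```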

haeggee commented 7 months ago

Hey! I'm jumping on this issue because I had a similar question :) In short: data streaming from local files is not supported either, right? Essentially, the dataset is tokenized once here and stored in cache files. In certain cases, it might be quite wasteful to store both the raw dataset and the tokenized version -- do you have such local streaming (with tokenization on the fly) on your roadmap?
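Concretely, I mean something along these lines (the file pattern and tokenizer are just placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder

# Stream the raw local files directly; no tokenized copy is written to the cache.
raw = load_dataset("json", data_files="data/*.jsonl", split="train", streaming=True)

def tokenize(batch):
    return tokenizer(batch["text"])

# map() on a streamed dataset runs on the fly as samples are consumed.
tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

for sample in tokenized:
    ...  # feed sample["input_ids"] to the training loop
```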