Open chagri opened 8 months ago
Thank you @chagri for the detailed issue! But I don't see how tokenizing data would be the bottleneck during training?
I agree though that it'd be great to support data streaming from the cloud to avoid storing huge datasets locally. Should be supported soon :)
Hey! I'm jumping on this issue because I had a similar question :) In short: data streaming from local files is not supported either, right? Essentially, the dataset is tokenized once here and stored in cache files. In certain cases, it might be quite wasteful to both store the raw dataset + the tokenized version -- do you have such local streaming (with tokenization on the fly) on your roadmap?
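For what it's worth, the on-the-fly approach I mean can be sketched in plain Python: read the raw files lazily and tokenize each sample as it is consumed, so no tokenized copy ever hits disk. The `fake_tokenize` function below is a hypothetical stand-in for a real tokenizer (e.g. a Hugging Face one), purely for illustration:

```python
from typing import Iterator, List

def fake_tokenize(text: str) -> List[int]:
    # Hypothetical whitespace "tokenizer" standing in for a real one
    # (e.g. a Hugging Face tokenizer) -- an assumption for illustration.
    return [hash(w) % 50000 for w in text.split()]

def stream_tokenized(paths: List[str]) -> Iterator[List[int]]:
    # Read raw text files lazily and tokenize each line on the fly,
    # so the tokenized dataset is never materialized on disk.
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield fake_tokenize(line)
```

A training loop would then just iterate `stream_tokenized([...])`, trading a bit of CPU per step for the disk space of the cached tokenized dataset.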
Training large models with tens or hundreds of billions of parameters requires several optimizations. One of the key aspects is tokenizing data in advance and then streaming it during training, similar to what MosaicML does with their Streaming and Composer libraries.
Request: integrate with either MosaicML Streaming (open source) or HF data streaming so data can be tokenized in advance and then streamed for training LLMs. This minimizes GPU training time and potential failures.
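To make the tokenize-in-advance pattern concrete, here is a minimal stdlib-only sketch of the idea behind shard-based streaming (the file layout and `shard_size` are assumptions for illustration, not MosaicML's actual format): a one-time preprocessing pass writes tokenized samples into small shard files, and the training side streams them back one sample at a time.

```python
import json
import os
from typing import Iterator, List

def write_shards(samples: List[List[int]], out_dir: str, shard_size: int = 1024) -> None:
    # One-time preprocessing: persist already-tokenized samples in
    # fixed-size shard files so training never re-tokenizes anything.
    os.makedirs(out_dir, exist_ok=True)
    for i in range(0, len(samples), shard_size):
        shard_path = os.path.join(out_dir, f"shard_{i // shard_size:05d}.jsonl")
        with open(shard_path, "w", encoding="utf-8") as f:
            for sample in samples[i:i + shard_size]:
                f.write(json.dumps(sample) + "\n")

def stream_shards(out_dir: str) -> Iterator[List[int]]:
    # Training-time side: iterate shards in order, yielding one
    # tokenized sample at a time without loading everything in memory.
    for name in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, name), encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)
```

The real libraries add a lot on top of this (compression, shuffling across shards, resumption, fetching shards from cloud storage), but the split between a preprocessing writer and a streaming reader is the core of why run-time GPU cost goes down.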