mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Update StreamingTextDataset to support truncation that emits multiple truncated items. #1363

Closed: LingxiaoShawn closed this issue 2 weeks ago

LingxiaoShawn commented 1 month ago

🚀 Feature Request

The current StreamingTextDataset truncates the text/tokens to max_seq_len and throws away everything beyond that point. Would it be possible to support truncating the text/tokens into multiple items, each of length max_seq_len? That way, when an input item is longer than max_seq_len, the extra tokens are not wasted. If this is not easy to support, could you briefly explain why?
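
For illustration, here is a minimal sketch of the splitting behavior being requested. The function name `chunk_tokens` and its signature are hypothetical, not part of llm-foundry: instead of keeping only the first max_seq_len tokens, a long sequence is cut into consecutive max_seq_len-sized pieces.

```python
def chunk_tokens(tokens, max_seq_len, drop_last=True):
    """Split one long token sequence into consecutive max_seq_len chunks.

    Instead of keeping only tokens[:max_seq_len] and discarding the rest,
    every full-length chunk is returned; the short tail is optionally dropped.
    """
    chunks = [
        tokens[i:i + max_seq_len]
        for i in range(0, len(tokens), max_seq_len)
    ]
    if drop_last and chunks and len(chunks[-1]) < max_seq_len:
        chunks = chunks[:-1]  # drop the ragged tail rather than pad it
    return chunks

# e.g. chunk_tokens(list(range(10)), 4) -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```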

Thank you!

dakinggg commented 1 month ago

It is expected that the dataset is pretokenized, and any concatenation/wrapping is done at that time. For example, the example script (https://github.com/mosaicml/llm-foundry/blob/54746bfbd82f3dba172c3c400cf1eb1799636792/llmfoundry/data/data.py#L149) has an option that controls whether leftover tokens are wrapped into the next sample or not.
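
For context, a minimal sketch of what such a concatenate-and-wrap pretokenization step can look like, assuming pre-tokenized input samples. This illustrates the idea rather than reproducing the actual llm-foundry implementation; the function and argument names here are assumptions.

```python
def concat_and_chunk(token_samples, max_seq_len, eos_token_id, no_wrap=False):
    """Concatenate pre-tokenized samples and emit fixed-length examples.

    Samples are joined into a running buffer, separated by EOS. Every time
    the buffer holds at least max_seq_len tokens, one training example is
    emitted. With no_wrap=True, leftover tokens that would spill into the
    next example are discarded instead of carried over.
    """
    buffer = []
    for tokens in token_samples:
        buffer.extend(tokens)
        buffer.append(eos_token_id)
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = [] if no_wrap else buffer[max_seq_len:]
```

With wrapping enabled, no tokens from long inputs are lost at pretokenization time, which is why the streaming dataset itself only needs simple truncation.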

dakinggg commented 2 weeks ago

Closing as inactive.