mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0
3.84k stars 503 forks

Chunk file reads and tokenization for text to mds conversion #1240

Closed irenedea closed 1 month ago

irenedea commented 1 month ago

Read files and tokenize in 1MB chunks.

Addresses two issues:

  1. Tokenization is significantly slower on long sequences.
  2. Loading extremely large files into memory at once can cause OOMs.
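The idea in both fixes can be sketched as a streaming loop: read the file in fixed-size chunks and tokenize each chunk as it arrives, so memory use stays bounded and the tokenizer never sees one enormous string. This is an illustrative sketch, not the PR's actual code; `tokenize` here is a whitespace-split stand-in for a real tokenizer call, and it ignores the chunk-boundary handling a real implementation would need.

```python
# Sketch (assumption: not the PR's implementation): stream a text file in
# fixed-size chunks and tokenize each chunk, instead of loading the whole
# file into memory and tokenizing a single huge string.
import io

CHUNK_SIZE = 1024 * 1024  # 1MB, matching the chunk size described above


def tokenize(text: str) -> list[str]:
    # Stand-in for a real tokenizer (e.g. an HF tokenizer's encode call).
    return text.split()


def chunked_token_count(f: io.TextIOBase, chunk_size: int = CHUNK_SIZE) -> int:
    """Count tokens while only ever holding one chunk in memory."""
    total = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # empty string means EOF
            break
        total += len(tokenize(chunk))
    return total
```

A real version would also carry any partial token at the end of a chunk over to the next chunk, so tokens split across a boundary are not miscounted.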

Manual Tests

Convert files of specific sizes

With changes:

  * test-mds-conversion-500mb-rZDKAm: took 8 minutes
  * test-mds-conversion-baseline-5gb-HRj4NE (ignore the fact that baseline is in the name lol): took 1.5 hours

Without changes:

  * test-mds-conversion-baseline-500mb-QJy3gj: hanging; stopped after 2 days

Training 1 epoch with the small SEC filings dataset

Confirmed that the token counts are the same.

With changes: test-mds-conversion-mpt-7b-TRKKfd

Without changes: test-mds-conversion-baseline-mpt-7b-XTf6c2

Training 1 epoch with a 5.5MB file

Confirmed that the token counts are the same.

With changes: test-mds-conversion-mpt-7b-5mb-UY6HtO

Without changes: test-mds-conversion-baseline-mpt-7b-5mb-71friv

irenedea commented 1 month ago

@mvpatel2000 Tokenizing large strings all at once is really slow. HF tokenizers are typically optimized for shorter strings: https://github.com/huggingface/transformers/issues/25873#issuecomment-1701727782