mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0
3.84k stars 503 forks

Chunk file reads and tokenization for text to mds conversion #1240

Closed irenedea closed 1 month ago

irenedea commented 1 month ago

Read files and tokenize in 1MB chunks.

Addresses two issues:

  1. Tokenization is significantly slower on long sequences.
  2. Loading extremely large files into memory at once can cause OOMs.
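The idea in both fixes can be sketched as a streaming loop: read the file in fixed-size chunks and tokenize each chunk as it arrives, so memory use stays bounded and the tokenizer never sees one enormous string. This is an illustrative sketch, not the PR's actual code; `tokenize` here is a whitespace-split stand-in for a real tokenizer call, and it ignores the chunk-boundary handling a real implementation would need.

```python
# Sketch (assumption: not the PR's implementation): stream a text file in
# fixed-size chunks and tokenize each chunk, instead of loading the whole
# file into memory and tokenizing a single huge string.
import io

CHUNK_SIZE = 1024 * 1024  # 1MB, matching the chunk size described above


def tokenize(text: str) -> list[str]:
    # Stand-in for a real tokenizer (e.g. an HF tokenizer's encode call).
    return text.split()


def chunked_token_count(f: io.TextIOBase, chunk_size: int = CHUNK_SIZE) -> int:
    """Count tokens while only ever holding one chunk in memory."""
    total = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # empty string means EOF
            break
        total += len(tokenize(chunk))
    return total
```

A real version would also carry any partial token at the end of a chunk over to the next chunk, so tokens split across a boundary are not miscounted.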

Manual Tests

Convert files of specific sizes

With changes:

  * test-mds-conversion-500mb-rZDKAm: took 8 minutes
  * test-mds-conversion-baseline-5gb-HRj4NE (ignore the fact that baseline is in the name lol): took 1.5 hours

Without changes:

  * test-mds-conversion-baseline-500mb-QJy3gj: hanging; stopped after 2 days

Training 1 epoch with the small SEC filings dataset

Confirmed that the token counts are the same.

With changes: test-mds-conversion-mpt-7b-TRKKfd

Without changes: test-mds-conversion-baseline-mpt-7b-XTf6c2

Training 1 epoch with a 5.5MB file

Confirmed that the token counts are the same.

With changes: test-mds-conversion-mpt-7b-5mb-UY6HtO

Without changes: test-mds-conversion-baseline-mpt-7b-5mb-71friv

irenedea commented 1 month ago

@mvpatel2000 Tokenizing large strings all at once is really slow. HF tokenizers are typically optimized for shorter strings: https://github.com/huggingface/transformers/issues/25873#issuecomment-1701727782