Read files and tokenize in 1MB chunks.

Addresses two issues:
- Tokenization is significantly slower on long sequences.
- Loading extremely large files into memory at once can cause OOMs.
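The per-file loop now looks roughly like the sketch below. This is a minimal illustration, not the actual code: the function name, tokenizer choice, and boundary handling are assumptions; only the 1MB chunk size comes from this change.

```python
from transformers import AutoTokenizer

CHUNK_SIZE = 1 << 20  # 1MB of text per read, matching the chunk size above

def tokenize_file_in_chunks(path: str, tokenizer) -> list[int]:
    """Tokenize a file one chunk at a time instead of loading the
    whole file into memory and tokenizing one enormous string."""
    tokens: list[int] = []
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)  # text mode: up to CHUNK_SIZE characters
            if not chunk:
                break
            # Short inputs keep tokenization fast and memory bounded.
            # A real implementation must also avoid splitting a word at a
            # chunk edge, e.g. by carrying a trailing partial word into
            # the next chunk.
            tokens.extend(tokenizer(chunk)["input_ids"])
    return tokens

# Usage (tokenizer choice is illustrative; MPT-7B uses the GPT-NeoX tokenizer):
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
ids = tokenize_file_in_chunks("sec_filings.txt", tok)
```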
Manual Tests

Convert files of specific sizes

With changes:
- test-mds-conversion-500mb-rZDKAm: took 8 minutes
- test-mds-conversion-baseline-5gb-HRj4NE (ignore that "baseline" is in the name): took 1.5 hours

Without changes:
- test-mds-conversion-baseline-500mb-QJy3gj: hanging; stopped after 2 days
Training 1 epoch with the small SEC filings dataset

Confirmed that the token counts are the same (see the comparison sketch at the end of this section).
With changes: test-mds-conversion-mpt-7b-TRKKfd
Without changes: test-mds-conversion-baseline-mpt-7b-XTf6c2
Training 1 epoch with a 5.5MB file

Confirmed that the token counts are the same.
With changes: test-mds-conversion-mpt-7b-5mb-UY6HtO
Without changes: test-mds-conversion-baseline-mpt-7b-5mb-71friv
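The token-count checks above can be reproduced along these lines. This is a hedged sketch: it assumes the converted MDS samples store token IDs as raw int64 bytes under a "tokens" key (one common convention), and the local paths are placeholders.

```python
import numpy as np
from streaming import StreamingDataset  # mosaicml-streaming

def total_tokens(local_dir: str) -> int:
    """Sum token counts over every sample in a local MDS dataset."""
    dataset = StreamingDataset(local=local_dir, shuffle=False)
    return sum(
        len(np.frombuffer(sample["tokens"], dtype=np.int64))
        for sample in dataset
    )

# Placeholder paths for the two converted copies of the same data.
assert total_tokens("./mds-with-changes") == total_tokens("./mds-baseline")
```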