stanford-crfm / levanter

Legible, Scalable, Reproducible Foundation Models with Named Tensors and Jax
https://levanter.readthedocs.io/en/latest/
Apache License 2.0

Data Mixture training with RedPajama Dataset #442

Closed Ivan-Zhou closed 5 months ago

Ivan-Zhou commented 9 months ago

I have been trying to train Llama2 on RedPajama using Data Mixture for better randomization. Out of the box it doesn't work: the cluster keeps waiting for certain chunks, even though a few million docs have already been processed on every worker:

1442 2024-01-29T00:42:37 - 0 - levanter.data.shard_cache - shard_cache.py:1326 - WARNING :: Waiting for chunk 0 after 2442 seconds
1443 2024-01-29T00:42:47 - 0 - preprocessing.train - shard_cache.py:569 - INFO ::  done: Shards: 0 | Chunks: 413 | Docs: 1691648
1444 2024-01-29T00:42:47 - 0 - preprocessing.train - shard_cache.py:569 - INFO ::  done: Shards: 0 | Chunks: 324 | Docs: 1327104
1445 2024-01-29T00:42:55 - 0 - preprocessing.train - shard_cache.py:569 - INFO ::  done: Shards: 0 | Chunks: 325 | Docs: 1331200
1446 2024-01-29T00:42:57 - 0 - levanter.data.shard_cache - shard_cache.py:1326 - WARNING :: Waiting for chunk 0 after 2462 seconds

I experimented with different combinations of subsets and narrowed down the error:

The book dataset seems to be the blocker. The issue does not appear with RP-Wiki or RP-Stack; it only happens with RP-Book.

Then I kept experimenting:

The last job ran for 1K steps and looks smooth. This is what we are looking for with data mixture!

[Screenshot of training metrics, 2024-01-31]

I'm not sure what is wrong with the book dataset. I want to document my findings and then investigate.

Ivan-Zhou commented 9 months ago

This is my data config: https://github.com/stanford-crfm/levanter/blob/background-job/config/data/rpv1_llama.yaml
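For context, a mixture config of this kind pairs per-source dataset definitions with sampling weights. Below is a minimal sketch of what such a config might look like; the bucket paths, source names, and weights are placeholders for illustration, not the contents of the linked rpv1_llama.yaml.

```yaml
# Hypothetical sketch of a RedPajama mixture config (placeholder URLs and
# weights, not the actual rpv1_llama.yaml). Assumes a mixture-style data
# config with per-source definitions plus sampling weights.
data:
  tokenizer: "meta-llama/Llama-2-7b-hf"
  cache_dir: "gs://my-bucket/tokenized/rpv1"        # placeholder path
  configs:
    rp_wiki:
      train_urls:
        - "gs://my-bucket/redpajama/wiki/*.jsonl.gz"   # placeholder
    rp_book:
      train_urls:
        - "gs://my-bucket/redpajama/book/*.jsonl.gz"   # placeholder
  train_weights:
    rp_wiki: 0.5
    rp_book: 0.5
```

With a layout like this, dropping or re-weighting a single source (e.g. the book subset) is a one-line change, which is how the subset-by-subset isolation above can be carried out.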

dlwh commented 9 months ago

(try main again)