stanford-crfm / levanter

Legible, Scalable, Reproducible Foundation Models with Named Tensors and Jax
https://levanter.readthedocs.io/en/latest/
Apache License 2.0

Data Mixture training with RedPajama Dataset #442

Closed Ivan-Zhou closed 5 months ago

Ivan-Zhou commented 9 months ago

I have been trying to train Llama2 on RedPajama using Data Mixture for better randomization. Out of the box it doesn't work: the cluster keeps waiting for certain chunks, even though a few million docs have already been processed on every worker:

1442 2024-01-29T00:42:37 - 0 - levanter.data.shard_cache - shard_cache.py:1326 - WARNING :: Waiting for chunk 0 after 2442 seconds
1443 2024-01-29T00:42:47 - 0 - preprocessing.train - shard_cache.py:569 - INFO ::  done: Shards: 0 | Chunks: 413 | Docs: 1691648
1444 2024-01-29T00:42:47 - 0 - preprocessing.train - shard_cache.py:569 - INFO ::  done: Shards: 0 | Chunks: 324 | Docs: 1327104
1445 2024-01-29T00:42:55 - 0 - preprocessing.train - shard_cache.py:569 - INFO ::  done: Shards: 0 | Chunks: 325 | Docs: 1331200
1446 2024-01-29T00:42:57 - 0 - levanter.data.shard_cache - shard_cache.py:1326 - WARNING :: Waiting for chunk 0 after 2462 seconds

I experimented with different combinations of subsets and narrowed down the error:

The book dataset seems to be the blocker. The issue does not appear with RP-Wiki or RP-Stack; it only happens with RP-Book.

Then I kept experimenting:

The last job ran for 1K steps and looks smooth. This is what we are looking for with data mixture!

[Screenshot of training metrics, 2024-01-31]

I'm not sure what is wrong with the book dataset. I want to document my findings and then investigate.

Ivan-Zhou commented 9 months ago

This is my data config: https://github.com/stanford-crfm/levanter/blob/background-job/config/data/rpv1_llama.yaml
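For context, a mixture config of this kind pairs per-source dataset definitions with sampling weights. Below is a minimal sketch of what such a config might look like; the bucket paths, source names, and weights are placeholders for illustration, not the contents of the linked rpv1_llama.yaml.

```yaml
# Hypothetical sketch of a RedPajama mixture config (placeholder URLs and
# weights, not the actual rpv1_llama.yaml). Assumes a mixture-style data
# config with per-source definitions plus sampling weights.
data:
  tokenizer: "meta-llama/Llama-2-7b-hf"
  cache_dir: "gs://my-bucket/tokenized/rpv1"        # placeholder path
  configs:
    rp_wiki:
      train_urls:
        - "gs://my-bucket/redpajama/wiki/*.jsonl.gz"   # placeholder
    rp_book:
      train_urls:
        - "gs://my-bucket/redpajama/book/*.jsonl.gz"   # placeholder
  train_weights:
    rp_wiki: 0.5
    rp_book: 0.5
```

With a layout like this, dropping or re-weighting a single source (e.g. the book subset) is a one-line change, which is how the subset-by-subset isolation above can be carried out.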

dlwh commented 9 months ago

(try main again)