Closed Ivan-Zhou closed 5 months ago
I have been trying to train Llama2 on RedPajama using a data mixture for good randomization. Out of the box it doesn't work: the cluster keeps waiting for some chunks, even though a few million docs have already been processed on every worker:
2024-01-29T00:42:37 - 0 - levanter.data.shard_cache - shard_cache.py:1326 - WARNING :: Waiting for chunk 0 after 2442 seconds
2024-01-29T00:42:47 - 0 - preprocessing.train - shard_cache.py:569 - INFO :: done: Shards: 0 | Chunks: 413 | Docs: 1691648
2024-01-29T00:42:47 - 0 - preprocessing.train - shard_cache.py:569 - INFO :: done: Shards: 0 | Chunks: 324 | Docs: 1327104
2024-01-29T00:42:55 - 0 - preprocessing.train - shard_cache.py:569 - INFO :: done: Shards: 0 | Chunks: 325 | Docs: 1331200
2024-01-29T00:42:57 - 0 - levanter.data.shard_cache - shard_cache.py:1326 - WARNING :: Waiting for chunk 0 after 2462 seconds
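For anyone triaging a similar stall, here is a minimal sketch of a script that pulls the progress and stall numbers out of log lines like the ones above. The regular expressions are written against the message text shown in this log only, and `summarize_log` is a hypothetical helper, not part of Levanter.

```python
import re

# Matches progress lines such as
# "... INFO :: done: Shards: 0 | Chunks: 413 | Docs: 1691648"
PROGRESS_RE = re.compile(r"done: Shards: (\d+) \| Chunks: (\d+) \| Docs: (\d+)")
# Matches stall warnings such as
# "... WARNING :: Waiting for chunk 0 after 2442 seconds"
WAITING_RE = re.compile(r"Waiting for chunk (\d+) after (\d+) seconds")


def summarize_log(lines):
    """Hypothetical helper: return (progress, stalls) parsed from log lines.

    progress: list of (shards, chunks, docs) tuples, in log order.
    stalls:   dict mapping chunk index to the longest wait seen, in seconds.
    """
    progress = []
    stalls = {}
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            progress.append(tuple(int(g) for g in match.groups()))
            continue
        match = WAITING_RE.search(line)
        if match:
            chunk, seconds = int(match.group(1)), int(match.group(2))
            stalls[chunk] = max(seconds, stalls.get(chunk, 0))
    return progress, stalls


sample = [
    "2024-01-29T00:42:37 - 0 - levanter.data.shard_cache - shard_cache.py:1326 - WARNING :: Waiting for chunk 0 after 2442 seconds",
    "2024-01-29T00:42:47 - 0 - preprocessing.train - shard_cache.py:569 - INFO :: done: Shards: 0 | Chunks: 413 | Docs: 1691648",
    "2024-01-29T00:42:47 - 0 - preprocessing.train - shard_cache.py:569 - INFO :: done: Shards: 0 | Chunks: 324 | Docs: 1327104",
    "2024-01-29T00:42:55 - 0 - preprocessing.train - shard_cache.py:569 - INFO :: done: Shards: 0 | Chunks: 325 | Docs: 1331200",
    "2024-01-29T00:42:57 - 0 - levanter.data.shard_cache - shard_cache.py:1326 - WARNING :: Waiting for chunk 0 after 2462 seconds",
]

progress, stalls = summarize_log(sample)
print("progress:", progress)
print("stalls:", stalls)
```

On this excerpt the summary shows the pattern described above: chunk and doc counts keep growing, no progress line reports a finished shard, and the reader is still waiting for chunk 0.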
I experimented with different combinations of subsets and narrowed down the problem:
The book dataset seems to be the one blocker. The issue does not appear with RP-Wiki and RP-Stack; it only happens with RP-Book.
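The narrowing-down step can be sketched as a small loop that tries each subset on its own. This is only an illustration: `find_blocking_subsets`, `fake_run_ok`, and the subset names are hypothetical, and in practice `run_ok` would launch a short job on the given mixture and report whether it got past the cache-building stage.

```python
def find_blocking_subsets(subsets, run_ok):
    """Try each subset alone and return those for which run_ok reports failure."""
    return [name for name in subsets if not run_ok([name])]


def fake_run_ok(mixture):
    # Stand-in for a real short training run (assumption for illustration):
    # pretend that any mixture containing the book subset stalls.
    return "rp_book" not in mixture


blockers = find_blocking_subsets(["rp_wiki", "rp_stack", "rp_book"], fake_run_ok)
print(blockers)
```

Testing subsets one at a time only finds blockers that fail in isolation; a problem that appears only for certain combinations would need the combinations tested as well.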
Then I kept experimenting:
The last job ran for 1K steps, and it looks smooth and good. This is what we are looking for with a data mixture!
I'm not sure what is wrong with the book dataset. I want to document my findings here and then investigate.
This is my data config: https://github.com/stanford-crfm/levanter/blob/background-job/config/data/rpv1_llama.yaml
(try main again)