sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License
286 stars, 32 forks

Program stuck (when "Loading cached shuffled indices for dataset at ...") #29

Open ccx06 opened 5 months ago

ccx06 commented 5 months ago

When I run the training code, the program gets stuck at the "Loading cached shuffled indices" step. How can I solve this?

bugs
xszheng2020 commented 2 months ago

Same issue here. Have you solved it? @ccx06 @sangmichaelxie

sangmichaelxie commented 2 months ago

Maybe there is some kind of OOM issue, or num_workers is set too high? Does this happen every time and on all datasets?

xszheng2020 commented 2 months ago

I set num_workers to 1. I also tried reducing RANDOM_BATCH_SIZE in dataloader.py, but that did not help.

It seems the issue is related to caching. This is the last log line before it hangs:

06/25/2024 03:50:31 - INFO - datasets.arrow_dataset - Caching indices mapping at ~/doremi/preprocessed/train/Pile-CC/66/cache-abce09a69c09a6c6.arrow
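
One way to narrow down where the process is actually blocked, rather than guessing between OOM and caching, is to have Python dump every thread's stack periodically while it hangs. The following is a minimal sketch using only the standard-library `faulthandler` module. It is not part of the DoReMi code; the helper names `watch_for_hang` and `stop_watching` and the 300-second interval are assumptions chosen for illustration.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=300.0, out=sys.stderr):
    """Dump the tracebacks of all threads to `out` every `timeout_s` seconds.

    `out` must be a real file object (it needs a file descriptor) and must
    stay open until stop_watching() is called.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def stop_watching():
    """Cancel the periodic dump started by watch_for_hang()."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `watch_for_hang()` near the top of the training entry point (hypothetical placement) and leaving the job running past the point where it stalls should print a stack trace showing which call is blocked, for example inside the `datasets` index caching or inside a dataloader worker. Those dumps are probably the most useful thing to paste into this issue.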