huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Memory overflow issue with long-context data using datatrove #204

Open justHungryMan opened 1 month ago

justHungryMan commented 1 month ago

I've been using datatrove to read .jsonl files and count tokens with token_counter on a local node. I'm running into an issue where the process gets killed due to a memory overflow when handling long-context data (such as books). It seems strange that simply counting tokens would cause such a problem, especially since my own token-counting script doesn't have any memory issues.
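For reference, a minimal sketch of the kind of pipeline being described (the report doesn't include code, so the data path, tokenizer, and task count below are assumptions):

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import TokensCounter

# Hypothetical reconstruction of the reported setup: read .jsonl files
# containing long documents and count tokens, all on a single local node.
executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/books/"),  # long-context .jsonl documents (assumed path)
        TokensCounter(),             # counts tokens per document
    ],
    tasks=8,                         # number of shards the input is split into (assumed)
    logging_dir="logs/token_count/",
)

if __name__ == "__main__":
    executor.run()
```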

I haven't completely nailed down the cause, but it looks like it might be related to executor/base.py keeping data in memory until all pipeline tasks (and therefore all data) have completed. Could this be the reason for the memory overflow? If so, do you have any suggestions on how to improve or work around this?

Thanks for any insights you can provide!

hynky1999 commented 1 month ago

I think it's connected to https://github.com/huggingface/datatrove/issues/161. Are you running locally using multiple tasks?
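For context, "multiple tasks" here refers to how LocalPipelineExecutor splits the input into tasks and runs some of them concurrently as worker processes. The sketch below only illustrates that distinction; the specific values and pipeline contents are assumptions, not taken from the thread:

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import TokensCounter

# Illustrative only: "tasks" is how many shards the input is split into,
# "workers" is how many of those tasks run at the same time as separate
# processes. Each concurrent worker holds its own documents in memory, so
# peak usage grows with the number of workers and the size of the documents.
executor = LocalPipelineExecutor(
    pipeline=[JsonlReader("data/books/"), TokensCounter()],
    tasks=16,
    workers=8,
)

if __name__ == "__main__":
    executor.run()
```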

justHungryMan commented 1 month ago

Yes. There are several scenarios, but the pipeline only reads the jsonl files and counts tokens.

My machine has 256GB of memory.

Only long-context data causes the OOM.