huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Memory overflow issue with long-context data using datatrove #204

Open justHungryMan opened 1 month ago

justHungryMan commented 1 month ago

I've been using datatrove to read .jsonl files and count tokens with token_counter on a local node. I'm running into an issue where the process gets killed due to a memory overflow when handling long-context data (such as books). It seems strange that simply counting tokens would cause such a problem, especially since my own token-counting script doesn't have any memory issues.
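For reference, a minimal sketch of the kind of pipeline being described (the report doesn't include code, so the data path, tokenizer, and task count below are assumptions):

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import TokensCounter

# Hypothetical reconstruction of the reported setup: read .jsonl files
# containing long documents and count tokens, all on a single local node.
executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/books/"),  # long-context .jsonl documents (assumed path)
        TokensCounter(),             # counts tokens per document
    ],
    tasks=8,                         # number of shards the input is split into (assumed)
    logging_dir="logs/token_count/",
)

if __name__ == "__main__":
    executor.run()
```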

I haven't completely nailed down the cause, but it looks like it might be related to executor/base.py keeping data in memory until all pipeline tasks (and therefore all data) have completed. Could this be the reason for the memory overflow? If so, do you have any suggestions on how to improve or work around this?

Thanks for any insights you can provide!

hynky1999 commented 1 month ago

I think it's connected to https://github.com/huggingface/datatrove/issues/161. Are you running locally using multiple tasks?
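For context, "multiple tasks" here refers to how LocalPipelineExecutor splits the input into tasks and runs some of them concurrently as worker processes. The sketch below only illustrates that distinction; the specific values and pipeline contents are assumptions, not taken from the thread:

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import TokensCounter

# Illustrative only: "tasks" is how many shards the input is split into,
# "workers" is how many of those tasks run at the same time as separate
# processes. Each concurrent worker holds its own documents in memory, so
# peak usage grows with the number of workers and the size of the documents.
executor = LocalPipelineExecutor(
    pipeline=[JsonlReader("data/books/"), TokensCounter()],
    tasks=16,
    workers=8,
)

if __name__ == "__main__":
    executor.run()
```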

justHungryMan commented 1 month ago

Yes. There are several scenarios, but the pipeline only reads the jsonl files and counts tokens.

My machine has 256GB of memory.

Only long-context data causes the OOM.