Open justHungryMan opened 1 month ago
I think it's connected to https://github.com/huggingface/datatrove/issues/161. Are you running locally using multiple tasks?
Yes, I'm running locally with multiple tasks. There are several scenarios, but each one only reads .jsonl files and counts tokens.
My machine has 256GB of memory.
Only long-context data triggers the OOM.
I've been using datatrove to read .jsonl files and count tokens with token_counter on a local node. I'm running into an issue where the process is killed due to memory overflow while handling long-context data (like books). It seems strange that simply counting tokens would cause such a problem, especially since my own standalone token-counting script has no memory issues.
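For reference, the pipeline is roughly the following (a minimal sketch; the input folder, task count, and logging dir are placeholders, and TokensCounter arguments may differ across datatrove versions):

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import TokensCounter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/books/"),  # placeholder path to the long-context .jsonl files
        TokensCounter(),             # counts tokens per document with the default tokenizer
    ],
    tasks=16,                        # several local tasks, as mentioned above
    logging_dir="logs/token_count",  # placeholder logging dir
)
executor.run()
```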
I haven't completely nailed down the cause, but it looks like it might be related to executor/base.py keeping data in memory until all pipeline tasks have finished processing all of the data, rather than streaming documents one at a time. Could this be the reason for the memory overflow? If so, do you have any suggestions on how we might improve or work around this? A small illustration of what I mean follows.
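To illustrate the pattern I suspect (a hypothetical sketch, not datatrove's actual code): if any stage materializes the whole document stream instead of consuming it lazily, memory grows with the size of the corpus rather than with a single document.

```python
from typing import Callable, Iterable, Iterator

def run_step_buffered(step: Callable, docs: Iterable) -> list:
    # Anti-pattern: every result stays alive until the step finishes,
    # so long documents (books) pile up in memory.
    return [step(doc) for doc in docs]

def run_step_streaming(step: Callable, docs: Iterable) -> Iterator:
    # Streaming: each document becomes garbage-collectable as soon as
    # the next stage has consumed it, keeping memory roughly per-document.
    for doc in docs:
        yield step(doc)
```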
Thanks for any insights you can provide!