huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Memory overhead in multiprocessing #161

Open jordane95 opened 2 months ago

jordane95 commented 2 months ago

When using the fastText filter, I find that the fastText model is copied into each process, which introduces significant memory overhead. However, to my knowledge, the fastText model is read-only and could be stored in a single shared memory space across all processes.

Can we optimize the current code to save memory? I find that mp.Manager can create shared memory and avoid copying the model, but it is quite hard to integrate into the current code because the manager is initialized at the executor level and is not passed to each pipeline step.
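
A different route from `mp.Manager` that also avoids per-worker copies is copy-on-write `fork()`: load the model once in the parent and only fork the workers afterwards, so the model's pages stay shared until someone writes to them. Below is a minimal sketch outside of datatrove; the model path, pool size, and `_predict` helper are placeholders for illustration and not datatrove APIs:

```python
import multiprocessing as mp

import fasttext  # third-party: pip install fasttext

MODEL_PATH = "lid.176.bin"  # placeholder: any fastText .bin model

_model = None  # module-level global, inherited by forked workers


def _predict(text: str):
    # Workers read the model the parent loaded before forking; as long as
    # nothing writes to the model's memory, the OS keeps those pages shared.
    labels, scores = _model.predict(text.replace("\n", " "))
    return labels[0], float(scores[0])


if __name__ == "__main__":
    _model = fasttext.load_model(MODEL_PATH)  # load ONCE, before forking
    ctx = mp.get_context("fork")  # "fork" (Linux) is what enables copy-on-write sharing
    with ctx.Pool(processes=4) as pool:
        print(pool.map(_predict, ["Hello world", "Bonjour le monde"]))
```

Whether this maps cleanly onto datatrove depends on where the executor forks relative to where the filter loads its model, which is essentially the integration difficulty described above.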

guipenedo commented 2 months ago

Indeed, there may be some complications. I would be curious, however, to know what the performance (speed) implications of loading the model from shared memory would be. Have you tested this?
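
One way to start answering the speed question is to measure the in-process baseline (load time and predictions per second with a locally loaded model) and then repeat the same loop against a shared-memory-backed variant. A rough sketch of the baseline half, with a placeholder model path:

```python
import time

import fasttext  # third-party: pip install fasttext

MODEL_PATH = "lid.176.bin"  # placeholder model path

t0 = time.perf_counter()
model = fasttext.load_model(MODEL_PATH)
print(f"load time: {time.perf_counter() - t0:.2f}s")

sample = "This is a short sample sentence used to measure prediction latency."

for _ in range(100):  # warm-up
    model.predict(sample)

n = 10_000
t0 = time.perf_counter()
for _ in range(n):
    model.predict(sample)
elapsed = time.perf_counter() - t0
print(f"{n / elapsed:,.0f} predictions/sec")
```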

justHungryMan commented 1 month ago

I have a question regarding memory overhead. I created and ran an executor designed to count tokens on approximately 2 TB of text (jsonl), but it gets stuck every time I run it. According to the memory and CPU usage data, memory usage fills up the 256 GB I have available, and after it gets stuck, CPU usage drops from 99% to 0%.

The problem is that there are no error messages in the log, which makes it impossible to diagnose the issue. Does anyone have suggestions on how to address this? I suspect it might be a memory overhead issue.
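
Not a fix, but one way to narrow this down is to watch per-worker RSS while the job runs, to see whether a single worker balloons or all workers grow evenly until the 256 GB is exhausted (reducing the executor's number of parallel workers is another quick check). A diagnostic sketch, assuming `psutil` is installed and run in the executor's process (or pointed at its PID from a separate terminal):

```python
import os
import time

import psutil  # third-party: pip install psutil


def log_memory(interval_s: float = 30.0) -> None:
    """Periodically print the RSS of this process and its children."""
    proc = psutil.Process(os.getpid())
    while True:
        total = proc.memory_info().rss
        per_worker = []
        for child in proc.children(recursive=True):
            try:
                rss = child.memory_info().rss
            except psutil.NoSuchProcess:
                continue  # worker exited between listing and measuring
            total += rss
            per_worker.append(f"pid {child.pid}: {rss / 2**30:.2f} GiB")
        print(f"total {total / 2**30:.2f} GiB | " + "; ".join(per_worker), flush=True)
        time.sleep(interval_s)
```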