Open jordane95 opened 2 months ago
Indeed, there could be some complications. I would be curious, however, about the performance (speed) implications of loading the model from shared memory. Have you tested this?
I have a question regarding memory overhead. I created and ran an executor designed to count tokens on approximately 2TB of text (jsonl), but it gets stuck every time I run it. According to the memory and CPU usage data, memory usage fills up the 256GB I have available, and after it gets stuck, CPU usage drops from 99% to 0%.
The problem is that there are no error messages in the log, which makes the issue impossible to diagnose. Does anyone have suggestions on how to address this? I suspect it might be a memory overhead issue.
When using the fasttext filter, I find that the fasttext model is copied by each process, which introduces significant memory overhead. However, to my knowledge, the fasttext model is read-only and could be stored in a shared memory space across all processes.
Can we optimize the current code to save memory? I find that mp.Manager can create shared memory and avoid copying the model. But it is quite hard to integrate into the current code, since the manager is initialized at the executor level and is not passed to each pipeline step.
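For what it's worth, a minimal sketch of one alternative to mp.Manager: if the workers are started with the fork start method, a read-only object loaded at module level before the pool is created is inherited by every child via copy-on-write, so no explicit shared-memory plumbing is needed. The dict below is a hypothetical stand-in for a large fasttext model (loading a real model would look the same, just replacing the dict with `fasttext.load_model(...)`). Note that CPython reference counting can still touch pages and partially defeat copy-on-write, so this reduces rather than eliminates duplication.

```python
import multiprocessing as mp

# Hypothetical stand-in for a large read-only model (e.g. a fasttext model).
# Loaded once at import time, BEFORE workers are forked, so children inherit
# the same memory pages via copy-on-write instead of each loading a copy.
MODEL = {f"token_{i}": i for i in range(1000)}

def score(word):
    # Workers only read the inherited MODEL; as long as nothing mutates it,
    # the pages stay shared (modulo refcount writes by the interpreter).
    return MODEL.get(word, -1)

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # fork is what makes copy-on-write possible
    with ctx.Pool(processes=2) as pool:
        print(pool.map(score, ["token_1", "token_42", "missing"]))
```

This sidesteps the executor-level manager problem entirely, since nothing has to be passed to the pipeline steps; the trade-off is that it only works on platforms where fork is available (not Windows, and not the default on macOS).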