sambanova / generative_data_prep

Apache License 2.0

Child processes killed silently, causes code to hang #34

Closed snova-zoltanc closed 11 months ago

snova-zoltanc commented 1 year ago

The OS sometimes kills child processes that are tokenizing text. This causes the tokenization to hang, because the multiprocessing library does not fail out when one of the child processes is killed.

If anyone encounters this issue, please pass the --num_workers flag with a low number of workers. This decreases your total memory consumption because fewer workers run in parallel.

To fix this, we need two main changes:

  1. Fail out gracefully with information about what happened
  2. Decrease the memory consumption of this code. There is no reason it should need so much RAM that it causes OOM issues.

snova-zoltanc commented 12 months ago

This PR for failing gracefully when child processes are killed addresses point 1.
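
For anyone curious what failing gracefully can look like, here is a minimal sketch. It is not the code from the PR. It assumes a switch to `concurrent.futures.ProcessPoolExecutor`, which raises `BrokenProcessPool` when a worker dies abruptly instead of waiting forever. The names `tokenize_chunk`, `die_abruptly`, and `run_workers` are hypothetical and exist only for illustration.

```python
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def tokenize_chunk(chunk):
    """Hypothetical stand-in for the real per-worker tokenization."""
    return chunk.split()


def die_abruptly(chunk):
    """Simulate the OS killing a worker (POSIX only, for demonstration)."""
    os.kill(os.getpid(), signal.SIGKILL)


def run_workers(func, chunks, num_workers=2, mp_context=None):
    """Apply func to every chunk in parallel and fail loudly if a worker dies.

    mp_context is an optional multiprocessing context that is passed
    straight through to ProcessPoolExecutor.
    """
    try:
        with ProcessPoolExecutor(max_workers=num_workers,
                                 mp_context=mp_context) as pool:
            # Consuming the results re-raises any worker failure here.
            return list(pool.map(func, chunks))
    except BrokenProcessPool as exc:
        # The executor noticed that a child exited without reporting a result.
        raise RuntimeError(
            "A tokenization worker died unexpectedly (possibly killed by "
            "the OS). Try again with a lower --num_workers value."
        ) from exc
```

Calling `run_workers(tokenize_chunk, ["a b", "c d e"])` returns the tokenized chunks, while `run_workers(die_abruptly, ["a b"])` raises a `RuntimeError` with an actionable message rather than hanging. On platforms that spawn worker processes, such calls belong under an `if __name__ == "__main__":` guard.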

snova-zoltanc commented 11 months ago

Memory profiling shows that the processes do not use much RAM, so it seems they are not being killed due to OOM. Since the PR above, which fails out gracefully when a child process is killed, is finished, and we have no reason to believe the code is taking up too much RAM, we will close this. If anyone experiences tokenization processes being killed, please open another issue.