Closed lerogo closed 2 months ago
Hi @lerogo, I'm not familiar with the context of this pull request. What problem occurs without these two lines of code? Actually, I find that after adding them, only the process on GPU #0 exits, while the other processes hang after the evaluation finishes.
On machines with garbage collection issues (for example, those running without the environment variable LD_PRELOAD="/usr/lib64/libjemalloc.so.2"), exiting directly without synchronization prevents the final result file from being generated, leaving only the intermediate files. Calling dist.barrier() to synchronize the processes is standard practice. If the other processes hang, perhaps dist.destroy_process_group() should be called at the end of the code to explicitly release the process group's resources.
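For reference, here is a minimal sketch of the shutdown order described above. It is not the code from this pull request. The function name `finish_evaluation` and the `merge_results` callable are hypothetical, and `dist` is assumed to be `torch.distributed`, passed in as an argument so the sketch can be exercised without launching several processes. The key point is that `dist.barrier()` is a collective call: every rank has to reach it, otherwise the ranks that did call it wait forever.

```python
def finish_evaluation(dist, merge_results):
    """Synchronize all ranks, write the final file on rank 0, then shut down.

    dist:          assumed to be the torch.distributed module.
    merge_results: hypothetical callable that combines the per-rank
                   intermediate files into the final result file.
    """
    distributed = dist.is_available() and dist.is_initialized()

    if distributed:
        # Every rank must call barrier(); calling it on only some ranks
        # leaves those ranks blocked indefinitely.
        dist.barrier()

    rank = dist.get_rank() if distributed else 0
    result = None
    try:
        if rank == 0:
            # All intermediate files exist by now, so rank 0 can merge them.
            result = merge_results()
    finally:
        if distributed:
            # Tear down the process group on every rank before exiting.
            dist.destroy_process_group()
    return result
```

In a real evaluation script this would be called on every rank as `finish_evaluation(torch.distributed, merge_fn)` after the evaluation loop. Whether this resolves the hang reported above depends on where the two added lines sit relative to any rank-specific branches.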
Thanks, I'll give it a try.
Fix program error exit without synchronization.