open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0
1.34k stars, 188 forks

Fix program error exit without synchronization. #464

Closed lerogo closed 2 months ago

lerogo commented 2 months ago

Fix program error exit without synchronization.

kennymckormick commented 3 weeks ago

Hi @lerogo, I'm not familiar with the context of this pull request. What problem occurs without these two lines of code? In my tests, after adding them, only the process on GPU #0 stops, while the other processes hang after the evaluation is finished.

lerogo commented 3 weeks ago

On machines with garbage collection issues (for example, when the environment variable LD_PRELOAD="/usr/lib64/libjemalloc.so.2" is not set), exiting directly can prevent the final result file from being generated, leaving only the intermediate files. Calling dist.barrier() to synchronize the processes before exit is standard practice. If the other processes hang, perhaps dist.destroy_process_group() should be called at the end of the code to clean up explicitly.
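
For reference, a minimal sketch of the shutdown sequence described above: synchronize with dist.barrier(), then call dist.destroy_process_group(). The helper name finalize_distributed is hypothetical and is not taken from VLMEvalKit or from this pull request; it only assumes that `dist` refers to torch.distributed and that the process group, if any, was initialized earlier in the script.

```python
import torch.distributed as dist


def finalize_distributed() -> bool:
    """Synchronize all ranks, then tear down the default process group.

    Hypothetical helper for illustration only. Returns True if a process
    group was active and has been destroyed, False if there was nothing
    to clean up (for example, a plain single-process run).
    """
    if not (dist.is_available() and dist.is_initialized()):
        return False
    # Block until every rank reaches this point, so that no rank exits
    # while another is still writing or merging result files.
    dist.barrier()
    # Release the process group explicitly instead of relying on
    # interpreter shutdown to do it.
    dist.destroy_process_group()
    return True
```

Every rank has to reach the barrier for it to return, so a helper like this would need to be called on all ranks at the very end of the entry script, not only on rank 0. Whether that explains the hang reported above is not confirmed in this thread.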

kennymckormick commented 3 weeks ago

Thanks, gonna give a try