I ran a program on a server with 8 GPUs, using 2 of them, and set the number of processes to 2 when launching it. I then called your synchronize function from comm.py. However, both processes stopped making progress after reaching dist.barrier(), which puzzles me. If both processes have reached the synchronization point, why are they still blocked?