pytorch / PiPPy

Pipeline Parallelism for PyTorch
BSD 3-Clause "New" or "Revised" License

Update all hf examples to have dist.barrier #1139

Closed. muellerzr closed this 2 months ago

muellerzr commented 3 months ago

Without dist.barrier(), all of the HF examples hang: these examples are small enough that we destroy the process group before all communication has completed. This PR adds dist.barrier() just before dist.destroy_process_group() to fix this.
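For reference, here is a minimal sketch of the teardown order described above. It is not the code from the examples themselves: `run_example` is a hypothetical stand-in for an example script, and it assumes the `gloo` backend with a file-based rendezvous so it can run as a single CPU process. The real examples are launched differently and may use another backend.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def run_example(rank: int, world_size: int, init_method: str) -> float:
    """Hypothetical stand-in for the body of one example script."""
    # Assumption: gloo + explicit init_method so the sketch runs on CPU
    # without a launcher; the actual examples may be set up differently.
    dist.init_process_group(
        backend="gloo",
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
    try:
        # Some small piece of collective communication.
        t = torch.ones(1) * (rank + 1)
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        result = t.item()

        # Wait until every rank has reached this point, so no rank
        # tears down the process group while others still communicate.
        dist.barrier()
    finally:
        dist.destroy_process_group()
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        init = "file://" + os.path.join(tmp, "rendezvous")
        print(run_example(rank=0, world_size=1, init_method=init))
```

The barrier sits on the success path only, while `destroy_process_group()` is in `finally`, so a rank that raises still cleans up instead of waiting on a barrier.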