Closed: osalpekar closed this pull request 3 years ago.
This pull request was exported from Phabricator. Differential Revision: D23610237
This pull request has been merged in pytorch/elastic@766cab89ea2d432ae071b584bbcf67c3d3822f3b.
Summary: NCCL Async Error Handling is a new mechanism implemented in ProcessGroupNCCL to provide reliability for DDP training runs using the NCCL backend. See https://github.com/pytorch/pytorch/issues/46874 for background and implementation details.
At a high level, this system was designed to ensure that desynchronization, collectives stuck spinning at high GPU utilization, and NCCL errors do not cause distributed training runs to hang indefinitely. It catches these errors without any performance impact and brings down the training process, so torchelastic can detect the failure and restart training from the previous checkpoint. The time after which stuck collectives are detected can be tuned via the `timeout` argument to `init_process_group`.

Fixes: https://github.com/pytorch/elastic/issues/136
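A minimal sketch of how a training script might opt in and tune the detection time. The `NCCL_ASYNC_ERROR_HANDLING` environment variable name is taken from the linked PyTorch issue and may differ across PyTorch versions; the sketch also assumes a launcher such as torchelastic has already set the rendezvous environment variables.

```python
# Sketch: enable NCCL async error handling and tune the stuck-collective
# timeout. Assumes the NCCL_ASYNC_ERROR_HANDLING variable described in the
# linked issue, and that the launcher (e.g. torchelastic) sets MASTER_ADDR,
# MASTER_PORT, RANK, and WORLD_SIZE.
import os
from datetime import timedelta

import torch.distributed as dist

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# A collective stuck for longer than `timeout` aborts the process instead
# of hanging; torchelastic can then restart from the last checkpoint.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```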
Differential Revision: D23610237