pytorch / torchrec

PyTorch domain library for recommendation systems
https://pytorch.org/torchrec/
BSD 3-Clause "New" or "Revised" License

TorchRec 2D Parallel #2554

Closed iamzainhuda closed 4 days ago

iamzainhuda commented 1 week ago

Summary: This diff introduces a new parallelism strategy for scaling recommendation model training, called 2D parallel. Model parallelism is scaled out through data parallelism, hence the 2D name. This diff opens the path to scaling training to 4k+ GPUs.

Our new entry point, DMPCollection, subclasses DMP and is meant to be a drop-in replacement for integrating 2D parallelism into distributed training. By setting the total number of GPUs to train across and the number of GPUs to locally shard across (aka one replication group), users can keep the same training loop while training their models over a larger number of GPUs.

The current implementation shards the model such that, for a given shard, its replicated shards lie on the ranks within the node. This significantly improves the performance of the all-reduce communication (parameter sync) by utilizing intra-node bandwidth.

Example Use Case: Consider a setup with 2 nodes, each with 4 GPUs. The sharding groups could be:
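To make the 2-node, 4-GPUs-per-node setup concrete, here is a minimal sketch of one rank layout that is consistent with the description above (replicas of a given shard sit on ranks within the same node). The helper `build_rank_groups` and the strided layout are assumptions made for illustration; they are not part of the TorchRec API, and the actual grouping is decided by the implementation.

```python
from typing import List, Tuple


def build_rank_groups(
    world_size: int, sharding_group_size: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Hypothetical helper: split ranks into sharding and replication groups.

    Assumes a strided layout so that the replicas of a shard occupy
    consecutive ranks, which keeps them on the same node when the number
    of replicas divides the number of GPUs per node.
    """
    if world_size % sharding_group_size != 0:
        raise ValueError("world_size must be a multiple of sharding_group_size")

    # Number of model-parallel copies of the model (the data-parallel dimension).
    num_replicas = world_size // sharding_group_size

    # Each sharding group holds one full model-parallel copy of the model.
    sharding_groups = [
        [replica + num_replicas * shard for shard in range(sharding_group_size)]
        for replica in range(num_replicas)
    ]

    # Each replication group holds the ranks that own copies of the same shard;
    # parameter sync (all-reduce) would run within these groups.
    replication_groups = [
        [num_replicas * shard + replica for replica in range(num_replicas)]
        for shard in range(sharding_group_size)
    ]
    return sharding_groups, replication_groups


# 2 nodes x 4 GPUs = 8 ranks, sharding locally across 4 GPUs.
sharding_groups, replication_groups = build_rank_groups(
    world_size=8, sharding_group_size=4
)
print("sharding groups:   ", sharding_groups)
print("replication groups:", replication_groups)
```

Under this assumed layout, node 0 holds ranks 0-3 and node 1 holds ranks 4-7, so every replication group stays inside a single node and the parameter-sync all-reduce would not need to cross nodes.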

NOTE: We have to pass the global process group to the DDPWrapper; otherwise, some of the unsharded parameters will not have the optimizer applied to them, which will produce numerically inaccurate results.

Differential Revision: D61643328

facebook-github-bot commented 1 week ago

This pull request was exported from Phabricator. Differential Revision: D61643328
