Summary:
While working on TTFB, we observed that sharding of embedding bags takes significant time and is one of the biggest contributors to TTFB, especially on large jobs.
Strobelight data analysis showed that most of that time is spent in all_gather collective calls. Currently we construct sharded tensors one by one, calling a collective for each tensor to exchange metadata, which is inefficient. A better approach is to let every rank build its portion of the metadata for all tensors and exchange it in a single collective call, which significantly reduces overhead and improves performance.
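
The difference between the two approaches can be illustrated with a minimal single-process sketch. It is not the code in this diff: `ShardMeta`, `SimulatedCollective`, `gather_one_by_one` and `gather_batched` are hypothetical names, and the collective is simulated in plain Python so that the number of calls can be counted without a real process group.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ShardMeta:
    """Hypothetical stand-in for the metadata one rank holds for one tensor."""
    tensor_name: str
    rank: int
    offset: int
    length: int


class SimulatedCollective:
    """Single-process stand-in for an all_gather across `world_size` ranks."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self.calls = 0  # number of collective calls issued

    def all_gather(self, per_rank_payloads: List[object]) -> List[object]:
        # Each rank contributes one payload; each rank receives all of them.
        assert len(per_rank_payloads) == self.world_size
        self.calls += 1
        return list(per_rank_payloads)


def local_meta(name: str, rank: int, rows_per_rank: int) -> ShardMeta:
    # Assumption for illustration: every tensor is split evenly by rows.
    return ShardMeta(name, rank, rank * rows_per_rank, rows_per_rank)


def gather_one_by_one(
    names: List[str], coll: SimulatedCollective, rows_per_rank: int
) -> Dict[str, List[ShardMeta]]:
    """One collective call per tensor (the slow pattern described above)."""
    result: Dict[str, List[ShardMeta]] = {}
    for name in names:
        payloads = [local_meta(name, r, rows_per_rank) for r in range(coll.world_size)]
        result[name] = coll.all_gather(payloads)
    return result


def gather_batched(
    names: List[str], coll: SimulatedCollective, rows_per_rank: int
) -> Dict[str, List[ShardMeta]]:
    """Each rank packs metadata for all tensors; one collective call in total."""
    payloads = [
        [local_meta(name, r, rows_per_rank) for name in names]
        for r in range(coll.world_size)
    ]
    gathered = coll.all_gather(payloads)
    result: Dict[str, List[ShardMeta]] = {name: [] for name in names}
    for rank_metas in gathered:  # iterate in rank order
        for meta in rank_metas:
            result[meta.tensor_name].append(meta)
    return result
```

With N tensors, the first function issues N collective calls and the second issues one, while both produce the same per-tensor metadata. In a real job the exchange would go through a collective such as `torch.distributed.all_gather_object`.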
Testing on 256 ranks showed a ~15x speedup.
Differential Revision: D65489998