pytorch / torchrec

PyTorch domain library for recommendation systems
https://pytorch.org/torchrec/
BSD 3-Clause "New" or "Revised" License

add NJT/TD support for EBC and pipeline benchmark #2581

Open TroyGarden opened 1 day ago

TroyGarden commented 1 day ago

Summary:

Documents

Context

Details

Conclusion

  1. [Enablement] With this approach (replacing the KJT permute with TD-KJT conversion), the EBC can now take TensorDict as the module input in both single-GPU and multi-GPU (sharded) scenarios, tested with TrainPipelineBase, TrainPipelineSparseDist, TrainPipelineSemiSync, and TrainPipelinePrefetch.
  2. [Performance] The TD host-to-device data transfer is not necessarily a concern or blocker for the most commonly used train pipeline (TrainPipelineSparseDist).
  3. [Feature Support] To become production-ready, TensorDict needs to (1) integrate the KJT.weights data, and (2) support variable batch size, both of which are used in almost all production models.
  4. [Improvement] There are two major operations we can improve: (1) moving the TensorDict from host to device, and (2) converting TD to KJT. Both are currently in their vanilla (unoptimized) state. Since we don't yet know what real traces would look like with production models, we can't tell whether these improvements are needed or helpful.
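
For readers unfamiliar with the TD-to-KJT conversion mentioned above, here is a minimal, dependency-free sketch of the idea. It is not the torchrec implementation: `JaggedFeature`, `FlatKJT`, and `td_to_kjt` are hypothetical names that use plain Python lists in place of tensors, and they assume each feature in the dict is stored as jagged values plus offsets (NJT-style) with the same batch size for every key. Weights and variable batch size (item 3) are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JaggedFeature:
    """Hypothetical stand-in for one jagged (NJT-like) entry of a TensorDict."""
    values: List[int]
    offsets: List[int]  # length is batch_size + 1


@dataclass
class FlatKJT:
    """Hypothetical stand-in for a KJT: key-major flattened values and lengths."""
    keys: List[str]
    values: List[int]
    lengths: List[int]


def td_to_kjt(td: Dict[str, JaggedFeature]) -> FlatKJT:
    """Concatenate per-key jagged data into one key-major layout."""
    keys = list(td.keys())
    values: List[int] = []
    lengths: List[int] = []
    for key in keys:
        feature = td[key]
        values.extend(feature.values)
        # Per-sample lengths are the differences of consecutive offsets.
        lengths.extend(
            end - start
            for start, end in zip(feature.offsets, feature.offsets[1:])
        )
    return FlatKJT(keys=keys, values=values, lengths=lengths)


# Two features, batch size 2.
td = {
    "f1": JaggedFeature(values=[1, 2, 3], offsets=[0, 2, 3]),
    "f2": JaggedFeature(values=[4, 5, 6, 7], offsets=[0, 1, 4]),
}
kjt = td_to_kjt(td)
```

In this sketch the conversion is a single pass of concatenation per key, which is why its cost relative to the existing KJT permute depends on how many keys and values a real model carries.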

Differential Revision: D65103519

facebook-github-bot commented 1 day ago

This pull request was exported from Phabricator. Differential Revision: D65103519