Fix segfault in training unit tests

sryap commented 1 month ago

Summary: Before this diff, running the SSD-TBE unit tests hit a segmentation fault (P1507485454). The cause was premature tensor deallocation when the unit test invoked set_cuda. Because set_cuda is asynchronous and non-blocking, the unit test must keep the input tensors alive until set_cuda completes. However, the unit test allocated the input tensor inside a for-loop (on the stack), so the tensor was deallocated at the end of each loop iteration, possibly while set_cuda was still using it. This caused the segmentation fault.

This diff fixes the problem by keeping the input tensor alive until set_cuda completes: the tensor is now declared outside the for-loop, and proper synchronization is added before it is reused.
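The lifetime rule described above can be illustrated with a small standalone C++ sketch. It is not the FBGEMM code or the actual unit test: `async_set` and `write_rows` are hypothetical names, and a `std::async` task stands in for the non-blocking set_cuda call. The sketch only shows the general pattern of declaring the buffer outside the loop and synchronizing before the buffer is reused.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for a non-blocking call such as set_cuda. It only
// borrows `src`, so the caller must keep that memory valid until the
// returned future has completed.
std::future<void> async_set(std::vector<float>& store, std::size_t row,
                            const float* src, std::size_t dim) {
  return std::async(std::launch::async, [&store, row, src, dim] {
    for (std::size_t i = 0; i < dim; ++i) {
      store[row * dim + i] = src[i];
    }
  });
}

// Writes num_rows rows of length dim into a flat store, one async call per row.
std::vector<float> write_rows(std::size_t num_rows, std::size_t dim) {
  assert(dim > 0);
  std::vector<float> store(num_rows * dim, 0.0f);

  // The input buffer is declared outside the loop, so it outlives every
  // asynchronous call that borrows it.
  std::vector<float> buffer(dim);

  for (std::size_t row = 0; row < num_rows; ++row) {
    for (std::size_t i = 0; i < dim; ++i) {
      buffer[i] = static_cast<float>(row * dim + i);
    }
    std::future<void> done = async_set(store, row, buffer.data(), dim);
    // Explicit synchronization: wait for the pending copy to finish before
    // the next iteration overwrites the buffer.
    done.wait();
  }
  return store;
}
```

Calling `write_rows(3, 4)` under these assumptions returns a 12-element vector in which element `i` holds the value `i`. In the real test the equivalent synchronization would be on the CUDA side rather than on a `std::future`.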

Differential Revision: D60627636

netlify[bot] commented 1 month ago

Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
Latest commit	7c4b2764b8638eff1e615f583dcdfa282199c270
Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66ad6d9e1a208e00082cb34e
Deploy Preview	https://deploy-preview-2929--pytorch-fbgemm-docs.netlify.app

facebook-github-bot commented 1 month ago

This pull request was exported from Phabricator. Differential Revision: D60627636

facebook-github-bot commented 1 month ago

This pull request has been merged in pytorch/FBGEMM@9cbf073787eca4ff5e296f2ea74fe6adbcd279eb.
