Kanya-Mo closed this 2 weeks ago
This op currently has no XPU unit tests in stock PyTorch, essentially because there is no XPU counterpart of torch.cuda._sleep(). The CUDA unit tests for this op use torch.cuda._sleep() to delay one stream sufficiently long. So, for local testing, I replaced it with an expensive kernel and translated two tests from test_cuda.py to effectively exercise this op (and they passed).
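For illustration, here is a minimal sketch of the workaround described above: standing in for torch.cuda._sleep() with an expensive kernel (repeated large matmuls) to keep one XPU stream busy, then checking event ordering from another stream's point of view. The helper name `busy_kernel` and the matrix sizes/iteration counts are my own assumptions, not the actual code used in local testing.

```python
import torch

def busy_kernel(device, iters=1, size=1024):
    # Stand-in for torch.cuda._sleep(): repeated large matmuls keep the
    # stream occupied long enough for cross-stream ordering to matter.
    # (Helper name and sizes are illustrative, not from the PR.)
    x = torch.randn(size, size, device=device)
    for _ in range(iters):
        x = x @ x
        x = x / x.norm()  # renormalize so values stay finite
    return x

def main():
    # Guard so the sketch degrades gracefully on machines without XPU.
    if not (hasattr(torch, "xpu") and torch.xpu.is_available()):
        print("no XPU available; skipping stream test")
        return
    s1 = torch.xpu.Stream()
    e = torch.xpu.Event()
    with torch.xpu.stream(s1):
        busy_kernel("xpu", iters=20, size=2048)
        e.record()
    # While s1 is still busy, the event should typically not be complete yet.
    print("event complete while stream busy:", e.query())
    torch.xpu.synchronize()
    print("event complete after sync:", e.query())

if __name__ == "__main__":
    main()
```

The same `busy_kernel` shape works for adapting the two test_cuda.py tests: anywhere the CUDA test calls torch.cuda._sleep(cycles), launch the expensive kernel on that stream instead.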
I will work on finding a good way to add unit tests for this op.
Please be aware: this op is widely used in distributed FSDP.