Kanya-Mo closed this 2 weeks ago
This op currently has no XPU unit tests in stock PyTorch, essentially because there is no XPU counterpart of torch.cuda._sleep(). The CUDA unit tests for this op use torch.cuda._sleep() to delay one stream sufficiently long. So, for local testing, I replaced it with an expensive kernel and translated two tests from test_cuda.py to effectively exercise this op (and they passed).
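For illustration, here is a minimal sketch of the workaround described above: standing in for torch.cuda._sleep() with an expensive kernel (repeated large matmuls) to keep one XPU stream busy, then checking event ordering from another stream's point of view. The helper name `busy_kernel` and the matrix sizes/iteration counts are my own assumptions, not the actual code used in local testing.

```python
import torch

def busy_kernel(device, iters=1, size=1024):
    # Stand-in for torch.cuda._sleep(): repeated large matmuls keep the
    # stream occupied long enough for cross-stream ordering to matter.
    # (Helper name and sizes are illustrative, not from the PR.)
    x = torch.randn(size, size, device=device)
    for _ in range(iters):
        x = x @ x
        x = x / x.norm()  # renormalize so values stay finite
    return x

def main():
    # Guard so the sketch degrades gracefully on machines without XPU.
    if not (hasattr(torch, "xpu") and torch.xpu.is_available()):
        print("no XPU available; skipping stream test")
        return
    s1 = torch.xpu.Stream()
    e = torch.xpu.Event()
    with torch.xpu.stream(s1):
        busy_kernel("xpu", iters=20, size=2048)
        e.record()
    # While s1 is still busy, the event should typically not be complete yet.
    print("event complete while stream busy:", e.query())
    torch.xpu.synchronize()
    print("event complete after sync:", e.query())

if __name__ == "__main__":
    main()
```

The same `busy_kernel` shape works for adapting the two test_cuda.py tests: anywhere the CUDA test calls torch.cuda._sleep(cycles), launch the expensive kernel on that stream instead.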
I will work on finding a good way to add unit tests for this op.
Please be aware: this op is widely used in distributed FSDP.