There is currently no unit test in PyTorch that covers record_stream. These two tests are adapted from the corresponding CUDA tests. The only difference is that I use an actual expensive kernel in place of torch.cuda._sleep to create a delay in one stream. The add kernel used here should create a sufficient delay, given the maximum memory bandwidth among the currently supported GPUs.
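
As a rough illustration of the bandwidth argument, the sketch below estimates a lower bound on how long a memory-bound elementwise add takes. It is not code from the tests: the helper name `min_kernel_time_ms`, the tensor size, and the 1000 GB/s peak bandwidth are all hypothetical values chosen for the example.

```python
def min_kernel_time_ms(num_elements, bytes_per_element=4,
                       tensors_touched=3, bandwidth_gb_s=1000.0):
    """Lower bound on the runtime of a memory-bound elementwise kernel.

    Assumes the kernel is limited only by memory traffic: every element of
    every tensor touched is read or written exactly once.
    """
    bytes_moved = num_elements * bytes_per_element * tensors_touched
    seconds = bytes_moved / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical case: out = a + b on 2**28 float32 elements
# (two reads and one write per element) at an assumed 1000 GB/s peak.
delay_ms = min_kernel_time_ms(2**28)

# The same kernel on a device with half the bandwidth takes twice as long.
slower_delay_ms = min_kernel_time_ms(2**28, bandwidth_gb_s=500.0)
```

Under these assumed numbers the add takes at least about 3.2 ms. Sizing the tensor against the fastest supported device gives the shortest delay, so the delay can only be longer on slower devices.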