Currently, the logic that computes and displays the loss runs synchronously. This is implicit in the call to `loss.item()`, which copies the loss tensor from GPU to CPU and blocks the host until the copy completes, which is inefficient.

The solution is to copy the `loss` tensor asynchronously, using `tensor.to` with `copy` and `non_blocking` set to `True`, and only read the value at logging time after a synchronization point.
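A minimal sketch of the pattern, assuming a CUDA device is available; the model, data, and `log_interval` here are placeholders for illustration, not part of the original change:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
log_interval = 10

pending = []  # async CPU copies of the loss, queued on the current stream
for step in range(100):
    x = torch.randn(32, 10, device=device)
    y = torch.randn(32, 1, device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    # Queue an asynchronous device-to-host copy instead of blocking on
    # loss.item(). copy=True forces a new tensor; non_blocking=True lets the
    # copy overlap with subsequent GPU work (PyTorch allocates pinned host
    # memory for the destination in this case).
    pending.append(loss.to("cpu", copy=True, non_blocking=True))

    if (step + 1) % log_interval == 0:
        # Synchronize once before reading so the queued copies have finished;
        # reading earlier could observe stale or partially written values.
        torch.cuda.synchronize()
        print(f"step {step + 1}: loss {pending[-1].item():.6f}")
        pending.clear()
```

With this approach the host pays one synchronization per logging interval rather than one blocking GPU-to-CPU transfer per step.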