I tried separating out the setup / transfer times in a few ways, but each ended up messy. Here's an approach that I think solves the specific problem in a reasonable way.
It pulls all device management responsibility out of the example.
All results are speculatively moved to the CPU: this won't hurt tensors that are already there, the time wasted should be minimal, and it does not affect any benchmark numbers.
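For reviewers, here is a rough sketch of the idea rather than the actual code in this change. The names `to_cpu` and `timed_run` are hypothetical, and it assumes result tensors expose a `.cpu()` method (as PyTorch tensors do) that returns the object unchanged when it is already on the CPU. The transfer happens after the timer stops, so it stays out of the measured time.

```python
import time


def to_cpu(result):
    """Recursively move anything exposing .cpu() to the CPU (hypothetical helper)."""
    if callable(getattr(result, "cpu", None)):
        # Assumed to be cheap when the object is already on the CPU.
        return result.cpu()
    if isinstance(result, dict):
        return {key: to_cpu(value) for key, value in result.items()}
    if isinstance(result, list):
        return [to_cpu(value) for value in result]
    if isinstance(result, tuple):
        return tuple(to_cpu(value) for value in result)
    # Plain Python values (ints, floats, strings, None) pass through untouched.
    return result


def timed_run(fn, *args):
    """Time fn(*args), then move its results to the CPU outside the timed region."""
    start = time.perf_counter()
    raw = fn(*args)
    # Note: on an asynchronous device the harness would also need to
    # synchronize here before reading the clock; omitted in this sketch.
    elapsed = time.perf_counter() - start
    return to_cpu(raw), elapsed
```

With this shape the example itself never needs to know which device it ran on; it just returns whatever it computed.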
Thoughts?
AB#19085