If you have a Tensor and apply a "standard" operation to it, followed by an "inplace" operation, then resolve the tensor and read it back to the CPU, the readback contains the result of the "standard" operation instead of the "inplace" one.
This keeps popping up and I keep kicking the can down the road.
A simple fix (despite not knowing the underlying cause) would be to do a GPU buffer copy into a fresh buffer before reading back, but I'd like to know the reason.
The Whisper encoder stem reproduces the issue nicely.
Read back the stem output with Binary inplace enabled and with it disabled.
With it enabled, you get the output of the preceding permute instead of the stem output.
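
To make the enabled/disabled comparison above less manual, here is a minimal sketch of a check that classifies a readback as either the expected output or the stale output of the prior op. It works on plain `f32` slices that have already been copied to the CPU. The names `Match`, `all_close` and `classify_readback` are hypothetical helpers for illustration, not part of any existing API, and the tolerance is an assumption.

```rust
/// Which reference a readback matches (hypothetical helper type).
#[derive(Debug, PartialEq)]
enum Match {
    /// Readback equals the output of the final (inplace) op.
    Expected,
    /// Readback equals the output of the op before it, i.e. the bug.
    StalePrior,
    /// Readback matches neither reference.
    Neither,
}

/// Element-wise comparison with an absolute tolerance.
fn all_close(a: &[f32], b: &[f32], atol: f32) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| (x - y).abs() <= atol)
}

/// Classify `readback` against the prior op's output and the expected output.
/// `expected` is checked first, so identical references report `Expected`.
fn classify_readback(readback: &[f32], prior: &[f32], expected: &[f32], atol: f32) -> Match {
    if all_close(readback, expected, atol) {
        Match::Expected
    } else if all_close(readback, prior, atol) {
        Match::StalePrior
    } else {
        Match::Neither
    }
}
```

In the Whisper repro, `prior` would be the permute output and `expected` the stem output computed with inplace disabled. A `StalePrior` result with inplace enabled confirms the symptom described here. It says nothing about the underlying cause.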