The threadblock output saver currently allocates a separate stensor for the output tensor, which results in high shared memory overhead. We should enable in-place optimization for the output saver and close this issue once the implementation is merged into the main branch.