The original implementation had two critical issues:

- Functional: it did not preserve register order, so it computed a different reduction.
- Performance: when converting back to the original tensor type, it did `reshape(convert_layout(res))`. The `reshape` operation therefore served as an anchor, and the suboptimal slice layout was propagated.

This was fixed as follows:

- Keep register order.
- Do `convert_layout(reshape(res))` when converting back to the original type, thus propagating the more optimal layout.

See the implementation for further details.
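
As a rough illustration of the functional issue only, the sketch below uses plain Python lists rather than Triton code. The helper names (`reshape_2d`, `reduce_rows`) and the interleaving permutation are hypothetical and chosen for illustration; they are not taken from the implementation. It shows that regrouping the same values in a different order before a row-wise reduction yields different partial results, even though the overall total is unchanged.

```python
def reshape_2d(flat, rows, cols):
    """Split a flat list into `rows` consecutive chunks of `cols` elements."""
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def reduce_rows(matrix):
    """Sum-reduce each row (a reduction along the last axis)."""
    return [sum(row) for row in matrix]


# Stand-in for the values held in one thread's registers (assumption).
registers = list(range(8))

# Order preserved: each row groups the elements that belong together.
preserved = reduce_rows(reshape_2d(registers, 2, 4))

# Order not preserved: a hypothetical interleaving, [0, 2, 4, 6, 1, 3, 5, 7].
shuffled = registers[0::2] + registers[1::2]
permuted = reduce_rows(reshape_2d(shuffled, 2, 4))

print(preserved)  # [6, 22]
print(permuted)   # [12, 16]
```

Both variants sum to the same grand total, which is why this kind of bug can be easy to miss when only a full reduction is checked.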
Closes #2752