Investigating this further, I only found IR differences in the following dispatch (and similar ones) with a `tensor.unpack` + generic op with dynamic shapes + `tensor.pack`:
Without the PR, we generate scalar loads (single-element "vector" loads) for the `tensor.unpack` because unit dims are not removed. With the PR, the vector loads have a proper size.
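As a rough sketch of the difference (illustrative shapes and names, not taken from the actual dispatch IR), the `tensor.unpack` read vectorizes to something like:

```mlir
// Without https://github.com/llvm/llvm-project/pull/79752: the unit dims
// survive, so each transfer reads a single element.
%e = vector.transfer_read %src[%i, %j], %pad
    : tensor<?x8xf32>, vector<1x1xf32>

// With the PR: the unit dims are dropped and a full-width load is emitted.
%v = vector.transfer_read %src[%i, %j], %pad
    : tensor<?x8xf32>, vector<8xf32>
```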
To my surprise, the generic op with dynamic shapes is not vectorized either before or after the PR, because Pixel 6 doesn't support masking. I would expect peeling to be used in these cases, but it looks like even though the peeling strategy is selected, that decision is overridden at some point in the pipeline.
Given that we don't vectorize the generic op, generating scalar loads for the `tensor.unpack` helps forward the loaded elements into the scalar operations generated for the generic op. When we generate proper vector loads for `tensor.unpack` with the PR, that forwarding doesn't happen and the end result is worse, even though we vectorize the `tensor.unpack` op more efficiently.
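In other words (again an illustrative sketch, not the real IR), the scalar case lets the load forward directly, while the vector case inserts an extract in between:

```mlir
// Scalar case: the single-element load forwards straight into the
// scalar body of the (unvectorized) generic op.
%e = memref.load %buf[%i] : memref<?xf32>
%r = arith.mulf %e, %cst : f32

// Vector case: elements must first be extracted from the loaded vector,
// so the load itself no longer forwards into the scalar ops.
%v = vector.transfer_read %buf[%i], %pad : memref<?xf32>, vector<8xf32>
%e0 = vector.extract %v[0] : f32 from vector<8xf32>
%r0 = arith.mulf %e0, %cst : f32
```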
Moving forward, I think we should:
- [x] Land the drop-unit-dims PR and deal with the regression later, given that it only impacts Pixel 6.
- [ ] Remove the peeling strategy [override](https://github.com/dcaballe/iree/blob/cfe77c31cbb27c91ac0acb2422f5349e2c88d17f/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp#L2131-L2135). That override is really surprising for the user. If that strategy is not the appropriate one, we should make sure it's not selected at any point in the pipeline.
- [ ] Come up with a solution to vectorize the dispatch above for targets without masking support. This means having an alternative lowering for pack/unpack ops that doesn't use masking.
I caught up with Hanhan about this, and we agreed to move forward with the re-land and deal with the Pixel 6 regression later, using this issue for tracking. Working on the re-land.
We ran into a regression in https://github.com/openxla/iree/pull/16286#issuecomment-1925567807 that impacted DeepLabV3 on Pixel 6. The regression was gone when reverting a PR that dropped some unit dims from vector operations: https://github.com/llvm/llvm-project/pull/79752
cc: @hanhanW, @Max191, @MaheshRavishankar