eastB233 closed this issue 1 week ago.
A simpler test case is:

extern const int npy;
extern const int nx;
double *rspace;

void recip2real(double *out, const double factor)
{
#ifdef _OPENMP
#pragma omp parallel for collapse(2)
#endif
    for (int ix = 0; ix < nx; ++ix) {
        for (int ipy = 0; ipy < npy; ++ipy) {
            out[ix * npy + ipy] += factor * rspace[ix * npy + ipy];
        }
    }
}
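As an aside (my understanding, not stated in the issue): with collapse(2) the two loops are linearized into a single iteration space, and the two indices are recovered from the linear counter with a division and a remainder, which is where the sdiv in the vectorized body comes from. A rough sketch of the equivalent linearized loop (names are illustrative):

```cpp
// Illustrative sketch: how `collapse(2)` linearizes the loop nest.
// The (ix, ipy) pair is recovered from the linear index i via a
// division and a remainder -- the source of the problematic sdiv.
static void recip2real_linear(double *out, const double *rspace,
                              double factor, long nx, long npy) {
  for (long i = 0; i < nx * npy; ++i) {
    long ix = i / npy;        // corresponds to %div24 in the report
    long ipy = i - ix * npy;  // remainder
    out[ix * npy + ipy] += factor * rspace[ix * npy + ipy];
  }
}
```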
The compile command is:

clang -O3 -march=armv9-a -fopenmp -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -S

The crash can be reproduced at https://godbolt.org/z/6MGsecEr5
I don't know much about VPlan, but as far as I can tell from tracing the code, the generated plan is:
VPlan 'Initial VPlan for VF={vscale x 1,vscale x 2},UF>=1' {
vp<%2> = original trip-count

ph:
  EMIT vp<%2> = EXPAND SCEV (1 + (-1 * %5) + ((-1 + ((zext i32 %0 to i64) * (sext i32 %1 to i64)))<nsw> smin %4))
No successors

vector.ph:
  EMIT vp<%3> = TC > VF ? TC - VF : 0 vp<%2>
  EMIT vp<%4> = VF * Part + ir<0>
  EMIT vp<%5> = active lane mask vp<%4>, vp<%2>
Successor(s): vector loop

<x1> vector loop: {
  vector.body:
    EMIT vp<%6> = CANONICAL-INDUCTION
    ACTIVE-LANE-MASK-PHI vp<%7> = phi vp<%5>, vp<%27>
    vp<%8> = DERIVED-IV ir<%5> + vp<%6> * ir<1>
    vp<%9> = SCALAR-STEPS vp<%8>, ir<1>
  Successor(s): pred.sdiv

  <xVFxUF> pred.sdiv: {
    pred.sdiv.entry:
      BRANCH-ON-MASK vp<%7>
    Successor(s): pred.sdiv.if, pred.sdiv.continue

    pred.sdiv.if:
      CLONE ir<%div24> = sdiv vp<%9>, ir<%conv6>
    Successor(s): pred.sdiv.continue

    pred.sdiv.continue:
      PHI-PREDICATED-INSTRUCTION vp<%11> = ir<%div24>
    No successors
  }
  Successor(s): omp.inner.for.body.0

  omp.inner.for.body.0:
    CLONE ir<%conv26> = trunc vp<%11>
    CLONE ir<%mul36> = mul nsw vp<%11>, ir<%conv6>
    CLONE ir<%sub37> = sub nsw vp<%9>, ir<%mul36>
    CLONE ir<%conv40> = trunc ir<%sub37>
    CLONE ir<%mul41> = mul nsw ir<%1>, ir<%conv26>
    CLONE ir<%add42> = add nsw ir<%mul41>, ir<%conv40>
    CLONE ir<%idxprom> = sext ir<%add42>
    CLONE ir<%arrayidx> = getelementptr inbounds ir<%6>, ir<%idxprom>
    WIDEN ir<%8> = load ir<%arrayidx>, vp<%7>
    WIDEN ir<%mul43> = fmul contract ir<%2>, ir<%8>
    CLONE ir<%arrayidx47> = getelementptr inbounds ir<%7>, ir<%idxprom>
    WIDEN ir<%9> = load ir<%arrayidx47>, vp<%7>
    WIDEN ir<%add48> = fadd contract ir<%mul43>, ir<%9>
    WIDEN store ir<%arrayidx47>, ir<%add48>, vp<%7>
    EMIT vp<%25> = VF * UF + vp<%6>
    EMIT vp<%26> = VF * Part + vp<%6>
    EMIT vp<%27> = active lane mask vp<%26>, vp<%3>
    EMIT vp<%28> = not vp<%27>
    EMIT branch-on-cond vp<%28>
  No successors
}
Successor(s): middle.block
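For context, the active lane mask above implements tail folding: a lane is enabled only while its linear index is still below the trip count, so the final partial iteration is handled by masking instead of a scalar epilogue. A minimal scalar model of that mask (my own illustration, not LLVM code):

```cpp
// Minimal model of an active-lane-mask for tail folding: lane `l` of a
// vector iteration starting at `base` is active iff base + l < tripCount.
#include <vector>

static std::vector<bool> activeLaneMask(long base, long tripCount, long vf) {
  std::vector<bool> mask(vf);
  for (long l = 0; l < vf; ++l)
    mask[l] = (base + l) < tripCount;
  return mask;
}
```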
The VPRegion pred.sdiv fails at this assertion in VPRegionBlock::execute:
void VPRegionBlock::execute(VPTransformState *State) {
  ...
  if (!isReplicator()) {
    ...
    return;
  }
  ...
  for (...) {
    assert(!State->VF.isScalable() && "VF is assumed to be non scalable.");
  }
}
I think the VPRegion pred.sdiv should have isReplicator() == false, or pred.sdiv should not exist at all. pred.sdiv is split out of the WorkList in addReplicateRegions:
static void addReplicateRegions(VPlan &Plan) {
  SmallVector<VPReplicateRecipe *> WorkList;
  for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
           vp_depth_first_deep(Plan.getEntry()))) {
    for (VPRecipeBase &R : *VPBB)
      if (auto *RepR = dyn_cast<VPReplicateRecipe>(&R)) {
        if (RepR->isPredicated())
          WorkList.push_back(RepR);
      }
  }
  ...
}
because of the VPRecipe CLONE ir<%div24> = sdiv ir<%.omp.iv.065>, ir<%conv6>, vp<%7>, whose underlying instruction is %div24 = sdiv i64 %.omp.iv.065, %conv6. I think this VPRecipe should have isPredicated() == false here, so that it is not split into a region.
VPRecipeOrVPValueTy VPRecipeBuilder::handleReplication(...) {
  bool IsUniform = LoopVectorizationPlanner::getDecisionAndClampRange(
      [&](ElementCount VF) { return CM.isUniformAfterVectorization(I, VF); },
      Range);

  bool IsPredicated = CM.isPredicatedInst(I);
  ...
  VPValue *BlockInMask = nullptr;
  if (!IsPredicated) {
    // Finalize the recipe for Instr, first if it is not predicated.
    LLVM_DEBUG(dbgs() << "LV: Scalarizing:" << *I << "\n");
  } else {
    LLVM_DEBUG(dbgs() << "LV: Scalarizing and predicating:" << *I << "\n");
    // Instructions marked for predication are replicated and a mask operand is
    // added initially. Masked replicate recipes will later be placed under an
    // if-then construct to prevent side-effects. Generate recipes to compute
    // the block mask for this region.
    BlockInMask = createBlockInMask(I->getParent(), Plan);
  }

  auto *Recipe = new VPReplicateRecipe(I, Plan.mapToVPValues(I->operands()),
                                       IsUniform, BlockInMask);
  return toVPRecipeResult(Recipe);
}
I notice that the instruction I (%div24 = sdiv i64 %.omp.iv.065, %conv6) does not need to be vectorized, because isScalarAfterVectorization returns true and I is only used to compute an index. It seems reasonable that a scalar-after-vectorization instruction does not need to be predicated.
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 907b8ce002e8..76a5704a61c5 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -9819,7 +9819,9 @@ VPRecipeOrVPValueTy VPRecipeBuilder::handleReplication(Instruction *I,
       [&](ElementCount VF) { return CM.isUniformAfterVectorization(I, VF); },
       Range);

-  bool IsPredicated = CM.isPredicatedInst(I);
+  bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
+      [&](ElementCount VF) { return CM.isPredicatedInst(I) && !CM.isScalarAfterVectorization(I, VF); },
+      Range);

   // Even if the instruction is not marked as uniform, there are certain
   // intrinsic calls that can be effectively treated as such, so we check for
This is just my guess; I'm not sure whether it is the correct direction.
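The idea in the patch above can be modeled in isolation: the decision is evaluated per candidate VF, and the VF range is clamped to the prefix over which the decision is uniform. A self-contained analogue of that clamp-range pattern (purely illustrative, not LLVM's actual implementation; `VFRange` here is a hypothetical stand-in):

```cpp
// Illustrative analogue of getDecisionAndClampRange: evaluate the
// predicate at the smallest VF, then shrink the range end so every
// VF left in [start, end) yields the same decision.
#include <functional>

struct VFRange { int start; int end; };  // hypothetical stand-in type

static bool decideAndClamp(const std::function<bool(int)> &pred, VFRange &r) {
  bool decision = pred(r.start);
  for (int vf = r.start * 2; vf < r.end; vf *= 2)
    if (pred(vf) != decision) { r.end = vf; break; }
  return decision;
}
```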
Ping @fhahn
I think I misunderstood something, and the changes above may be wrong. I have another approach: in the instruction %div24 = sdiv i64 %.omp.iv.065, %conv6, the divisor %conv6 is loop-invariant, so it seems predication is not needed. I tried the following patch:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c7c19ef456c7..f294703e1529 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3856,21 +3856,21 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
         !Legal->blockNeedsPredication(I->getParent()))
       return false;
     return true;
   }
   case Instruction::UDiv:
   case Instruction::SDiv:
   case Instruction::SRem:
   case Instruction::URem:
     // TODO: We can use the loop-preheader as context point here and get
     // context sensitive reasoning
-    return !isSafeToSpeculativelyExecute(I);
+    return !isSafeToSpeculativelyExecute(I) && !Legal->isInvariant(I->getOperand(1));
   case Instruction::Call:
     return Legal->isMaskRequired(I);
   }
 }

 std::pair<InstructionCost, InstructionCost>
 LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
                                                      ElementCount VF) const {
   assert(I->getOpcode() == Instruction::UDiv ||
          I->getOpcode() == Instruction::SDiv ||
ping @sdesmalen-arm @davemgreen @paulwalker-arm
@eastB233 unfortunately I don't think this change is correct; e.g. consider https://github.com/llvm/llvm-project/blob/967eba07549d64f15e7a91e798aa46214704f62b/llvm/test/Transforms/LoopVectorize/X86/divs-with-tail-folding.ll#L251, where the sdiv/udiv is executed conditionally in the loop.
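A sketch of the kind of counterexample being referred to (my own construction, not the test from the linked file): the divisor is loop-invariant, yet the division is only reached under a guard, so it still must stay predicated:

```cpp
// Loop-invariant divisor, but conditionally executed division:
// d may be 0 whenever no guard[i] is set, so executing the sdiv
// unconditionally (i.e. dropping predication) would divide by zero.
static long sumGuarded(const long *a, const bool *guard, long d, int n) {
  long s = 0;
  for (int i = 0; i < n; ++i)
    if (guard[i])      // the division must not run when guard[i] is false
      s += a[i] / d;
  return s;
}
```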
The IR is attached at the end. The compile command is:

opt -passes=loop-vectorize -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue

The resulting error can be seen at https://godbolt.org/z/s4bqzdKPP