llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[ppc] floating point var allocated to integer register #25716

Closed weiguozhi closed 2 years ago

weiguozhi commented 8 years ago
Bugzilla Link 25342
Resolution FIXED
Resolved on Apr 27, 2016 12:07
Version trunk
OS Linux
Attachments testcase
CC @vns-mn,@echristo,@hfinkel,@nemanjai

Extended Description

Compile the attached source code with options

~/llvm/obj/bin/clang++ --target=powerpc64le-grtev4-linux-gnu -O2 -fno-strict-aliasing -m64 -mvsx -mcpu=power8 -ffp-contract=fast

llvm generates following code for the inner loop

    lfsu 1, 16(22)
    addi 24, 24, -2
    cmplwi 6, 24, 0
    lfsu 2, 16(23)
    lxsspx 4, 22, 5
    lxsspx 3, 23, 5
    stw 25, -112(1)           // A
    ori 2, 2, 0 
    lxsspx 5, 0, 6            // B
    stw 26, -108(1)           // C
    xsmulsp 7, 3, 4 
    ori 2, 2, 0 
    lxsspx 6, 0, 7            // D
    lxsspx 9, 23, 10
    lxsspx 8, 22, 10
    lxsspx 11, 23, 8
    lxsspx 12, 22, 8
    xsmaddasp 6, 2, 4 
    xsmulsp 10, 9, 8 
    xsmsubasp 7, 2, 1 
    xsmaddasp 6, 3, 1 
    xsmsubasp 10, 11, 12
    xsaddsp 5, 5, 7 
    xsmaddasp 6, 11, 8
    xsaddsp 13, 5, 10
    xsmaddasp 6, 9, 12
    stxsspx 13, 0, 11         // E
    ori 2, 2, 0 
    lwz 25, -100(1)           // F
    stxsspx 6, 0, 12          // G
    ori 2, 2, 0 
    lwz 26, -104(1)           // H
    bne      6, .LBB0_9

The variable std::complex<float> sum is allocated to the integer registers r25 and r26. Before it is used, instructions A, B, C and D move it into floating point registers; after the new values are computed, instructions E, F, G and H move it back to integer registers.

GCC allocates sum to floating point registers directly.

On power8, the llvm-generated code is 4.5x slower than the gcc-generated code.

llvmbot commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#26445

llvmbot commented 8 years ago

Assigned to Carrot to make it easy to see that someone is looking at this. There is a patch posted on Phabricator waiting for review.

weiguozhi commented 8 years ago

> Hi Carrot, I'm wondering if you have posted this change on Phabricator and/or committed the change.

http://reviews.llvm.org/D14596

nemanjai commented 8 years ago

Hi Carrot, I'm wondering if you have posted this change on Phabricator and/or committed the change.

hfinkel commented 8 years ago

> I have another patch that can also fix this problem. The function SROA.cpp:findCommonType always looks for an integer type if there are multiple different types, which is not reasonable. My patch looks for the most frequently used type, which should be much better for performance. Any comments?

I think this looks very promising. Could you please post it for review?

...

hfinkel commented 8 years ago

> For PowerPC, we could clean up the resulting unnecessary use of floating-point loads and stores as a peephole either at the DAG or the MI level. If a float load has a single reached use, which is a float store with a single reaching definition, the pair can be replaced with an integer load/store. This might be better than trying to impose a full set of correctness conditions in the middle end, especially since these may rest on target- and subtarget-dependent information.

Is this in response to my question regarding single-element structures? If so, my concern was not whether including them, or not, was optimal for PPC or any other target, but rather, that our canonical forms are already complicated, and introducing yet more special cases without a clear rationale is undesirable.

> Of course, we would want to consider effects on the other major ports.

weiguozhi commented 8 years ago

I have another patch that can also fix this problem. The function SROA.cpp:findCommonType always looks for an integer type if there are multiple different types, which is not reasonable. My patch looks for the most frequently used type, which should be much better for performance. Any comments?

    Index: lib/Transforms/Scalar/SROA.cpp
    --- lib/Transforms/Scalar/SROA.cpp	(revision 252111)
    +++ lib/Transforms/Scalar/SROA.cpp	(working copy)
    @@ -1073,9 +1073,7 @@ static Type *findCommonType(AllocaSlices::const_iterator B, AllocaSlices::const_iterator E, uint64_t EndOffset) {
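Detached from LLVM, the "most frequently used type" idea in this patch can be sketched as a standalone function. This is only a rough model; the name findMostUsedType and the string type tags are stand-ins for the actual llvm::Type* machinery in SROA.cpp, not the posted patch itself:

```cpp
#include <map>
#include <string>
#include <vector>

// Rough model of the proposed heuristic: given the access types recorded
// for an alloca's slices, pick the most frequently seen type instead of
// always falling back to an integer type. Strings stand in for Type*.
std::string findMostUsedType(const std::vector<std::string> &accessTypes) {
  std::map<std::string, int> counts;
  for (const std::string &ty : accessTypes)
    ++counts[ty];
  std::string best;
  int bestCount = 0;
  for (const auto &entry : counts)
    if (entry.second > bestCount) {
      best = entry.first;
      bestCount = entry.second;
    }
  return best;  // empty string if accessTypes was empty
}
```

For this bug's testcase, the float uses of sum would presumably outnumber the single i64 access introduced by the lowered memcpy, so the vote would keep sum in floating point registers.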

llvmbot commented 8 years ago

For PowerPC, we could clean up the resulting unnecessary use of floating-point loads and stores as a peephole either at the DAG or the MI level. If a float load has a single reached use, which is a float store with a single reaching definition, the pair can be replaced with an integer load/store. This might be better than trying to impose a full set of correctness conditions in the middle end, especially since these may rest on target- and subtarget-dependent information.

Of course, we would want to consider effects on the other major ports.

hfinkel commented 8 years ago

> I've suggested this fix to Hal. Just waiting to hear back.
>
>     Index: lib/Transforms/InstCombine/InstCombineCalls.cpp
>     --- lib/Transforms/InstCombine/InstCombineCalls.cpp	(revision 252193)
>     +++ lib/Transforms/InstCombine/InstCombineCalls.cpp	(working copy)
>     @@ -112,6 +112,19 @@ Instruction *InstCombiner::SimplifyMemTransfer(Mem
>        // down through these levels if so.
>        SrcETy = reduceToSingleValueType(SrcETy);
>
>     +  // It is highly unlikely that converting a memcpy of a structure that has
>     +  // a float into an i64 load will be profitable since later uses of the
>     +  // value will likely involve bitcasts back to float
>     +  if (StructType *STy = dyn_cast<StructType>(SrcETy)) {
>     +    unsigned NumElements = STy->getNumElements();
>     +    if (NumElements > 1) {
>     +      for (unsigned ElNum = 0; ElNum < NumElements; ElNum++)
>     +        if (STy->getElementType(ElNum)->isFloatingPointTy()) {
>     +          return nullptr;
>     +        }
>     +    }
>     +  }
>        if (SrcETy->isSingleValueType()) {
>          NewSrcPtrTy = PointerType::get(SrcETy, SrcAddrSp);
>          NewDstPtrTy = PointerType::get(SrcETy, DstAddrSp);
>
> If the variable is not used in a floating point operation in other places, or the structure is not replaced by scalars later, copying through integer registers is actually profitable. Think about power8: integer register load-to-use latency is 2 cycles, fp register load-to-use latency is 5 cycles; an integer load can be handled by 4 functional units, but an fp load can only be handled by 2. Can these cases be handled correctly?

A few comments:

  1. Why are you excluding single-element structures?

  2. As Carrot mentioned, we might want to check for floating-point uses of either the source or destination floating-point members of the structure.

  3. Maybe, instead of bailing out, we should use a floating-point load/store for the floating-point members and an integer load/store for any remainder?

weiguozhi commented 8 years ago

> I've suggested this fix to Hal. Just waiting to hear back.
>
>     Index: lib/Transforms/InstCombine/InstCombineCalls.cpp
>     --- lib/Transforms/InstCombine/InstCombineCalls.cpp	(revision 252193)
>     +++ lib/Transforms/InstCombine/InstCombineCalls.cpp	(working copy)
>     @@ -112,6 +112,19 @@ Instruction *InstCombiner::SimplifyMemTransfer(Mem
>        // down through these levels if so.
>        SrcETy = reduceToSingleValueType(SrcETy);
>
>     +  // It is highly unlikely that converting a memcpy of a structure that has
>     +  // a float into an i64 load will be profitable since later uses of the
>     +  // value will likely involve bitcasts back to float
>     +  if (StructType *STy = dyn_cast<StructType>(SrcETy)) {
>     +    unsigned NumElements = STy->getNumElements();
>     +    if (NumElements > 1) {
>     +      for (unsigned ElNum = 0; ElNum < NumElements; ElNum++)
>     +        if (STy->getElementType(ElNum)->isFloatingPointTy()) {
>     +          return nullptr;
>     +        }
>     +    }
>     +  }
>        if (SrcETy->isSingleValueType()) {
>          NewSrcPtrTy = PointerType::get(SrcETy, SrcAddrSp);
>          NewDstPtrTy = PointerType::get(SrcETy, DstAddrSp);

If the variable is not used in a floating point operation in other places, or the structure is not replaced by scalars later, copying through integer registers is actually profitable. Think about power8: integer register load-to-use latency is 2 cycles, fp register load-to-use latency is 5 cycles; an integer load can be handled by 4 functional units, but an fp load can only be handled by 2. Can these cases be handled correctly?

nemanjai commented 8 years ago

> I've suggested this fix to Hal. Just waiting to hear back.
>
>     Index: lib/Transforms/InstCombine/InstCombineCalls.cpp
>     --- lib/Transforms/InstCombine/InstCombineCalls.cpp	(revision 252193)
>     +++ lib/Transforms/InstCombine/InstCombineCalls.cpp	(working copy)
>     @@ -112,6 +112,19 @@ Instruction *InstCombiner::SimplifyMemTransfer(Mem
>        // down through these levels if so.
>        SrcETy = reduceToSingleValueType(SrcETy);

weiguozhi commented 8 years ago

> Hm. Translating something that small to a memcpy seems like a poor choice when it contains floating-point components. Perhaps that translation to memcpy needs to be rethought under those conditions.

I think the memcpy itself is not a big problem, since it is usually lowered to register ld/st later. And integer register ld/st may be faster than fp register ld/st on some targets (power8 :) ).

The problem is how to choose an appropriate type when there are different types of accesses to the same variable. The current implementation of findCommonType makes a bad choice for this testcase and for more general situations: it always chooses the integer type when there are multiple types.

An ideal choice should generate the fastest code; it would need to compute the cost of each possible type.

A much simpler method can just count the number of accesses of each type and choose the one with the largest count.

> As a practical matter, you might do well just ignoring bitcasts that feed memcpys. The memcpy is canonical, but it is not really "typed", and a bitcast that feeds the memcpy is not really contributing any "type" information (the backend will independently pick the type used to expand the memcpy (and memset, etc.) intrinsic anyway based on heuristics looking at the size, alignment, etc.).

This is reasonable, but it can't be used here, because before SROA runs the memcpy has already been lowered to integer register ld/st.

hfinkel commented 8 years ago

> Hm. Translating something that small to a memcpy seems like a poor choice when it contains floating-point components. Perhaps that translation to memcpy needs to be rethought under those conditions.
>
> I think the memcpy itself is not a big problem, since it is usually lowered to register ld/st later. And integer register ld/st may be faster than fp register ld/st on some targets (power8 :) ).
>
> The problem is how to choose an appropriate type when there are different types of accesses to the same variable. The current implementation of findCommonType makes a bad choice for this testcase and for more general situations: it always chooses the integer type when there are multiple types.
>
> An ideal choice should generate the fastest code; it would need to compute the cost of each possible type.
>
> A much simpler method can just count the number of accesses of each type and choose the one with the largest count.

As a practical matter, you might do well just ignoring bitcasts that feed memcpys. The memcpy is canonical, but it is not really "typed", and a bitcast that feeds the memcpy is not really contributing any "type" information (the backend will independently pick the type used to expand the memcpy (and memset, etc.) intrinsic anyway based on heuristics looking at the size, alignment, etc.).

We might also want to do some (target-cost-weighted) counting, but the memcpy issue seems independent even of that as an important special case.
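The idea of ignoring bitcasts that feed memcpys combines naturally with the counting approach discussed above. The sketch below models it outside LLVM; the Access record, the feedsMemcpy flag, and voteCommonType are all hypothetical stand-ins for the real IR walk, not code from any posted patch:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model: one record per use of the alloca.
struct Access {
  std::string type;   // type seen at this use, e.g. "float" or "i64"
  bool feedsMemcpy;   // true if the use is just a bitcast consumed by memcpy
};

// Vote for a common type, skipping memcpy-feeding bitcasts: memcpy is
// untyped, so such casts contribute no real type information.
std::string voteCommonType(const std::vector<Access> &accesses) {
  std::map<std::string, int> votes;
  for (const Access &a : accesses)
    if (!a.feedsMemcpy)
      ++votes[a.type];
  std::string best;   // stays empty if every access fed a memcpy
  int bestCount = 0;
  for (const auto &entry : votes)
    if (entry.second > bestCount) {
      best = entry.first;
      bestCount = entry.second;
    }
  return best;
}
```

Under this model, the i64 accesses introduced by lowering the memcpy of sum would not vote at all, so the remaining float uses would win even without any cost weighting.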

weiguozhi commented 8 years ago

> Hm. Translating something that small to a memcpy seems like a poor choice when it contains floating-point components. Perhaps that translation to memcpy needs to be rethought under those conditions.

I think the memcpy itself is not a big problem, since it is usually lowered to register ld/st later. And integer register ld/st may be faster than fp register ld/st on some targets (power8 :) ).

The problem is how to choose an appropriate type when there are different types of accesses to the same variable. The current implementation of findCommonType makes a bad choice for this testcase and for more general situations: it always chooses the integer type when there are multiple types.

An ideal choice should generate the fastest code; it would need to compute the cost of each possible type.

A much simpler method can just count the number of accesses of each type and choose the one with the largest count.

llvmbot commented 8 years ago

Hm. Translating something that small to a memcpy seems like a poor choice when it contains floating-point components. Perhaps that translation to memcpy needs to be rethought under those conditions.

weiguozhi commented 8 years ago

The different types exist from the beginning.

The related source code:

  std::complex<float> sum = 0;
  for (int j = 0; j < num_columns; ++j) {
    sum += ComplexMultiply(matrix_i[j], vector_[j]);      // A
  }   
  result_[i] = sum;                                       // B

At first, sum is allocated with the declared type

%sum = alloca %"struct.std::complex", align 4

and line A is translated to

%call18 = call dereferenceable(8) %"struct.std::complex"* @_ZNSt7complexIfEpLIfEERS0_RKS_IT_E(%"struct.std::complex"* %sum, %"struct.std::complex"* dereferenceable(8) %ref.tmp)

Line B is a copy; llvm translates it to a memcpy, and the address of sum is cast to i8*, so we have different types in different uses from the start.

%20 = bitcast %"struct.std::complex"* %sum to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %19, i8* %20, i64 8, i32 4, i1 false)

Later the memcpy call is lowered to direct register loads and stores, and here an integer register is used.

%30 = bitcast %"struct.std::complex"* %sum to i64*
%31 = bitcast %"struct.std::complex"* %29 to i64*
%32 = load i64, i64* %30, align 8
store i64 %32, i64* %31, align 4

nemanjai commented 8 years ago

For this particular test case, SROA breaks the aggregate into a pair of integers which causes this behaviour. If line 1114:

TyIsCommon = false; // Give up on anything but an iN type.

of lib/Transforms/Scalar/SROA.cpp is commented out, the performance is on par with what gcc generates. Of course, it isn't safe to remove that line, so I am investigating what leads SROA to be unable to find a common type between the slices.

weiguozhi commented 8 years ago

Target armv8 also has this problem:

    ldr             s0, [x17]
    ldr             s1, [x16]
    ldur    s2, [x17, #-4] 
    ldur    s3, [x16, #-4] 
    fmul    s4, s0, s1
    fmul    s1, s2, s1
    fnmsub  s2, s2, s3, s4
    fmov    s4, w1                       // A
    fmadd   s0, s0, s3, s1
    fmov    s1, w15                      // B
    sub     w18, w18, #1
    fadd    s2, s4, s2
    fadd    s0, s1, s0
    add     x16, x16, #8
    fmov    w1, s2                       // C
    fmov    w15, s0                      // D
    add     x17, x17, #8
    cbnz    w18, .LBB0_5
    b       .LBB0_7

Instructions A and B move the value from integer registers to fp registers; instructions C and D move it from fp registers back to integer registers.

weiguozhi commented 8 years ago

Target x86_64 has the same problem:

    movss   -4(%rdx), %xmm0         # xmm0 = mem[0],zero,zero,zero
    movss   (%rdx), %xmm1           # xmm1 = mem[0],zero,zero,zero
    movss   -4(%rcx), %xmm2         # xmm2 = mem[0],zero,zero,zero
    movss   (%rcx), %xmm3           # xmm3 = mem[0],zero,zero,zero
    movaps  %xmm0, %xmm4
    mulss   %xmm2, %xmm4
    movaps  %xmm1, %xmm5
    mulss   %xmm3, %xmm5
    subss   %xmm5, %xmm4
    mulss   %xmm2, %xmm1
    mulss   %xmm3, %xmm0
    addss   %xmm1, %xmm0
    movd    %esi, %xmm1              // A
    addss   %xmm4, %xmm1
    movd    %xmm1, %esi              // C
    movd    %eax, %xmm1              // B
    addss   %xmm0, %xmm1
    movd    %xmm1, %eax              // D
    addq    $8, %rcx 
    addq    $8, %rdx 
    decl    %ebx
    jne     .LBB0_5

The variable std::complex<float> sum is allocated to esi:eax; instructions A and B move them to xmm registers, and instructions C and D move them back to integer registers. Because these moves don't need to access memory, the performance penalty is much smaller on x86.

hfinkel commented 8 years ago

> At first, variable sum is allocated with:
>
>     %sum = alloca %"struct.std::complex", align 4
>
> After the instruction combine pass, it is changed to
>
>     %sum = alloca i64, align 8
>
> And in the SROA pass, it is replaced by two integer registers.
>
> So is the instruction combine pass the culprit?

It might be; I've also run across similar things recently. We may just need the backend to look through int-to-float bitcasts (and some truncs/shifts) if we're choosing the integer types as the canonical form here.

weiguozhi commented 8 years ago

At first, variable sum is allocated with:

    %sum = alloca %"struct.std::complex", align 4

After the instruction combine pass, it is changed to

    %sum = alloca i64, align 8

And in the SROA pass, it is replaced by two integer registers.

So is the instruction combine pass the culprit?

weiguozhi commented 8 years ago

assigned to @weiguozhi