Inefficient code for fp16 vectors

pirama-arumuga-nainar commented 8 years ago


Bugzilla Link	27222
Version	trunk
OS	Linux
CC	@ahmedbougacha,@asl,@stephenhines

Extended Description

We generate inefficient code for half vectors on some architectures. Consider the following IR:

    define void @add_h(<4 x half>* %a, <4 x half>* %b) {
    entry:
      %x = load <4 x half>, <4 x half>* %a, align 8
      %y = load <4 x half>, <4 x half>* %b, align 8
      %0 = fadd <4 x half> %x, %y
      store <4 x half> %0, <4 x half>* %a
      ret void
    }

LLVM currently splits and scalarizes these vectors. In other words, it splits the <4 x half> into four half values and operates on each one individually. This prevents the backend from selecting vector load and vector conversion instructions. The generated code consists of repeated 16-bit loads, conversions to fp32, additions, conversions back to fp16, and 16-bit stores.

Here's the code generated for ARM32:

    ldrh r4, [r1, #6]
    ldrh r3, [r0, #6]
    ldrh r12, [r1]
    ldrh r2, [r0, #4]
    ldrh lr, [r0, #2]
    vmov s0, r4
    ldrh r4, [r1, #2]
    ldrh r1, [r1, #4]
    vmov s2, r3
    ldrh r3, [r0]
    vmov s6, r2
    vmov s10, lr
    vmov s12, r12
    vcvtb.f32.f16 s0, s0
    vcvtb.f32.f16 s2, s2
    vadd.f32 s0, s2, s0
    vmov s4, r1
    vmov s8, r4
    vmov s14, r3
    vcvtb.f32.f16 s4, s4
    vcvtb.f32.f16 s6, s6
    vcvtb.f32.f16 s2, s8
    vcvtb.f32.f16 s8, s10
    vcvtb.f32.f16 s10, s12
    vcvtb.f32.f16 s12, s14
    vcvtb.f16.f32 s0, s0
    vadd.f32 s4, s6, s4
    vadd.f32 s2, s8, s2
    vadd.f32 s6, s12, s10
    vmov r1, s0
    vcvtb.f16.f32 s4, s4
    vcvtb.f16.f32 s0, s2
    vcvtb.f16.f32 s2, s6
    strh r1, [r0, #6]
    vmov r1, s4
    strh r1, [r0, #4]
    vmov r1, s0
    strh r1, [r0, #2]
    vmov r1, s2
    strh r1, [r0]

In comparison, the same code gets translated to the following for AArch64:

    ldr d0, [x1]
    ldr d1, [x0]
    fcvtl v0.4s, v0.4h
    fcvtl v1.4s, v1.4h
    fadd v0.4s, v1.4s, v0.4s
    fcvtn v0.4h, v0.4s
    str d0, [x0]
    ret
    .Lfunc_end0:

This happens for the architectures whose LLVM backends don't natively support half (such as x86, x86_64 and ARM32).
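For readers who want to check what both listings compute, here is a minimal Python sketch of the intended semantics of @add_h: widen each fp16 lane to fp32, add in fp32, narrow back to fp16, and store the result over the first operand. It is only a reference model, not something LLVM contains. The helper name to_f32 and the use of 8-byte buffers to stand in for the memory behind %a and %b are assumptions made for illustration.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def add_h(a, b):
    """Reference model of @add_h for 4 little-endian fp16 lanes.

    `a` is a mutable 8-byte buffer (hypothetical stand-in for %a),
    `b` is an 8-byte buffer (hypothetical stand-in for %b).
    """
    x = struct.unpack("<4e", a)  # fp16 -> wider float, always exact
    y = struct.unpack("<4e", b)
    # The sum of two fp16 values is exact in binary64, so rounding it
    # once to binary32 matches the result of an fp32 addition.
    sums = [to_f32(p + q) for p, q in zip(x, y)]
    # fp32 -> fp16 narrowing (round to nearest even), then store into a.
    a[:] = struct.pack("<4e", *sums)


a = bytearray(struct.pack("<4e", 1.0, 2.0, 0.5, 1.5))
b = struct.pack("<4e", 1.0, 2.0, 0.25, 2.5)
add_h(a, b)
print(struct.unpack("<4e", a))
```

The scalarized ARM32 sequence and the vectorized AArch64 sequence both implement this lane-wise computation, so the difference between them is only in instruction count, not in results.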

pirama-arumuga-nainar commented 8 years ago

The link I pasted in my previous comment doesn't go directly to the instruction's reference page. See Section 5.44 in http://infocenter.arm.com/help/topic/com.arm.doc.dui0489i/DUI0489I_arm_assembler_reference.pdf.

pirama-arumuga-nainar commented 8 years ago

Hi Anton, fp16 is a storage-only type and LLVM already performs operations on fp16 data by promoting them to fp32. For ARM32 with NEON and the 'half' feature, it'd be more efficient to use the vector-variant of VCVT (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489i/Bcfjicfj.html) instead of the scalar variant. Doing so would produce code similar to the AArch64 output in my initial comment.

asl commented 8 years ago

What would you expect to be generated on ARM32 then? fp16 is a storage-only type there.

llvm / llvm-project

Inefficient code for fp16 vectors #27596
