llvm / llvm-project


Autovectorization of a manually unrolled loop results in wide, interleaved vectors #39836

Open llvmbot opened 5 years ago

llvmbot commented 5 years ago
Bugzilla Link 40490
Version trunk
OS All
Reporter LLVM Bugzilla Contributor
CC @hfinkel
llvmbot commented 5 years ago

Gah, stupid enter button kept submitting my changes before I wanted them to.

LLVM cannot cleanly vectorize a manually unrolled loop.

    void add_unroll_while_reverse_ptradd(unsigned *restrict p1, const unsigned *restrict p2)
    {
        const unsigned *const end = p1 + 1024;
        while (p1 != end) {
            *p1++ += *p2++;
            *p1++ += *p2++;
            *p1++ += *p2++;
            *p1++ += *p2++;
        }
    }

Expected code output (ARM NEON, but this affects other platforms too) would be something like this:

    add_unroll_while_reverse_ptradd:
            mov     r2, #0
    .LBB0_1:
            add     r3, r1, r2
            vld1.32 {d16, d17}, [r3]
            add     r3, r0, r2
            add     r2, r2, #16
            vld1.32 {d18, d19}, [r3]
            cmp     r2, #4096
            vadd.i32 q8, q9, q8
            vst1.32 {d16, d17}, [r3]
            bne     .LBB0_1
            bx      lr

Actual output:

    add_unroll_while_reverse_ptradd:
            push    {r4, r5, r6, r7, r8, r9, r11, lr}
            add     r2, r1, #4096
            cmp     r2, r0
            addhi   r2, r0, #4096
            cmphi   r2, r1
            bhi     .LBB4_3
            mov     lr, #0
    .LBB4_2:
            add     r2, r1, lr
            add     r12, r0, lr
            add     lr, lr, #64
            mov     r3, r2
            mov     r4, r12
            vld1.32 {d16, d17}, [r3]!
            cmp     lr, #4096
            vld1.32 {d18, d19}, [r4]!
            vadd.i32 q8, q9, q8
            vld1.32 {d20, d21}, [r3]
            add     r3, r2, #48
            add     r2, r2, #32
            vld1.32 {d22, d23}, [r3]
            add     r3, r12, #48
            vld1.32 {d26, d27}, [r2]
            add     r2, r12, #32
            vld1.32 {d28, d29}, [r4]
            vadd.i32 q9, q14, q10
            vld1.32 {d20, d21}, [r2]
            vadd.i32 q10, q10, q13
            vld1.32 {d24, d25}, [r3]
            vorr    q13, q8, q8
            vadd.i32 q11, q12, q11
            vorr    q12, q9, q9
            vorr    q14, q10, q10
            vorr    q15, q11, q11
            vtrn.32 q13, q12
            vtrn.32 q14, q15
            vorr    q14, q10, q10
            vext.32 q0, q12, q8, #2
            vzip.32 q14, q11
            vzip.32 q8, q9
            vext.32 q1, q10, q15, #2
            vext.32 q8, q10, q11, #2
            vext.32 q11, q15, q0, #2
            vext.32 q12, q1, q12, #2
            vext.32 q3, q11, q11, #2
            vext.32 q10, q10, q14, #2
            vext.32 q8, q8, q9, #2
            vext.32 q1, q12, q12, #2
            vext.32 q9, q10, q13, #2
            vext.32 q2, q8, q8, #2
            vext.32 q0, q9, q9, #2
            vst4.32 {d0, d2, d4, d6}, [r12]!
            vst4.32 {d1, d3, d5, d7}, [r12]
            bne     .LBB4_2
            b       .LBB4_5
    .LBB4_3:
            mov     r2, #0
    .LBB4_4:
            mov     r3, r0
            mov     r4, r1
            ldr     r12, [r3, r2, lsl #2]!
            ldr     lr, [r4, r2, lsl #2]!
            add     r2, r2, #4
            ldmib   r3, {r8, r9}
            cmp     r2, #1024
            add     r12, r12, lr
            ldr     r7, [r3, #12]
            ldmib   r4, {r5, r6}
            ldr     r4, [r4, #12]
            add     r5, r8, r5
            add     r6, r9, r6
            str     r12, [r3]
            add     r7, r7, r4
            stmib   r3, {r5, r6, r7}
            bne     .LBB4_4
    .LBB4_5:
            pop     {r4, r5, r6, r7, r8, r9, r11, pc}

Instead of rerolling this loop and vectorizing it into a load + load + add + store sequence, Clang generates a 512-bit vector and interleaves it with a number of shuffles, as is evident from the LLVM IR output:

    define dso_local void @add_unroll_while_reverse(i32* noalias nocapture, i32* noalias nocapture readonly) local_unnamed_addr #0 {
      %3 = getelementptr i32, i32* %0, i32 1024
      %4 = getelementptr i32, i32* %1, i32 1024
      %5 = icmp ugt i32* %4, %0
      %6 = icmp ugt i32* %3, %1
      %7 = and i1 %5, %6
      br i1 %7, label %34, label %8

;

;

;

This is ARM NEON, where vector registers are at most 128 bits wide; there are no 512-bit vectors.
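To make the "rerolling" mentioned above concrete, here is a minimal sketch of the two forms side by side. The names `add_unrolled`, `add_rerolled`, and the constant `N` are made up for illustration and do not come from the report; the rerolled function is only an assumption about what a reroll pass would effectively produce, not output from LLVM. Both functions compute the same result, and the rerolled one is the shape that would be expected to map onto a single 128-bit load, load, add, store per four elements.

```c
#include <stddef.h>

enum { N = 1024 }; /* element count used in the report */

/* Manually unrolled form, as in the report: four adds per iteration. */
void add_unrolled(unsigned *restrict p1, const unsigned *restrict p2)
{
    const unsigned *const end = p1 + N;
    while (p1 != end) {
        *p1++ += *p2++;
        *p1++ += *p2++;
        *p1++ += *p2++;
        *p1++ += *p2++;
    }
}

/* Hypothetical rerolled equivalent: one add per iteration.
 * This is an illustration of the intended transformation only. */
void add_rerolled(unsigned *restrict p1, const unsigned *restrict p2)
{
    for (size_t i = 0; i < N; ++i)
        p1[i] += p2[i];
}
```

Compiling both with the same flags and comparing the generated assembly is one way to see whether a given compiler version still treats the unrolled form differently.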