Error: illegal operands `pv.add.sci.h a6,x0,65535'

klpeders commented 8 years ago

The attached test-case is mis-compiled when gcc is targeting pulpv2 with -O3

riscv32-unknown-elf-gcc -c -Wall -fno-builtin -O3 -m32 \ -march=IMXpulpv2 -Wa,-march=IMXpulpv2 pv-addi-regress.c xx.s: Assembler messages: xx.s:18: Error: illegal operands `pv.add.sci.h a6,x0,65535'

The code is a extracted from real code that do real stuff :-p

// File: pv-addi-regress.c

// The following test-case is mis-compiled when gcc is called as follows:
//
//   riscv32-unknown-elf-gcc -c -Wall -fno-builtin -O3 -m32 \
//      -march=IMXpulpv2 -Wa,-march=IMXpulpv2  pv-addi-regress.c
//   xx.s: Assembler messages:
//   xx.s:18: Error: illegal operands `pv.add.sci.h a6,x0,65535'
// 
// According to riscv-opc.c, the constant in the pv.add.sci.h instruction
// is of type 'bs', which is signed.

void fill_ffff(short *escape,int *N)
{
    int i;  
    short A[512];
    for (i=0; i<*N; i++)
        A[i] = 0xffff;
    *escape = A[42];
}

Here is the mis-compiled part:

    pv.add.sci.h    a6,x0,65535
    pv.ball a2,.L11
.L19:
    lp.setup    x1,a3,(.L18)     # loop setup, lc+le set
.L4:
    p.sw    a6,4(a1!)   # store post inc
.L18:
    nop
    /* loop end a3 .L4 */

BTW, the generated code is extremely convoluted! I don't think I have ever seen anything this poor at -O3 before, for example:

.L9:
    li  a5,0        // (a5=0)
    add a3,sp,1024  // (a3=sp+1024)
    sll a1,a5,1     // (a1=0)
    add a1,a3,a1    // (a1=sp+1024)
    li  a2,-1       // (a2=-1)
    add a3,a5,1     // (a3=1)
    sh  a2,-1024(a1) // (a2=*[sp+1024-1024])
    ble a4,a3,.L8   // (N<=1)?
    j   .L17

This sequence of instructions is copied 3 times, macro?

    add a3,sp,1024
    sll a1,a5,1
    add a1,a3,a1
    [...]
    sh  a2,-1024(a1)

Full code:

fill_ffff:
    lw  a4,0(a1)
    add sp,sp,-1024
    blez    a4,.L8
    add a2,a4,-2
    srl a2,a2,1
    add a3,a2,1
    add a1,a4,-1
    li  a6,2
    sll a5,a3,1
    bleu    a1,a6,.L9
    mv  a1,sp
    pv.add.sci.h    a6,x0,65535
    pv.ball a2,.L11
.L19:
    lp.setup    x1,a3,(.L18)     # loop setup, lc+le set
.L4:
    p.sw    a6,4(a1!)   # store post inc
.L18:
    nop
    /* loop end a3 .L4 */ 
    beq a4,a5,.L8
    add a3,sp,1024
    sll a1,a5,1
    add a1,a3,a1
    li  a2,-1
    add a3,a5,1
    sh  a2,-1024(a1)
    ble a4,a3,.L8
.L17:
    sll a3,a3,1
    add a1,sp,1024
    add a3,a1,a3
    add a5,a5,2
    sh  a2,-1024(a3)
    ble a4,a5,.L8
    sll a5,a5,1
    add a1,sp,1024
    add a5,a1,a5
    sh  a2,-1024(a5)
.L8:
    lhu a5,84(sp)
    add sp,sp,1024
    sh  a5,0(a0)
    jr  ra
.L9:
    li  a5,0
    add a3,sp,1024
    sll a1,a5,1
    add a1,a3,a1
    li  a2,-1
    add a3,a5,1
    sh  a2,-1024(a1)
    ble a4,a3,.L8
    j   .L17
.L11:
    li  a3,1
    j   .L19
    .size   fill_ffff, .-fill_ffff
    .ident  "GCC: (GNU) 5.2.0"

haugoug commented 8 years ago

There was indeed an issue with this instruction, we pushed a fix. Then for what concerns the efficiency of the code, it is typical of gcc with optimizations like auto-vectorization. If you want to see it on another architecture, you can try it on x86, you will see the same issue. The reason is that gcc has to cover corner cases and this takes some instructions. If you prefer to have a smaller code, you can use the option -fno-tree-vectorize to get the following code: fill_ffff: lw a5,0(a1) add sp,sp,-1024 blez a5,.L4 sll a5,a5,1 add a5,a5,-2 srl a5,a5,1 mv a4,sp li a3,-1 add a5,a5,1 lp.setup x1,a5,(.L9) # loop setup, lc+le set .L3: p.sh a3,2(a4!) # store post inc .L9: nop /* loop end a5 .L3 */ .L4: lhu a5,84(sp) add sp,sp,1024 sh a5,0(a0) jr ra

klpeders commented 8 years ago

I know how to disable the vectorizer, but that was not the game! The game was to see what kind of performance we could expect on some layer 1 code that currently runs on a home-grown CPU. Best case for ri5cy with hw-loop and the vector extensions is very close to our pretty basic MIPS-like CPU. That is completely unexpected, so I looked into why that happened.

IMHO, the code for -O3 looks so bad that it looks like sabotage.

Why is N>>1 calculated as (N-2)>>1 + 1?
Why is N < 4 calculated as N-1 < 3?
Why the special case for N < 4? The fill-loop manage that just fine, and faster. Conditionally moving the last 0xffff would be much simpler than the unrolled copy of 1 to 3 words.
After the loop at L18 - a1 is up-to-date, but it is recalculated from index and sp!
Also after L18, a6 is 0xffffffff, but it is recreated in a2.

Many times through the code this macro(?) is used:

sll a3,a3,1
add a1,sp,1024
add a3,a1,a3
sh  a2,-1024(a3)

the +1024 and -1024 make no sense:

sll a3,a3,1
add a3,sp,a3
sh  a2,0(a3)

Funny, that is mostly one MIPS instruction, LSA:

  lsa     a3,a3,sp,1
  sh      a2,0(a3)

Given the MIPS legacy I checked the code from some different versions of MIPS toolschains.

# GCC 5.1
mipsel-elf32-gcc -O2 -march=mips32r6 -mtune=mips32r6 -ftree-loop-vectorize -fno-builtin

That generates (initialization and 'vectorized' loop):

        lw      $3,0($5)
        beq     $3,$0,$L8
        addiu   $sp,$sp,-1024

        addiu   $5,$3,-2
        srl     $5,$5,1
        addiu   $6,$3,-1
        addiu   $5,$5,1
        sltu    $6,$6,3
        bne     $6,$0,$L9
        sll     $2,$5,1

        move    $7,$sp
        li      $9,-1                   # 0xffffffffffffffff
$L4:
        addiu   $6,$6,1
        sltu    $8,$6,$5
        sw      $9,0($7)
        bne     $8,$0,$L4
        addiu   $7,$7,4
[...]
$L3:
        lsa     $2,$2,$sp,1
        lsa     $3,$3,$sp,1
        li      $5,-1                   # 0xffffffffffffffff
$L6:
        sh      $5,0($2)
        addiu   $2,$2,2
        bne     $2,$3,$L6
        nop

That is pretty much identical to the the Pulp code. So the problem seem to be upstream(!)

There is a number of problems:

// Entry: $3 = *N; ($3 > 0)
        addiu   $5,$3,-2
        srl     $5,$5,1
        addiu   $5,$5,1

Those 3 lines are identical to:

// Entry: $3 = *N; ($3 > 0)
        srl     $5,$3,1

The next few lines is just as strange, (N-1 < 3) really?:

// Entry: $3 = *N; ($3 > 0)
        addiu   $6,$3,-1
        sltu    $6,$6,3
        bne     $6,$0,$L9
        sll     $2,$5,1

Better:

// Entry: $3 = *N; ($3 > 0)
        sltu    $6,$3,4
        bne     $6,$0,$L9
        sll     $2,$5,1

But why special case for N less than 4? Makes no sense unless it is some 64bit artefact.

haugoug commented 8 years ago

Actually these strange comparisons are there to avoid some overflows with corner cases. Then indeed there are some dummy stuff that could be optimized out. This probably comes from the original porting done for riscv which supports lot of variants.

klpeders commented 8 years ago

Can you give an example of such corner case? The most interesting I can find is *N=1. But that that works at all looks more like an accident!

$5 is 0x80000000 after this for N=0 and N=1: addiu $5,$3,-2 srl $5,$5,1 addiu $5,$5,1

pulp-platform / ri5cy_gnu_toolchain

Error: illegal operands `pv.add.sci.h a6,x0,65535' #1