Closed klpeders closed 8 years ago
There was indeed an issue with this instruction, we pushed a fix. Then for what concerns the efficiency of the code, it is typical of gcc with optimizations like auto-vectorization. If you want to see it on another architecture, you can try it on x86, you will see the same issue. The reason is that gcc has to cover corner cases and this takes some instructions. If you prefer to have a smaller code, you can use the option -fno-tree-vectorize to get the following code: fill_ffff: lw a5,0(a1) add sp,sp,-1024 blez a5,.L4 sll a5,a5,1 add a5,a5,-2 srl a5,a5,1 mv a4,sp li a3,-1 add a5,a5,1 lp.setup x1,a5,(.L9) # loop setup, lc+le set .L3: p.sh a3,2(a4!) # store post inc .L9: nop /* loop end a5 .L3 */ .L4: lhu a5,84(sp) add sp,sp,1024 sh a5,0(a0) jr ra
I know how to disable the vectorizer, but that was not the game! The game was to see what kind of performance we could expect on some layer 1 code that currently runs on a home-grown CPU. Best case for ri5cy with hw-loop and the vector extensions is very close to our pretty basic MIPS-like CPU. That is completely unexpected, so I looked into why that happened.
IMHO, the code for -O3 looks so bad that it looks like sabotage.
N>>1
calculated as (N-2)>>1 + 1
?N < 4
calculated as N-1 < 3
?N < 4
? The fill-loop manage that just fine, and faster. Conditionally moving the last 0xffff would be much simpler than the unrolled
copy of 1 to 3 words.L18
- a1
is up-to-date, but it is recalculated from index and sp
!L18
, a6 is 0xffffffff, but it is recreated in a2.Many times through the code this macro(?) is used:
sll a3,a3,1
add a1,sp,1024
add a3,a1,a3
sh a2,-1024(a3)
the +1024 and -1024 make no sense:
sll a3,a3,1
add a3,sp,a3
sh a2,0(a3)
Funny, that is mostly one MIPS instruction, LSA
:
lsa a3,a3,sp,1
sh a2,0(a3)
Given the MIPS legacy I checked the code from some different versions of MIPS toolschains.
# GCC 5.1
mipsel-elf32-gcc -O2 -march=mips32r6 -mtune=mips32r6 -ftree-loop-vectorize -fno-builtin
That generates (initialization and 'vectorized' loop):
lw $3,0($5)
beq $3,$0,$L8
addiu $sp,$sp,-1024
addiu $5,$3,-2
srl $5,$5,1
addiu $6,$3,-1
addiu $5,$5,1
sltu $6,$6,3
bne $6,$0,$L9
sll $2,$5,1
move $7,$sp
li $9,-1 # 0xffffffffffffffff
$L4:
addiu $6,$6,1
sltu $8,$6,$5
sw $9,0($7)
bne $8,$0,$L4
addiu $7,$7,4
[...]
$L3:
lsa $2,$2,$sp,1
lsa $3,$3,$sp,1
li $5,-1 # 0xffffffffffffffff
$L6:
sh $5,0($2)
addiu $2,$2,2
bne $2,$3,$L6
nop
That is pretty much identical to the the Pulp code. So the problem seem to be upstream(!)
There is a number of problems:
// Entry: $3 = *N; ($3 > 0)
addiu $5,$3,-2
srl $5,$5,1
addiu $5,$5,1
Those 3 lines are identical to:
// Entry: $3 = *N; ($3 > 0)
srl $5,$3,1
The next few lines is just as strange, (N-1 < 3) really?:
// Entry: $3 = *N; ($3 > 0)
addiu $6,$3,-1
sltu $6,$6,3
bne $6,$0,$L9
sll $2,$5,1
Better:
// Entry: $3 = *N; ($3 > 0)
sltu $6,$3,4
bne $6,$0,$L9
sll $2,$5,1
But why special case for N less than 4? Makes no sense unless it is some 64bit artefact.
Actually these strange comparisons are there to avoid some overflows with corner cases. Then indeed there are some dummy stuff that could be optimized out. This probably comes from the original porting done for riscv which supports lot of variants.
Can you give an example of such corner case? The most interesting I can find is *N=1. But that that works at all looks more like an accident!
$5 is 0x80000000 after this for N=0 and N=1: addiu $5,$3,-2 srl $5,$5,1 addiu $5,$5,1
The attached test-case is mis-compiled when gcc is targeting pulpv2 with -O3
The code is a extracted from real code that do real stuff :-p
Here is the mis-compiled part:
BTW, the generated code is extremely convoluted! I don't think I have ever seen anything this poor at -O3 before, for example:
This sequence of instructions is copied 3 times, macro?
Full code: