Closed zhenbohu closed 6 years ago
Try building with -mexplicit-relocs; I think the performance will come back.
@palmer-dabbelt It seems this happens because -mexplicit-relocs is not enabled by default. Is there any issue with enabling it?
@palmer-dabbelt Oh, I see the comment in GCC:
/* We get better code with explicit relocs for CM_MEDLOW, but
   worse code for the others (for now). Pick the best default. */
if ((target_flags_explicit & MASK_EXPLICIT_RELOCS) == 0)
  if (riscv_cmodel == CM_MEDLOW)
    target_flags |= MASK_EXPLICIT_RELOCS;
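The defaulting rule in that snippet can be sketched as a small standalone C model. This is only an illustration: `enum cmodel`, `resolve_explicit_relocs`, `user_set` and `user_value` are hypothetical stand-ins for GCC's internal `riscv_cmodel`, `target_flags` and `target_flags_explicit`, and the sketch assumes the flag is off unless something turns it on.

```c
#include <stdbool.h>

/* Hypothetical stand-in for GCC's internal code-model setting. */
enum cmodel { CM_MEDLOW, CM_MEDANY, CM_PIC };

/* user_set:   the user passed -mexplicit-relocs or -mno-explicit-relocs.
   user_value: what the user asked for (only meaningful if user_set).
   Returns whether explicit relocs end up enabled.
   Assumption: the flag is off by default before this rule runs. */
bool resolve_explicit_relocs(enum cmodel model, bool user_set, bool user_value)
{
    if (user_set)
        return user_value;        /* an explicit choice always wins */
    return model == CM_MEDLOW;    /* default: enabled only for medlow */
}
```

Under this model, a medany build only gets explicit relocs when the option is passed on the command line, which matches the suggestion to add -mexplicit-relocs by hand.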
@kito-cheng
Hi, Kito
Do you mean just adding the GCC option -mexplicit-relocs, with no need to rebuild the toolchain? I tried adding it as a compile option, but unfortunately the generated code is the same as before: the same poor instruction sequence is still generated, and the Dhrystone number is still very poor. Could you give me more hints or information about this?
Thanks Bob
@zhenbohu Really? Could you compile the following code with -O2 plus either -mcmodel=medany or -mcmodel=medany -mexplicit-relocs, and paste the assembly code?
BTW, problem (2) is a pointer aliasing issue: Ptr_Ref_Par may point to Ptr_Glob, so the second load is there to keep the code correct when Ptr_Ref_Par really does point to Ptr_Glob (Ptr_Ref_Par == &Ptr_Glob).
#include <stddef.h>
typedef enum {Ident_1, Ident_2, Ident_3, Ident_4, Ident_5}
        Enumeration;
typedef struct record
{
  struct record *Ptr_Comp;
  Enumeration Discr;
  union {
    struct {
      Enumeration Enum_Comp;
      int Int_Comp;
      char Str_Comp [31];
    } var_1;
    struct {
      Enumeration E_Comp_2;
      char Str_2_Comp [31];
    } var_2;
    struct {
      char Ch_1_Comp;
      char Ch_2_Comp;
    } var_3;
  } variant;
} Rec_Type, *Rec_Pointer;
Rec_Pointer Ptr_Glob;
int Int_Glob;
Proc_3 (Ptr_Ref_Par)
/******************/
    /* executed once */
    /* Ptr_Ref_Par becomes Ptr_Glob */
Rec_Pointer *Ptr_Ref_Par;
{
  if (Ptr_Glob != NULL)
    /* then, executed */
    *Ptr_Ref_Par = Ptr_Glob->Ptr_Comp;
  Proc_7 (10, Int_Glob, &Ptr_Glob->variant.var_1.Int_Comp);
} /* Proc_3 */
Compile with -mcmodel=medany:
...
Proc_3:
add sp,sp,-16
sw ra,12(sp)
lla a5,Ptr_Glob
lw a5,0(a5)
...
Compile with -O2 -mcmodel=medany -mexplicit-relocs:
...
Proc_3:
add sp,sp,-16
sw ra,12(sp)
.LA0: auipc a5,%pcrel_hi(Ptr_Glob)
lw a5,%pcrel_lo(.LA0)(a5)
...
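The aliasing point about problem (2) can be illustrated with a reduced version of Proc_3. This is a minimal sketch, not the Dhrystone source: the record type is cut down to one field, and `proc_3_reload` is a hypothetical helper that returns the value of Ptr_Glob that a second load would observe after the store.

```c
#include <stddef.h>

/* Reduced record type: only the field that Proc_3 copies. */
typedef struct record { struct record *Ptr_Comp; } Rec_Type, *Rec_Pointer;

Rec_Pointer Ptr_Glob;

/* Simplified Proc_3 (hypothetical helper): performs the store through
   Ptr_Ref_Par and returns Ptr_Glob as seen afterwards. */
Rec_Pointer proc_3_reload(Rec_Pointer *Ptr_Ref_Par)
{
    if (Ptr_Glob != NULL)
        *Ptr_Ref_Par = Ptr_Glob->Ptr_Comp;  /* may overwrite Ptr_Glob itself */
    return Ptr_Glob;                        /* so the value must be re-read */
}
```

When the argument is some unrelated pointer, Ptr_Glob is unchanged after the store. When the argument is &Ptr_Glob, the store changes Ptr_Glob, so a compiler that reused the earlier loaded value would pass a stale address to Proc_7.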
@kito-cheng
Here is a snippet of my build script:
/home/zhenbohu/jx_work/freedom-e-sdk/work/build/riscv-gnu-toolchain/riscv64-unknown-elf/prefix/bin/riscv64-unknown-elf-gcc -Os -fno-common -mcmodel=medany -march=rv32imc -mabi=ilp32 -g -march=rv32imc -mabi=ilp32 -mcmodel=medany -mexplicit-relocs -ffunction-sections -fdata-sections -fno-builtin-printf -fno-builtin-malloc -O2 -mcmodel=medany -mexplicit-relocs -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imc -mabi=ilp32 -c -o dhry_1.o dhry_1.c
......
/home/zhenbohu/jx_work/freedom-e-sdk/work/build/riscv-gnu-toolchain/riscv64-unknown-elf/prefix/bin/riscv64-unknown-elf-gcc -Os -fno-common -mcmodel=medany -march=rv32imc -mabi=ilp32 -g -march=rv32imc -mabi=ilp32 -mcmodel=medany -mexplicit-relocs -ffunction-sections -fdata-sections -fno-builtin-printf -fno-builtin-malloc -I/home/zhenbohu/jx_work/freedom-e-sdk/bsp/include -I/home/zhenbohu/jx_work/freedom-e-sdk/bsp/drivers/ -I/home/zhenbohu/jx_work/freedom-e-sdk/bsp/env -I/home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/sirv-e203-arty dhry_1.o dhry_2.o /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/start.o /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/entry.o dhry_stubs.o /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/sirv-e203-arty/init.o /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/sirv_printf.o -o dhrystone -Wl,--wrap=scanf -Wl,--wrap=printf -march=rv32imc -mabi=ilp32 -mcmodel=medany -Wl,--wrap=malloc -Wl,--wrap=free -Wl,--wrap=open -Wl,--wrap=lseek -Wl,--wrap=read -Wl,--wrap=write -Wl,--wrap=fstat -Wl,--wrap=stat -Wl,--wrap=close -Wl,--wrap=link -Wl,--wrap=unlink -Wl,--wrap=execve -Wl,--wrap=fork -Wl,--wrap=getpid -Wl,--wrap=kill -Wl,--wrap=wait -Wl,--wrap=isatty -Wl,--wrap=times -Wl,--wrap=sbrk -Wl,--wrap=_exit -L. -Wl,--start-group -lwrap -lc -Wl,--end-group -T /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/sirv-e203-arty/link.lds -nostartfiles -Wl,--gc-sections -Wl,--wrap=scanf -Wl,--wrap=malloc -Wl,--wrap=printf -Wl,--check-sections -L/home/zhenbohu/jx_work/freedom-e-sdk/bsp/env
And then I dump it:
8000073e
Can you see any clues from it?
Thanks Bob
Hmm, I don't know why -mexplicit-relocs doesn't improve your code generation, but maybe you can try building with -mcmodel=medlow?
@kito-cheng
Hi, Kito
Thanks for the info. After switching to -mcmodel=medlow as you suggested, the generated code is:
80000650
It looks like the instruction count is much lower, and the code is much better now.
What is the difference between medlow and medany? Which option should I use? Will the choice affect functional correctness?
Thanks Bob
I just added documentation: https://github.com/riscv/riscv-gcc/commit/efffc4465762e6a7533afaeec6f78ee5f838b374
I'd also expect -mexplicit-relocs to fix your problem. When combined with the latest binutils, which relaxes auipc+load sequences to gp-relative loads, you should get the same performance in medlow and medany mode.
@palmer-dabbelt @aswaterman Slightly off topic: it seems medany has no benefit on RV32. How about aliasing medany to medlow for RV32?
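As a rough sketch of when a gp-relative load can reach a symbol at all, assume that the load's immediate is a signed 12-bit value, so the symbol must lie within -2048..+2047 bytes of gp. The helper name `fits_gp_relative` and the gp value used in the note below are hypothetical, and a real linker applies further conditions before relaxing.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `sym` lies within a signed 12-bit offset of `gp`, which is the
   reach of a single gp-relative load such as  lw a5, off(gp).
   Hypothetical helper; not part of binutils. */
bool fits_gp_relative(uint32_t gp, uint32_t sym)
{
    int64_t off = (int64_t)sym - (int64_t)gp;
    return off >= -2048 && off <= 2047;
}
```

For example, with an assumed gp of 0x90000800, the Ptr_Glob address 0x90000470 from the dumps is 912 bytes below gp and would be in range.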
@kito-cheng medany is useful for RV32 for some specialized code (e.g. low-level boot code that is almost-PIC)
@kito-cheng
Hi, Kito
After I switched to medlow, compiled my Dhrystone code and executed it on the board, the result was strangely incorrect, and I cannot find the root cause.
BTW, I noticed that in the freedom-e-sdk (by SiFive) software demo, the Makefile of the Dhrystone program explicitly uses the option -mcmodel=medany.
Do you have any rough idea why the Dhrystone program from freedom-e-sdk explicitly uses -mcmodel=medany, and why the result is incorrect when I change it to medlow?
Thanks Bob
@zhenbohu I am not from SiFive, so for the reason why -mcmodel=medany is enabled by default in freedom-e-sdk you may want to ask @aswaterman or @palmer-dabbelt.
Our 64-bit cores have the scratchpad at 0x80000000, which is above the region that can be addressed by the medlow code model.
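The reach limit can be sketched with a small check, assuming the range that the GCC documentation gives for medlow (absolute addresses between -2 GiB and +2 GiB). The function name `medlow_reachable_rv64` is hypothetical, and the check ignores the finer edge effects of the actual lui+addi encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Documented medlow range on RV64: absolute addresses in [-2 GiB, +2 GiB),
   i.e. 64-bit values below 0x80000000 or at/above 0xFFFFFFFF80000000.
   Hypothetical helper for illustration only. */
bool medlow_reachable_rv64(uint64_t addr)
{
    return addr < UINT64_C(0x80000000) ||
           addr >= UINT64_C(0xFFFFFFFF80000000);
}
```

Under this model, 0x80000000 is the first address outside the positive half of the range on a 64-bit core, which is why code and data placed there need medany.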
Dear Experts
Many people, including myself, are seeing performance drop with the latest toolchain compared with the old GCC 6.1.0 version, not only in benchmarks but also in code size.
Is this problem on the way to being fixed? Could you kindly share any information you have?
Thanks Bob
@kito-cheng
Hi, Kito
Would you mind sharing some information about this?
Thanks Bob
@palmer-dabbelt @aswaterman
Hi, Palmer, Andrew
Do you guys have some information to share?
Thanks Bob
We are working to improve code generation. But it's not a general issue; it's a large number of small issues. If you can file new issues with specific examples of worse code generation, we can try to improve the compiler.
It looks like the only problem here is accidental use of -mcmodel=medany when they wanted -mcmodel=medlow code.
FWIW: medany is fast now, so that's not even a problem any more :)
Hi,
Several months ago I was using the RISC-V GCC toolchain based on GCC 6.1.0. Recently I upgraded my environment to the latest toolchain build, which is based on GCC 7.1.0. Unfortunately, after switching to the new version, my Dhrystone benchmark number dropped a lot, from around 1.3 DMIPS/MHz to 1.0 DMIPS/MHz. The old toolchain is about 30% faster, which is a really big gap.
Since I am a hardware engineer and not a compiler expert, I cannot identify the root cause of this degradation. I built the ELF with both toolchain versions and diffed the dump files, and I found some obvious defects in the code generated by the new toolchain (7.1.0). Please see the snippets below, which show the same function generated by the two toolchain versions:
Code generated by the old toolchain (GCC 6.1.0), which has better performance:
800007de <Proc_3>:
800007de: 10000617 auipc a2,0x10000
800007e2: c9262603 lw a2,-878(a2) # 90000470
800007e6: c619 beqz a2,800007f4 <Proc_3+0x16>
800007e8: 421c lw a5,0(a2)
800007ea: c11c sw a5,0(a0)
800007ec: 10000617 auipc a2,0x10000
800007f0: c8462603 lw a2,-892(a2) # 90000470
800007f4: 0631 addi a2,a2,12
800007f6: 10000597 auipc a1,0x10000
800007fa: c6e5a583 lw a1,-914(a1) # 90000464
800007fe: 4529 li a0,10
80000800: a201 j 80000900
Code generated by the new toolchain (GCC 7.1.0), which has much worse performance:
80000746 <Proc_3>:
80000746: 10000797 auipc a5,0x10000
8000074a: d2a78793 addi a5,a5,-726 # 90000470
8000074e: 4390 lw a2,0(a5)
80000750: c601 beqz a2,80000758 <Proc_3+0x12>
80000752: 4218 lw a4,0(a2)
80000754: c118 sw a4,0(a0)
80000756: 4390 lw a2,0(a5)
80000758: 10000797 auipc a5,0x10000
8000075c: d0c78793 addi a5,a5,-756 # 90000464
80000760: 438c lw a1,0(a5)
80000762: 0631 addi a2,a2,12
80000764: 4529 li a0,10
80000766: a86d j 80000820
We can see very obvious defects in the code generated by GCC 7.1.0, summarized below:
*** Problem (1): it uses three instructions instead of two to load a word from an address. In the code below, the LW instruction uses register a5 plus a zero offset. I noticed that this kind of sequence appears everywhere in the dhrystone.dump file, with very high frequency. By contrast, this worse sequence does not exist in the code generated by GCC 6.1.0. I guess this is one of the main causes of the poor performance.
80000746: 10000797 auipc a5,0x10000
8000074a: d2a78793 addi a5,a5,-726 # 90000470
8000074e: 4390 lw a2,0(a5)
......
80000758: 10000797 auipc a5,0x10000
8000075c: d0c78793 addi a5,a5,-756 # 90000464
80000760: 438c lw a1,0(a5)
*** Problem (2): a redundant instruction is inserted; see the snippet below. This instruction appears to be unnecessary, and it does not exist in the code generated by GCC 6.1.0. I guess this is another cause of the poor performance.
80000756: 4390 lw a2,0(a5)
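For problem (1), the address arithmetic of the two sequences can be checked with a short C sketch. It uses the pc values and immediates from the two dumps and assumes 32-bit addresses (the build is rv32imc); the helper names `auipc`, `add_imm`, `old_load_addr` and `new_load_addr` are made up for this illustration.

```c
#include <stdint.h>

/* auipc rd, imm20 : rd = pc + (imm20 << 12) */
uint32_t auipc(uint32_t pc, uint32_t imm20)
{
    return pc + (imm20 << 12);
}

/* Address formed from a base register plus a signed 12-bit immediate. */
uint32_t add_imm(uint32_t base, int32_t imm12)
{
    return base + (uint32_t)imm12;
}

/* GCC 6.1.0 sequence: auipc, then lw with the offset folded into the load. */
uint32_t old_load_addr(void)
{
    return add_imm(auipc(0x800007deu, 0x10000u), -878);
}

/* GCC 7.1.0 sequence: auipc + addi, then lw with a zero offset. */
uint32_t new_load_addr(void)
{
    uint32_t a5 = add_imm(auipc(0x80000746u, 0x10000u), -726);
    return add_imm(a5, 0);
}
```

Both functions evaluate to 0x90000470, the address shown in the dump comments, so the extra addi only materializes an address that the load's own immediate could have supplied.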
Since the data I collected shows a very significant performance degradation, I don't think this is a minor issue. Could you help identify and resolve it? I am not sure whether I reported this issue in the right place.
BTW: a similar problem has also been reported in the SiFive forum. Here is the link, FYI: https://forums.sifive.com/t/poor-dhrystone-performance/233/48 (see the last comments exchanged between "Drew" and me).
Thanks very much for your help.
Thanks Bob