Huge dhrystone performance degradation caused by the GCC bumped from 6.1.0 to 7.1.0 version

zhenbohu commented 7 years ago

Hi,

I was using a RISC-V GCC toolchain built several months ago (based on GCC 6.1.0). Recently I upgraded my database and switched to the latest toolchain build (based on GCC 7.1.0). Unfortunately, after switching to the new version, my Dhrystone score dropped considerably, from around 1.3 DMIPS/MHz to 1.0 DMIPS/MHz. That is a drop of about 23% (put the other way, the old toolchain scores about 30% higher), which is a really big gap.

Since I am a hardware guy and not a compiler expert, I cannot identify the root cause of this degradation. I built the ELF with both toolchain versions and diffed their dump files, and I found some interesting, obvious defects in the code generated by the new toolchain (7.1.0). Please see the snippets below (the same function as generated by the two toolchain versions):

Code generated by the old toolchain (GCC 6.1.0), which has the better performance:

    800007de <Proc_3>:
    800007de:  10000617  auipc a2,0x10000
    800007e2:  c9262603  lw    a2,-878(a2) # 90000470
    800007e6:  c619      beqz  a2,800007f4 <Proc_3+0x16>
    800007e8:  421c      lw    a5,0(a2)
    800007ea:  c11c      sw    a5,0(a0)
    800007ec:  10000617  auipc a2,0x10000
    800007f0:  c8462603  lw    a2,-892(a2) # 90000470
    800007f4:  0631      addi  a2,a2,12
    800007f6:  10000597  auipc a1,0x10000
    800007fa:  c6e5a583  lw    a1,-914(a1) # 90000464
    800007fe:  4529      li    a0,10
    80000800:  a201      j     80000900

Code generated by the new toolchain (GCC 7.1.0), which has much worse performance:

    80000746 <Proc_3>:
    80000746:  10000797  auipc a5,0x10000
    8000074a:  d2a78793  addi  a5,a5,-726 # 90000470
    8000074e:  4390      lw    a2,0(a5)
    80000750:  c601      beqz  a2,80000758 <Proc_3+0x12>
    80000752:  4218      lw    a4,0(a2)
    80000754:  c118      sw    a4,0(a0)
    80000756:  4390      lw    a2,0(a5)
    80000758:  10000797  auipc a5,0x10000
    8000075c:  d0c78793  addi  a5,a5,-756 # 90000464
    80000760:  438c      lw    a1,0(a5)
    80000762:  0631      addi  a2,a2,12
    80000764:  4529      li    a0,10
    80000766:  a86d      j     80000820

We can see obvious defects in the code generated by GCC 7.1.0, summarized below:

*** Problem (1): it uses three instructions instead of two to load a word from an address. In the code below, the LW instruction uses register a5 plus a zero offset. This pattern appears very frequently throughout the entire dhrystone.dump file, whereas it does not exist in the code generated by GCC 6.1.0. I guess this is one of the main causes of the bad performance.

    80000746:  10000797  auipc a5,0x10000
    8000074a:  d2a78793  addi  a5,a5,-726 # 90000470
    8000074e:  4390      lw    a2,0(a5)
    ......
    80000758:  10000797  auipc a5,0x10000
    8000075c:  d0c78793  addi  a5,a5,-756 # 90000464
    80000760:  438c      lw    a1,0(a5)

*** Problem (2): a redundant instruction is inserted, see the snippet below. This instruction is obviously not needed there, but it is inserted for no apparent reason. In contrast, this redundant instruction does not exist in the code generated by GCC 6.1.0. I guess this is another cause of the bad performance.

    80000756:  4390      lw    a2,0(a5)

Since the data I got shows a very significant performance degradation, I don't think this is a minor issue. Could you help to identify and resolve it? I am not sure whether I am reporting this in the right place.

BTW: a similar problem has also been reported in the SiFive forum. The link is here FYI: https://forums.sifive.com/t/poor-dhrystone-performance/233/48 (see the last comments exchanged between "Drew" and me).

Thanks very much for your help.

Thanks Bob

kito-cheng commented 7 years ago

Try building with -mexplicit-relocs. I think the performance should come back.

@palmer-dabbelt it seems this happens because -mexplicit-relocs is not enabled by default. Is there any issue with enabling it?

kito-cheng commented 7 years ago

@palmer-dabbelt oh I see the comment in gcc

  /* We get better code with explicit relocs for CM_MEDLOW, but
     worse code for the others (for now).  Pick the best default.  */
  if ((target_flags_explicit & MASK_EXPLICIT_RELOCS) == 0)
    if (riscv_cmodel == CM_MEDLOW)
      target_flags |= MASK_EXPLICIT_RELOCS;

zhenbohu commented 7 years ago

@kito-cheng

Hi, Kito

Do you mean just adding the GCC option -mexplicit-relocs (with no need to rebuild the toolchain)? I tried adding it as a compile option. Unfortunately the generated code is still the same as before: the same poor code is still generated, and the Dhrystone number is still very bad. Could you kindly give me more hints or information about this?

Thanks Bob

kito-cheng commented 7 years ago

@zhenbohu really? Could you compile the Proc_3 test case below with -O2 plus -mcmodel=medany -mexplicit-relocs and paste the assembly code?

BTW, problem 2 is a pointer aliasing issue: Ptr_Ref_Par may point to Ptr_Glob, and the second load is there to keep the code correct when Ptr_Ref_Par really does point to Ptr_Glob (Ptr_Ref_Par == &Ptr_Glob).
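As an aside to the test case that follows, the aliasing argument can be checked with a small standalone sketch. It is not Dhrystone code: the record is cut down to two fields, and proc_3_sketch, aliased_call_follows_store and separate_call_keeps_glob are hypothetical names introduced only for this illustration. Note that the 6.1.0 listing earlier in the thread performs the same reload of Ptr_Glob, through a second auipc/lw pair at 800007ec.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the Dhrystone record: only the fields used here. */
typedef struct record {
    struct record *Ptr_Comp;
    int            Int_Comp;
} Rec_Type, *Rec_Pointer;

static Rec_Pointer Ptr_Glob;

/* Same shape as Proc_3, but it returns the address that Proc_3 would pass
   to Proc_7, so the effect of the store can be observed by the caller.
   Ptr_Glob is non-NULL in both callers below. */
static int *proc_3_sketch(Rec_Pointer *Ptr_Ref_Par)
{
    assert(Ptr_Ref_Par != NULL);
    if (Ptr_Glob != NULL)
        *Ptr_Ref_Par = Ptr_Glob->Ptr_Comp;  /* may overwrite Ptr_Glob itself */
    return &Ptr_Glob->Int_Comp;             /* hence Ptr_Glob is read again  */
}

/* Aliased call: Ptr_Ref_Par == &Ptr_Glob. Returns 1 if the result refers
   to the record that Ptr_Glob was redirected to by the store. */
static int aliased_call_follows_store(void)
{
    static Rec_Type first, second;
    first.Ptr_Comp = &second;
    Ptr_Glob = &first;
    int *p = proc_3_sketch(&Ptr_Glob);
    return Ptr_Glob == &second && p == &second.Int_Comp;
}

/* Non-aliased call: the store goes to a separate pointer, so Ptr_Glob is
   unchanged. Returns 1 if the result still refers to the first record. */
static int separate_call_keeps_glob(void)
{
    static Rec_Type first, second;
    Rec_Pointer other = NULL;
    first.Ptr_Comp = &second;
    Ptr_Glob = &first;
    int *p = proc_3_sketch(&other);
    return Ptr_Glob == &first && other == &second && p == &first.Int_Comp;
}
```

Because the compiler cannot rule out the aliased call when it compiles Proc_3 on its own, it has to read Ptr_Glob again after the store, which is the second lw in the listings.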

#include <stddef.h>
typedef       enum    {Ident_1, Ident_2, Ident_3, Ident_4, Ident_5}
              Enumeration;

typedef struct record
    {
    struct record *Ptr_Comp;
    Enumeration    Discr;
    union {
          struct {
                  Enumeration Enum_Comp;
                  int         Int_Comp;
                  char        Str_Comp [31];
                  } var_1;
          struct {
                  Enumeration E_Comp_2;
                  char        Str_2_Comp [31];
                  } var_2;
          struct {
                  char        Ch_1_Comp;
                  char        Ch_2_Comp;
                  } var_3;
          } variant;
      } Rec_Type, *Rec_Pointer;

Rec_Pointer     Ptr_Glob;
int             Int_Glob;

Proc_3 (Ptr_Ref_Par)
  /******************/
      /* executed once */
      /* Ptr_Ref_Par becomes Ptr_Glob */

  Rec_Pointer *Ptr_Ref_Par;

{
  if (Ptr_Glob != NULL)
    /* then, executed */
    *Ptr_Ref_Par = Ptr_Glob->Ptr_Comp;
  Proc_7 (10, Int_Glob, &Ptr_Glob->variant.var_1.Int_Comp);
} /* Proc_3 */

Compile with -mcmodel=medany:

...
Proc_3:
    add sp,sp,-16
    sw  ra,12(sp)
    lla a5,Ptr_Glob
    lw  a5,0(a5)
...

Compile with -O2 -mcmodel=medany -mexplicit-relocs:

...
Proc_3:
    add sp,sp,-16
    sw  ra,12(sp)
    .LA0: auipc a5,%pcrel_hi(Ptr_Glob)
    lw  a5,%pcrel_lo(.LA0)(a5)
...
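The %pcrel_hi / %pcrel_lo pair in the listing above splits a PC-relative offset into an upper part for auipc and a signed 12-bit part that can ride along in the load itself, which is why two instructions are enough. The following is a minimal sketch of that arithmetic for RV32, checked against addresses taken from the dumps in this thread. The names pcrel_split, split_pcrel, pcrel_target and pcrel_examples are hypothetical and are not part of GCC or binutils.

```c
#include <assert.h>
#include <stdint.h>

/* Split of a PC-relative offset into the parts used by auipc and a load. */
typedef struct {
    uint32_t hi20;  /* immediate for auipc (it lands in bits 31:12)  */
    int32_t  lo12;  /* signed 12-bit offset for the lw (or the addi) */
} pcrel_split;

static pcrel_split split_pcrel(uint32_t pc, uint32_t target)
{
    uint32_t offset = target - pc;            /* wraps modulo 2^32, as on RV32 */
    pcrel_split s;
    /* Adding 0x800 rounds to nearest, so the remainder fits in 12 signed bits. */
    s.hi20 = (offset + 0x800u) >> 12;
    s.lo12 = (int32_t)(offset - (s.hi20 << 12));
    assert(s.lo12 >= -2048 && s.lo12 <= 2047);
    return s;
}

/* Address reached by "auipc rd,hi20" followed by "lw rd,lo12(rd)" at pc. */
static uint32_t pcrel_target(uint32_t pc, pcrel_split s)
{
    return pc + (s.hi20 << 12) + (uint32_t)s.lo12;
}

/* Usage: the first auipc/lw pair of the GCC 6.1.0 listing. */
static void pcrel_examples(void)
{
    pcrel_split s = split_pcrel(0x800007deu, 0x90000470u);
    assert(s.hi20 == 0x10000u);   /* auipc a2,0x10000 */
    assert(s.lo12 == -878);       /* lw a2,-878(a2)   */
    assert(pcrel_target(0x800007deu, s) == 0x90000470u);
}
```

The same split applies to the three-instruction form: there the low part goes into an addi, and the lw that follows uses offset 0.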

zhenbohu commented 7 years ago

@kito-cheng

You can see my script snippet:

/home/zhenbohu/jx_work/freedom-e-sdk/work/build/riscv-gnu-toolchain/riscv64-unknown-elf/prefix/bin/riscv64-unknown-elf-gcc -Os -fno-common -mcmodel=medany -march=rv32imc -mabi=ilp32 -g -march=rv32imc -mabi=ilp32 -mcmodel=medany -mexplicit-relocs -ffunction-sections -fdata-sections -fno-builtin-printf -fno-builtin-malloc -O2 -mcmodel=medany -mexplicit-relocs -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imc -mabi=ilp32 -c -o dhry_1.o dhry_1.c

......

/home/zhenbohu/jx_work/freedom-e-sdk/work/build/riscv-gnu-toolchain/riscv64-unknown-elf/prefix/bin/riscv64-unknown-elf-gcc -Os -fno-common -mcmodel=medany -march=rv32imc -mabi=ilp32 -g -march=rv32imc -mabi=ilp32 -mcmodel=medany -mexplicit-relocs -ffunction-sections -fdata-sections -fno-builtin-printf -fno-builtin-malloc -I/home/zhenbohu/jx_work/freedom-e-sdk/bsp/include -I/home/zhenbohu/jx_work/freedom-e-sdk/bsp/drivers/ -I/home/zhenbohu/jx_work/freedom-e-sdk/bsp/env -I/home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/sirv-e203-arty dhry_1.o dhry_2.o /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/start.o /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/entry.o dhry_stubs.o /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/sirv-e203-arty/init.o /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/sirv_printf.o -o dhrystone -Wl,--wrap=scanf -Wl,--wrap=printf -march=rv32imc -mabi=ilp32 -mcmodel=medany -Wl,--wrap=malloc -Wl,--wrap=free -Wl,--wrap=open -Wl,--wrap=lseek -Wl,--wrap=read -Wl,--wrap=write -Wl,--wrap=fstat -Wl,--wrap=stat -Wl,--wrap=close -Wl,--wrap=link -Wl,--wrap=unlink -Wl,--wrap=execve -Wl,--wrap=fork -Wl,--wrap=getpid -Wl,--wrap=kill -Wl,--wrap=wait -Wl,--wrap=isatty -Wl,--wrap=times -Wl,--wrap=sbrk -Wl,--wrap=_exit -L. -Wl,--start-group -lwrap -lc -Wl,--end-group -T /home/zhenbohu/jx_work/freedom-e-sdk/bsp/env/sirv-e203-arty/link.lds -nostartfiles -Wl,--gc-sections -Wl,--wrap=scanf -Wl,--wrap=malloc -Wl,--wrap=printf -Wl,--check-sections -L/home/zhenbohu/jx_work/freedom-e-sdk/bsp/env

And then I dump it:

    8000073e <Proc_3>:
    8000073e:  10000797  auipc a5,0x10000
    80000742:  d3278793  addi  a5,a5,-718 # 90000470
    80000746:  4390      lw    a2,0(a5)
    80000748:  c601      beqz  a2,80000750 <Proc_3+0x12>
    8000074a:  4218      lw    a4,0(a2)
    8000074c:  c118      sw    a4,0(a0)
    8000074e:  4390      lw    a2,0(a5)
    80000750:  10000797  auipc a5,0x10000
    80000754:  d147a583  lw    a1,-748(a5) # 90000464
    80000758:  0631      addi  a2,a2,12
    8000075a:  4529      li    a0,10
    8000075c:  a86d      j     80000816

Can you see any clues from it?

Thanks Bob

kito-cheng commented 7 years ago

Hmmmm, I don't know why -mexplicit-relocs doesn't improve your code generation, but maybe you can try building with -mcmodel=medlow?

zhenbohu commented 7 years ago

@kito-cheng

Hi, Kito

Thanks for your info. After I used -mcmodel=medlow as you suggested, the generated code is:

    80000650 <Proc_3>:
    80000650:  84818793  addi  a5,gp,-1976
    80000654:  4390      lw    a2,0(a5)
    80000656:  c601      beqz  a2,8000065e <Proc_3+0xe>
    80000658:  4218      lw    a4,0(a2)
    8000065a:  c118      sw    a4,0(a0)
    8000065c:  4390      lw    a2,0(a5)
    8000065e:  83c1a583  lw    a1,-1988(gp)
    80000662:  0631      addi  a2,a2,12
    80000664:  4529      li    a0,10
    80000666:  a04d      j     80000708

It looks like the instruction count is much lower, and the code is much better now.

What is the difference between medlow and medany? Which option exactly should I use? Will it affect functional correctness?

Thanks Bob

palmer-dabbelt commented 7 years ago

I just added documentation: https://github.com/riscv/riscv-gcc/commit/efffc4465762e6a7533afaeec6f78ee5f838b374

I'd also expect -mexplicit-relocs to fix your problem. When combined with the latest binutils, which relaxes auipc+load sequences to gp-relative loads, you should get the same performance in medlow and medany mode.
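As a rough illustration of the gp-relative form mentioned here: a single load can reach a symbol from gp only if the distance fits in the signed 12-bit offset field. The sketch below checks that reach condition only and does not model the linker's actual relaxation logic. The gp value 0x90000c28 is an assumption, back-computed from the -1976 offset in the medlow listing under the further assumption that Ptr_Glob still sits at 0x90000470 in that build. The names gp_offset, fits_gp_relative and gp_examples are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Signed distance from gp to a symbol (32-bit addresses, as on RV32). */
static int32_t gp_offset(uint32_t gp, uint32_t sym)
{
    return (int32_t)(sym - gp);
}

/* Returns 1 if sym can be addressed as a signed 12-bit offset from gp,
   i.e. by a single "lw rd,offset(gp)". */
static int fits_gp_relative(uint32_t gp, uint32_t sym)
{
    int32_t delta = gp_offset(gp, sym);
    return delta >= -2048 && delta <= 2047;
}

/* Usage with an ASSUMED gp value; see the lead-in for how it was derived. */
static void gp_examples(void)
{
    const uint32_t assumed_gp = 0x90000c28u;
    assert(gp_offset(assumed_gp, 0x90000470u) == -1976);  /* Ptr_Glob */
    assert(gp_offset(assumed_gp, 0x90000464u) == -1988);  /* Int_Glob */
    assert(fits_gp_relative(assumed_gp, 0x90000470u));
}
```

Under these assumptions both globals sit within reach of gp, which is consistent with the single-instruction accesses in the medlow listing.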

kito-cheng commented 7 years ago

@palmer-dabbelt @aswaterman Slightly off topic: medany seems to have no benefit on RV32. How about aliasing medany to medlow for RV32?

aswaterman commented 7 years ago

@kito-cheng medany is useful for RV32 for some specialized code (e.g. low-level boot code that is almost-PIC)

zhenbohu commented 7 years ago

@kito-cheng

Hi, Kito

After I switched to medlow, compiled my Dhrystone code and executed it on the board, the result is strangely incorrect, and I cannot find the root cause.

BTW: I noticed that in the freedom-e-sdk (by SiFive) software demo, the Makefile of the Dhrystone program explicitly uses the option -mcmodel=medany.

Do you have any rough idea why the Dhrystone in freedom-e-sdk explicitly uses -mcmodel=medany, and why the result is incorrect if I change it to medlow?

Thanks Bob

kito-cheng commented 7 years ago

@zhenbohu I am not from SiFive, so for the reason why -mcmodel=medany is enabled by default in freedom-e-sdk you may want to ask @aswaterman or @palmer-dabbelt.

palmer-dabbelt commented 7 years ago

Our 64-bit cores have the scratchpad at 0x80000000, which is above the region that can be addressed by the medlow code model.
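To make the address limit concrete, here is a minimal sketch that uses the simplified window given in the GCC documentation for -mcmodel=medlow, namely absolute addresses between -2 GiB and +2 GiB. On RV64 the lui/addi result is sign-extended from 32 bits, so 0x80000000 falls just outside that window. The function names are hypothetical, and the exact edges of the real lui+addi range differ slightly from this simplification.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a 64-bit absolute address lies in the window documented for
   -mcmodel=medlow: [-2 GiB, +2 GiB) when the address is read as signed. */
static int medlow_reachable_rv64(uint64_t addr)
{
    int64_t s = (int64_t)addr;
    return s >= -(INT64_C(1) << 31) && s < (INT64_C(1) << 31);
}

/* Usage: the scratchpad base mentioned above, and its neighbours. */
static void medlow_examples(void)
{
    assert(medlow_reachable_rv64(UINT64_C(0x7fffffff)));          /* last low address  */
    assert(!medlow_reachable_rv64(UINT64_C(0x80000000)));         /* scratchpad base   */
    assert(medlow_reachable_rv64(UINT64_C(0xffffffff80000000)));  /* top 2 GiB of RV64 */
}
```

On RV32 registers are only 32 bits wide, so a lui/addi pair can form any 32-bit address and this limit does not arise.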

zhenbohu commented 7 years ago

Dear Experts

Many folks, including myself, are seeing a drop with the latest toolchain compared with the old GCC 6.1.0 version, not only in benchmark scores but also in code size.

Is a fix for this problem already on the way? Could you kindly share any information you have?

Thanks Bob

zhenbohu commented 7 years ago

@kito-cheng

Hi, Kito. Would you mind sharing some information about this?

Thanks Bob

zhenbohu commented 7 years ago

@palmer-dabbelt @aswaterman

Hi, Palmer, Andrew

Do you guys have some information to share?

Thanks Bob

aswaterman commented 7 years ago

We are working to improve code generation. But it's not a general issue; it's a large number of small issues. If you can file new issues with specific examples of worse code generation, we can try to improve the compiler.

jim-wilson commented 6 years ago

It looks like the only problem here is accidental use of -mcmodel=medany when they wanted -mcmodel=medlow code.

palmer-dabbelt commented 6 years ago

FWIW: medany is fast now, so that's not even a problem any more :)

riscv-collab / riscv-gnu-toolchain

Huge dhrystone performance degradation caused by the GCC bumped from 6.1.0 to 7.1.0 version #249