linux-sunxi / meta-sunxi

Official sunxi OpenEmbedded layer for Allwinner-based boards.
MIT License

Improve performance by changing default machine tuning options #25

Closed KristofRobot closed 10 years ago

KristofRobot commented 10 years ago

Note: I am focusing on Cubieboard2 in this post, as that is the board I own and can test on, but this should be relevant to other boards as well.

Currently our cubieboard2 machine.conf file [1] falls back on the default tuning option specified in arch-armv7a.inc [2]:

DEFAULTTUNE ?= "armv7a-neon"

This boils down to the following compiler options:

-march=armv7-a -mthumb-interwork -mfloat-abi=softfp -mfpu=neon

That is really the lowest-performance option, and the Cubieboard2 is capable of more than that. Specifically, it supports NEONv2, VFPv4 and Thumb-2 (see [3] and [4]) - but the default tuning file does not take advantage of that.

I'd propose to set the default tune to something that supports all the capabilities of the Allwinner A20 chip, i.e.:

DEFAULTTUNE = "cortexa7thf-neon"

resulting in

-march=armv7-a -marm -mthumb-interwork -mfloat-abi=hard -mfpu=neon -mtune=cortex-a7

Note that this still does not take advantage of the NEONv2/VFPv4 capabilities of the Allwinner A20 - for that we'd need -mfpu=neon-vfpv4 [5]. I am currently using an ugly hack to force this compile option in my builds, and opened a request upstream to add this ([6]).

I'll try to run some benchmarks comparing the default with the proposed tuning options above, to put some data behind this, and get an idea of how big the difference is really.

In the meantime, all comments welcome.

Thanks!

Kristof

[1] https://github.com/linux-sunxi/meta-sunxi/blob/master/conf/machine/cubieboard2.conf
[2] https://github.com/openembedded/oe-core/blob/master/meta/conf/machine/include/tune-cortexa7.inc
[3] http://linux-sunxi.org/Allwinner_SoC_Family
[4] http://wits-hep.blogspot.fr/2013/12/fftw-benchmarks-on-cortex-a7.html
[5] http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
[6] https://bugzilla.yoctoproject.org/show_bug.cgi?id=5710

KristofRobot commented 10 years ago

Btw, the ugly hack that I am using: I just replaced -mfpu=neon in oe-core/meta/conf/machine/include/arm/feature-arm-neon.inc [1] with -mfpu=neon-vfpv4, i.e.:

TUNEVALID[neon] = "Enable Neon SIMD accelerator unit."
TUNE_CCARGS .= "${@bb.utils.contains("TUNE_FEATURES", "neon", " -mfpu=neon-vfpv4", "" ,d)}"
ARMPKGSFX_FPU .= "${@bb.utils.contains("TUNE_FEATURES", "neon", "-neon", "" ,d)}"
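
To make the effect of the edited line concrete, here is a rough Python model of how the inline ${@bb.utils.contains(...)} expression picks the flag. The contains function below is a simplified stand-in written for illustration (it is not the BitBake implementation), the plain dict d is a hypothetical substitute for the BitBake datastore, and the TUNE_FEATURES values are made-up examples.

```python
def contains(variable, checkvalues, truevalue, falsevalue, d):
    """Simplified stand-in for bb.utils.contains (illustration only).

    Returns truevalue when every space-separated word in checkvalues
    appears in the space-separated value of d[variable].
    """
    present = set(d.get(variable, "").split())
    wanted = set(checkvalues.split())
    return truevalue if wanted <= present else falsevalue


# Hypothetical datastores: plain dicts standing in for BitBake's `d`.
d = {"TUNE_FEATURES": "armv7a vfp neon callconvention-hard cortexa7"}
d_without_neon = {"TUNE_FEATURES": "armv7a vfp"}

# The stock line appends " -mfpu=neon"; the hack swaps in " -mfpu=neon-vfpv4".
stock_flag = contains("TUNE_FEATURES", "neon", " -mfpu=neon", "", d)
hacked_flag = contains("TUNE_FEATURES", "neon", " -mfpu=neon-vfpv4", "", d)

# Without "neon" in TUNE_FEATURES nothing is appended at all.
no_neon = contains("TUNE_FEATURES", "neon", " -mfpu=neon-vfpv4", "", d_without_neon)

print(repr(stock_flag), repr(hacked_flag), repr(no_neon))
```

Since the hack edits a shared oe-core include, it presumably applies to every neon-enabled machine built from that tree, not just the Cubieboard2.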

If anyone has an idea of a less ugly hack, i.e. something that can be applied within meta-sunxi scope, and preferably within machine.conf, please let me know!

Kristof

[1] https://github.com/openembedded/oe-core/blob/master/meta/conf/machine/include/arm/feature-arm-neon.inc

naguirre commented 10 years ago

Hi kristof,

A year ago I had hardfp enabled for the whole layer by default. But when I presented my meta layer on the Angstrom mailing list, I was told that hardfp/softfp must be a decision of the distro, and that I had to remove this option.

So what we do in the Calaos distro is: https://github.com/calaos/calaos-os/blob/master/conf/local.conf#L48

I would prefer to enable that option by default instead of redefining it for each machine. But the Angstrom guys' argument also seems correct. I don't really know what to do here.

KristofRobot commented 10 years ago

@naguirre Ah, I was not aware of any convention of NOT putting that in machine.conf; however, if that is the case, I am fine with putting that option in local.conf.

Would be nice to document this somewhere (e.g. in the README), as people might not be aware of those options (I was not until very recently).

Btw, is there any reason why you don't have "t" (thumb) enabled?

KristofRobot commented 10 years ago

> Btw, is there any reason why you don't have "t" (thumb) enabled?

I just read in feature-arm-thumb.inc [1] that this might be slower - so that's probably why:

# Thumb code is smaller (maybe 70% of the ARM size)
# but requires more instructions (140% for 70% smaller code) so may be
# slower.

I thought I had read somewhere that Thumb was also a speed improvement, but apparently not.

EDIT: Apparently Thumb-2 is supposed to combine the best of both worlds ([2]):

 The availability of 16-bit and 32-bit instructions enable Thumb-2 to combine the code density of earlier versions of Thumb with the performance of the ARM instruction set.

[1] https://github.com/openembedded/oe-core/blob/master/meta/conf/machine/include/arm/feature-arm-thumb.inc
[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0471c/CHDFEDDB.html

KristofRobot commented 10 years ago

Btw, good thread discussing thumb performance - http://stackoverflow.com/questions/1198176/arm-vs-thumb-performance-on-iphone-3gs-non-floating-point-code

Guess that it indeed boils down to "what works best in my specific use case" - will need to do some tests :)

KristofRobot commented 10 years ago

> I'll try to run some benchmarks comparing the default with the proposed tuning options above, to put some data behind this, and get an idea of how big the difference is really.

I ran the linpack benchmark referenced at [1]

  1. DEFAULTTUNE = "armv7a-neon"
LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      32   0.97  89.81%   2.84%   7.35%  49082.897
      64   1.93  89.79%   2.85%   7.36%  49068.895
     128   3.87  89.80%   2.84%   7.36%  49077.745
     256   7.73  89.80%   2.84%   7.35%  49076.649
     512  15.46  89.80%   2.85%   7.35%  49077.121

  2. DEFAULTTUNE = "cortexa7thf-neon" & ARM_KEEP_OABI = "0" & hack to use -mfpu=neon-vfpv4
LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      32   0.80  88.27%   2.83%   8.90%  60384.217
      64   1.60  88.26%   2.86%   8.89%  60363.937
     128   3.20  88.27%   2.85%   8.89%  60373.142
     256   6.39  88.26%   2.85%   8.89%  60374.915
     512  12.78  88.27%   2.84%   8.89%  60380.737

I do not get the quite dramatic improvements listed at [1] though - those results were obtained with more aggressive compiler options.

Still, that is a nice improvement of roughly 23% in KFLOPS (about 17% less run time) in this case, and likely to be relevant in more general use cases.
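
The size of the gain can be checked directly from the two tables. The short Python sketch below uses the 512-repetition rows; percent_change is just a helper name chosen here.

```python
# KFLOPS and run time taken from the 512-repetition rows of the two tables above.
baseline_kflops, baseline_time = 49077.121, 15.46  # armv7a-neon
tuned_kflops, tuned_time = 60380.737, 12.78        # cortexa7thf-neon + neon-vfpv4 hack


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


throughput_gain = percent_change(baseline_kflops, tuned_kflops)
time_change = percent_change(baseline_time, tuned_time)

print(f"KFLOPS: {throughput_gain:+.1f}%")
print(f"Time:   {time_change:+.1f}%")
```

Note that the throughput gain and the time saving are not the same percentage, because they are measured against different baselines.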

Kristof

[1] http://linux-sunxi.org/Benchmarks

KristofRobot commented 10 years ago

In fact, even with the exact same compiler options as listed at [1], I still get only

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      32   0.70  90.05%   3.04%   6.91%  67624.822
      64   1.40  90.04%   3.04%   6.92%  67597.673
     128   2.79  90.05%   3.04%   6.92%  67619.099
     256   5.59  90.04%   3.03%   6.92%  67624.432
     512  11.17  90.05%   3.03%   6.92%  67628.568

Slightly better, but nowhere near the performance reported at [1].

Have others tried replicating those results?

[1] http://linux-sunxi.org/Benchmarks

KristofRobot commented 10 years ago

In fact, it seems that a more significant (and easier) change is just to change the CPU governor settings, as explained at [1].

With the recommended settings there, and neon-vfpv4 (but without the aggressive compiler options):

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.61  88.29%   2.82%   8.89%  158879.935
     128   1.21  88.28%   2.83%   8.89%  158827.108
     256   2.43  88.28%   2.83%   8.89%  158834.284
     512   4.86  88.28%   2.83%   8.89%  158849.928
    1024   9.73  88.22%   2.85%   8.93%  158746.159
    2048  19.45  88.25%   2.85%   8.91%  158719.016

Nice! :)
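
The exact settings recommended at [1] are not reproduced in this thread. As a minimal sketch, assuming the standard Linux cpufreq sysfs layout, switching every core to the "performance" governor (the one named in the later comments) could look like this in Python. The function name set_governor and the cpu_root parameter are illustrative choices, not part of any existing tool.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; pass a different root for testing.
CPU_ROOT = Path("/sys/devices/system/cpu")


def set_governor(governor, cpu_root=CPU_ROOT):
    """Write `governor` to every cpuN/cpufreq/scaling_governor file.

    Returns the names of the CPU directories that were updated.
    Needs root privileges when run against the real sysfs tree.
    """
    updated = []
    for gov_file in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        gov_file.write_text(governor + "\n")
        updated.append(gov_file.parent.parent.name)
    return updated
```

On the board this would be called as set_governor("performance") from a root session. The setting does not survive a reboot, so it would have to be reapplied at boot, for example from an init script.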

[1] http://linux-sunxi.org/Cpufreq

KristofRobot commented 10 years ago

With DEFAULTTUNE ?= "armv7ahf-neon" and the performance CPU governor at 1008 MHz:

# cat /proc/cpuinfo | grep Bogo
BogoMIPS    : 2011.05
BogoMIPS    : 2011.05
# linpackc 
Enter array size (q to quit) [200]:  
Memory required:  315K.

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.74  89.83%   2.82%   7.35%  128953.757
     128   1.47  89.83%   2.83%   7.34%  128925.100
     256   2.94  89.82%   2.84%   7.34%  128903.592
     512   5.89  89.82%   2.83%   7.35%  128925.549
    1024  11.77  89.82%   2.83%   7.34%  128934.675

KristofRobot commented 10 years ago

With DEFAULTTUNE ?= "armv7a-neon" and the performance CPU governor at 1008 MHz:

# cat /proc/cpuinfo | grep Bogo
BogoMIPS    : 2011.05
BogoMIPS    : 2011.05
# linpackc 
Enter array size (q to quit) [200]:  
Memory required:  315K.

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.74  89.82%   2.84%   7.35%  128994.447
     128   1.47  89.81%   2.84%   7.35%  128972.585
     256   2.94  89.81%   2.84%   7.35%  128981.669
     512   5.88  89.81%   2.84%   7.35%  128993.429
    1024  11.77  89.81%   2.84%   7.35%  128991.761

KristofRobot commented 10 years ago

I'm happy to announce that a patch that includes the new tuning options supporting 'neon-vfpv4' has been merged upstream in oe-core, see [1].

This allows you to specify DEFAULTTUNE = "cortexa7thf-neon-vfpv4" (or DEFAULTTUNE = "cortexa7hf-neon-vfpv4" if you do not like thumb) to get the most out of your A20!

Kristof

[1] http://git.yoctoproject.org/cgit.cgi/poky/commit/?id=e65422f0f79d6069a3312cb4a3d110ec809017ad

KristofRobot commented 10 years ago

I just noticed that I never actually used thumb instructions. OpenEmbedded by default includes the -marm option, rather than the -mthumb option, even when requesting a thumb tuning profile. This is discussed at [1], and is also visible in the compiler options I pasted earlier in this thread.

So yes, this reinforces the argument that, in practice, probably almost nobody uses thumb instructions (even when they might think they do).

The trick to enforce thumb is to also set:

ARM_INSTRUCTION_SET = "thumb"

I might experiment with this later.
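
The behaviour described above can be summed up in a small Python model. This is a simplified sketch, not oe-core code: the function name, the example feature strings and the "arm" default for ARM_INSTRUCTION_SET are assumptions. It is consistent with the flags listed in the first comment, where the armv7a-neon line carries no -marm at all while the cortexa7thf-neon line does.

```python
def instruction_set_flag(tune_features, arm_instruction_set="arm"):
    """Simplified model (not oe-core code) of the -marm/-mthumb choice.

    No flag is emitted unless the tune includes the "thumb" feature.
    With that feature, -mthumb is only chosen when ARM_INSTRUCTION_SET
    is explicitly "thumb"; otherwise the result is -marm.
    """
    if "thumb" not in tune_features.split():
        return ""
    return "-mthumb" if arm_instruction_set == "thumb" else "-marm"


# Hypothetical feature strings, for illustration only.
plain_tune = "armv7a vfp neon"
thumb_tune = "armv7a vfp thumb neon callconvention-hard cortexa7"

print(repr(instruction_set_flag(plain_tune)))           # no thumb feature
print(repr(instruction_set_flag(thumb_tune)))           # thumb tune, default setting
print(repr(instruction_set_flag(thumb_tune, "thumb")))  # thumb tune, thumb enforced
```

In other words, under this model a "t" tune only makes thumb available; the instruction set actually used is decided separately.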

[1] http://article.gmane.org/gmane.comp.handhelds.openembedded.core/47005

KristofRobot commented 10 years ago

(1) DEFAULTTUNE = "cortexa7hf-neon-vfpv4" & performance governor at 1080 MHz:

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.61  88.28%   2.84%   8.88%  158541.762
     128   1.22  88.27%   2.85%   8.88%  158514.885
     256   2.43  88.25%   2.85%   8.90%  158570.866
     512   4.87  88.26%   2.85%   8.89%  158566.682
    1024   9.73  88.27%   2.85%   8.88%  158557.868
    2048  19.47  88.27%   2.85%   8.89%  158558.324

Image size: 147 MB

(2) DEFAULTTUNE = "cortexa7thf-neon-vfpv4" & ARM_INSTRUCTION_SET = "thumb" & performance governor at 1080 MHz (i.e. real thumb):

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.61  88.14%   2.83%   9.04%  158024.691
     128   1.22  88.11%   2.84%   9.05%  158031.084
     256   2.45  88.11%   2.84%   9.05%  158040.035
     512   4.89  88.12%   2.84%   9.04%  158033.002
    1024   9.78  88.12%   2.84%   9.04%  158054.351
    2048  19.57  88.12%   2.84%   9.05%  158049.991

Image size: 149 MB

Conclusion: thumb performance in this simple test is 0.3% slower, and size is 1.3% smaller. So the expected tendencies described earlier (minimal performance loss, more dense) are there, but are not significant (at least not in this simple linpackc test).

Note: I ran these linpackc benchmarks multiple times, and posted one "representative" run - typically I had about 0.1% variation (200 KFLOPS) among consecutive runs.

EDIT: corrected percentages

asimko commented 9 years ago

Does anybody know how to solve this "bug"? https://bugzilla.yoctoproject.org/show_bug.cgi?id=7275

naguirre commented 9 years ago

It does seem to be a problem. Could you please open an issue?