Btw, the ugly hack that I am using:
Just replaced -mfpu=neon in oe-core/meta/conf/machine/include/arm/feature-arm-neon.inc [1] with -mfpu=neon-vfpv4, i.e.:
TUNEVALID[neon] = "Enable Neon SIMD accelerator unit."
TUNE_CCARGS .= "${@bb.utils.contains("TUNE_FEATURES", "neon", " -mfpu=neon-vfpv4", "" ,d)}"
ARMPKGSFX_FPU .= "${@bb.utils.contains("TUNE_FEATURES", "neon", "-neon", "" ,d)}"
If anyone has an idea of a less ugly hack, i.e. something that can be applied within meta-sunxi scope, and preferably within machine.conf, please let me know!
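One idea I have not tried yet (just a sketch, untested): append the flag from the machine config itself, counting on the fact that the last -mfpu option on the gcc command line wins, e.g. in cubieboard2.conf after the tune include:
TUNE_CCARGS .= "${@bb.utils.contains("TUNE_FEATURES", "neon", " -mfpu=neon-vfpv4", "", d)}"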
Kristof
Hi Kristof,
A year ago I had hardfp enabled for the whole layer by default. But when I presented my layer on the Angstrom mailing list, I got the answer that hardfp/softfp must be a decision of the distro, and that I had to remove this option.
So what we do in the Calaos distro is this: https://github.com/calaos/calaos-os/blob/master/conf/local.conf#L48
I would prefer to enable that option by default instead of redefining it for each machine, but the argument of the Angstrom guys also seems correct. I don't really know what to do here.
@naguirre Ah, I was not aware of any convention that this should NOT go in machine.conf; but if that is the case, I am fine with putting that option in local.conf.
It would be nice to document this somewhere (e.g. in the README), as people might not be aware of these options (I was not until very recently).
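Something like this in local.conf, I suppose (machine name and tune value are just an illustration, along the lines of what calaos does):
DEFAULTTUNE_cubieboard2 = "cortexa7thf-neon"
i.e. a per-machine override, so that machine.conf itself stays neutral on the hardfp/softfp question.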
Btw, is there any reason why you don't have "t" (thumb) enabled?
I just read in feature-arm-thumb.inc [1] that this might be slower - so that's probably why:
# Thumb code is smaller (maybe 70% of the ARM size)
# but requires more instructions (140% for 70% smaller code) so may be
# slower.
I thought I had read somewhere that it was also a speed improvement, but apparently not.
EDIT: Apparently Thumb-2 is supposed to combine the best of both worlds ([2]):
The availability of 16-bit and 32-bit instructions enable Thumb-2 to combine the code density of earlier versions of Thumb with the performance of the ARM instruction set.
[1] https://github.com/openembedded/oe-core/blob/master/meta/conf/machine/include/arm/feature-arm-thumb.inc
[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0471c/CHDFEDDB.html
Btw, good thread discussing Thumb performance: http://stackoverflow.com/questions/1198176/arm-vs-thumb-performance-on-iphone-3gs-non-floating-point-code
I guess it indeed boils down to "what works best in my specific use case" - will need to do some tests :)
I'll try to run some benchmarks comparing the default with the proposed tuning options above, to put some data behind this and get an idea of how big the difference really is.
I ran the linpack benchmark referenced at [1]
DEFAULTTUNE = "armv7a-neon"
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
32 0.97 89.81% 2.84% 7.35% 49082.897
64 1.93 89.79% 2.85% 7.36% 49068.895
128 3.87 89.80% 2.84% 7.36% 49077.745
256 7.73 89.80% 2.84% 7.35% 49076.649
512 15.46 89.80% 2.85% 7.35% 49077.121
DEFAULTTUNE = "cortexa7thf-neon"
& ARM_KEEP_OABI = "0"
& hack to use -mfpu=neon-vfpv4
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
32 0.80 88.27% 2.83% 8.90% 60384.217
64 1.60 88.26% 2.86% 8.89% 60363.937
128 3.20 88.27% 2.85% 8.89% 60373.142
256 6.39 88.26% 2.85% 8.89% 60374.915
512 12.78 88.27% 2.84% 8.89% 60380.737
I do not get the quite dramatic improvements listed at [1] though - those results were obtained with more aggressive compiler options.
Still, a nice ~20% improvement in this case, and likely to be relevant in more general use cases too.
Kristof
In fact, even with the exact same compiler options as listed at [1], I still only get:
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
32 0.70 90.05% 3.04% 6.91% 67624.822
64 1.40 90.04% 3.04% 6.92% 67597.673
128 2.79 90.05% 3.04% 6.92% 67619.099
256 5.59 90.04% 3.03% 6.92% 67624.432
512 11.17 90.05% 3.03% 6.92% 67628.568
Slightly better, but nowhere near the performance reported at [1].
Have others tried replicating those results?
In fact, it seems that a more significant (and easier) improvement is simply to change the CPU governor settings, as explained at [1].
With the recommended settings there, and neon-vfpv4 (but without the aggressive compiler options):
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.61 88.29% 2.82% 8.89% 158879.935
128 1.21 88.28% 2.83% 8.89% 158827.108
256 2.43 88.28% 2.83% 8.89% 158834.284
512 4.86 88.28% 2.83% 8.89% 158849.928
1024 9.73 88.22% 2.85% 8.93% 158746.159
2048 19.45 88.25% 2.85% 8.91% 158719.016
Nice! :)
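For reference, switching to the performance governor is just a sysfs write (the standard cpufreq interface; the exact path may differ depending on the kernel):
# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq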
With:
DEFAULTTUNE ?= "armv7ahf-neon"
& performance CPU governor at 1008 MHz:
# cat /proc/cpuinfo | grep Bogo
BogoMIPS : 2011.05
BogoMIPS : 2011.05
# linpackc
Enter array size (q to quit) [200]:
Memory required: 315K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.74 89.83% 2.82% 7.35% 128953.757
128 1.47 89.83% 2.83% 7.34% 128925.100
256 2.94 89.82% 2.84% 7.34% 128903.592
512 5.89 89.82% 2.83% 7.35% 128925.549
1024 11.77 89.82% 2.83% 7.34% 128934.675
With:
DEFAULTTUNE ?= "armv7a-neon"
& performance CPU governor at 1008 MHz:
# cat /proc/cpuinfo | grep Bogo
BogoMIPS : 2011.05
BogoMIPS : 2011.05
# linpackc
Enter array size (q to quit) [200]:
Memory required: 315K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.74 89.82% 2.84% 7.35% 128994.447
128 1.47 89.81% 2.84% 7.35% 128972.585
256 2.94 89.81% 2.84% 7.35% 128981.669
512 5.88 89.81% 2.84% 7.35% 128993.429
1024 11.77 89.81% 2.84% 7.35% 128991.761
I'm happy to announce that a patch adding the new tuning options supporting 'neon-vfpv4' has been merged upstream in oe-core, see [1].
This allows you to specify DEFAULTTUNE = "cortexa7thf-neon-vfpv4" (or DEFAULTTUNE = "cortexa7hf-neon-vfpv4" if you do not like thumb) to get the most out of your A20!
Kristof
[1] http://git.yoctoproject.org/cgit.cgi/poky/commit/?id=e65422f0f79d6069a3312cb4a3d110ec809017ad
I just noticed that I actually never really used thumb instructions.
OpenEmbedded by default passes the -marm option, rather than the -mthumb option, even when requesting a thumb tuning profile. This is discussed at [1], and is also visible from the compiler options I pasted earlier.
So yes, this reinforces the argument that, in practice, probably almost nobody uses thumb instructions (even when they might think they do).
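A quick way to double-check what actually ends up on the compiler command line is to dump the expanded variable for some recipe (busybox here is just an example):
$ bitbake -e busybox | grep '^TUNE_CCARGS='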
The trick to enforce thumb is to also set:
ARM_INSTRUCTION_SET = "thumb"
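So the full "real thumb" configuration would presumably look something like this (tune name taken from the oe-core patch above, so far untested on my side):
DEFAULTTUNE = "cortexa7thf-neon-vfpv4"
ARM_INSTRUCTION_SET = "thumb"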
I might experiment with this later.
[1] http://article.gmane.org/gmane.comp.handhelds.openembedded.core/47005
(1) DEFAULTTUNE = cortexa7hf-neon-vfpv4 & performance governor at 1080 MHz:
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.61 88.28% 2.84% 8.88% 158541.762
128 1.22 88.27% 2.85% 8.88% 158514.885
256 2.43 88.25% 2.85% 8.90% 158570.866
512 4.87 88.26% 2.85% 8.89% 158566.682
1024 9.73 88.27% 2.85% 8.88% 158557.868
2048 19.47 88.27% 2.85% 8.89% 158558.324
Image size: 147 MB
(2) DEFAULTTUNE = cortexa7thf-neon-vfpv4 & ARM_INSTRUCTION_SET = thumb & performance governor at 1080 MHz (i.e. real thumb):
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.61 88.14% 2.83% 9.04% 158024.691
128 1.22 88.11% 2.84% 9.05% 158031.084
256 2.45 88.11% 2.84% 9.05% 158040.035
512 4.89 88.12% 2.84% 9.04% 158033.002
1024 9.78 88.12% 2.84% 9.04% 158054.351
2048 19.57 88.12% 2.84% 9.05% 158049.991
Image size: 149 MB
Conclusion: thumb performance in this simple test is 0.3% slower, and size is 1.3% smaller. So the expected tendencies described earlier (minimal performance loss, more dense) are there, but are not significant (at least not in this simple linpackc test).
Note: I ran these linpackc benchmarks multiple times, and posted one "representative" one - typically I had about 0.1% variation (200 KFlops) among consecutive runs.
EDIT: corrected percentages
Does anybody know how to solve this "bug": https://bugzilla.yoctoproject.org/show_bug.cgi?id=7275
It seems to be a problem; could you please open an issue?
Note: I am focusing on Cubieboard2 in this post, as that is the board I own and can test on, but this should be relevant to other boards as well.
Currently our cubieboard2 machine.conf file [1] falls back on the default tuning option specified in arch-armv7a.inc [2]:
This boils down to the following compiler options:
That is really the lowest performance option, and the Cubieboard2 is capable of more than that. Specifically, it supports NEONv2, VFPv4 and Thumb-2 (see [3] and [4]) - but the default tuning file does not take advantage of that.
I'd propose to set the default tune to something that supports all the capabilities of the Allwinner A20 chip, i.e.:
resulting in
Note that this still does not take advantage of the NEONv2/VFPv4 capabilities of the Allwinner A20 - for that we'd need -mfpu=neon-vfpv4 [5]. I am currently using an ugly hack to force this compile option in my builds, and have opened a request upstream to add it ([6]). I'll try to run some benchmarks comparing the default with the proposed tuning options above, to put some data behind this and get an idea of how big the difference really is.
In the meantime, all comments welcome.
Thanks!
Kristof
[1] https://github.com/linux-sunxi/meta-sunxi/blob/master/conf/machine/cubieboard2.conf
[2] https://github.com/openembedded/oe-core/blob/master/meta/conf/machine/include/tune-cortexa7.inc
[3] http://linux-sunxi.org/Allwinner_SoC_Family
[4] http://wits-hep.blogspot.fr/2013/12/fftw-benchmarks-on-cortex-a7.html
[5] http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
[6] https://bugzilla.yoctoproject.org/show_bug.cgi?id=5710