Closed llvmbot closed 8 years ago
Forgot to ask: are current-generation Cortex-A8s still affected, or can this flag be enabled safely everywhere? TIA!
Indeed, that brings performance completely in line with the fastest version.
Thanks!
Ideally a patch, but give -mattr=-slowfpvmlx a whirl (note the minus sign instead of a plus).
Excellent! Is this feature settable from the command line or are we talking about a patch?
OK, the fault seems pretty obvious. We're not forming VMLA/VMLS when targeting Cortex-A5.
This appears to be because we've set the "FeatureHasSlowFPVMLx" feature flag, which is almost certainly because the A5 model was based directly on Cortex-A8 which had a slowdown when using VMLA/VMLS.
Fix is simply to remove this flag.
James
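For readers following along, here is a minimal sketch of the kind of source pattern where this matters. The `dot` function is a hypothetical example, not code from the attached benchmarks. The assumption is that the separate floating-point multiply and add in the loop body are what the ARM backend can merge into a single VMLA when the slow-VMLx feature is not set; whether it actually does so depends on the target CPU and features.

```rust
/// Dot product written as an explicit multiply-accumulate loop.
/// Hypothetical illustration; not taken from the raytrace or kmeans sources.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (x, y) in a.iter().zip(b.iter()) {
        // A multiply followed by an add into the accumulator:
        // the candidate pattern for VMLA (assumption, see lead-in).
        acc += x * y;
    }
    acc
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 5.0, 6.0];
    println!("dot = {}", dot(&a, &b));
}
```

Comparing the disassembly of a loop like this with and without the feature flag should show whether VMLA is being formed.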
Oops, I must have started believing in your powers of clairvoyance ;)
The slower version produces a trace like this:
Profiling raytrace-a8fc715418690b23 with callgrind...
Total Instructions....342,773,768
and the faster version:
Total Instructions...281,513,893
Cheers!
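To put the two callgrind totals above in proportion, a small sketch that computes the relative difference. The helper name `percent_reduction` is made up for illustration; only the two instruction counts come from the traces.

```rust
/// Percentage by which `fast` is smaller than `slow`.
/// Hypothetical helper, not part of any benchmark harness.
fn percent_reduction(slow: u64, fast: u64) -> f64 {
    (slow - fast) as f64 / slow as f64 * 100.0
}

fn main() {
    // Instruction totals reported by callgrind in the traces above.
    let slow = 342_773_768u64;
    let fast = 281_513_893u64;
    println!(
        "{:.1}% fewer instructions in the faster version",
        percent_reduction(slow, fast)
    );
}
```

That works out to a bit under 18% fewer instructions executed, which is a larger gap than the timing difference alone suggests.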
Hi Pete,
That disassembly is huge. Where's the hot part?
Cheers,
James
Hi again,
I think I've finally found something worth looking at :) The following rust benchmark:
https://github.com/nikomatsakis/rust-runtime-benchmarks/tree/master/runtime-benchmarks/raytrace
produces the following results (using MIR translation via -Zorbit):
ARMv6 vfp2
test bench ... bench: 239,727,289 ns/iter (+/- 274,873)
ARMv7 cortex-a5 vfp4
test bench ... bench: 264,764,759 ns/iter (+/- 159,791)
Adding +neon degrades performance a little further, but that's probably just noise. I hope you'll take a look at the files I'm going to attach. Thanks!
Kmeans benchmark source
Running the benchmark requires the following file: https://raw.githubusercontent.com/andreaferretti/kmeans/master/points.json
Another case from kmeans benchmark (https://github.com/andreaferretti/kmeans) running on Cortex-A5.
10 runs' average time:
Cortex-A5 1027 ms
Cortex-A9 1012 ms
Compiled with rustc 1.8.0 (llvm 3.8) using the following flags:
-C opt-level=3 -C target-cpu=cortex-a5(9) -C target-feature=+vfp4,+neon,+v7
Extended Description
It seems code generation for Cortex-A5 and armv7 doesn't provide much benefit, sometimes even coming out a little slower (compared to default and v6 respectively).
Firstly, our old friend from issue #26106, w/o NEON and unrolling:
default cpu - v7, v6:
test sum_deque ... bench: 5,219 ns/iter (+/- 56), 4,967 ns/iter (+/- 50)
test sum_deque_2 ... bench: 3,272 ns/iter (+/- 40), 3,112 ns/iter (+/- 22)
It seems v6 code is faster on this cpu.
Secondly, a few benchmarks where the cortex-a5 target is either equal or slower (4-core):
Spectral-norm benchmark, v7:
$ time ./spectral 5500
default cpu       cortex-a5
real 0m9.106s     real 0m9.051s
user 0m34.110s    user 0m34.240s
sys 0m0.040s      sys 0m0.020s
Fannkuch benchmark:
$ time ./fannkuch 12
real 0m30.017s    real 0m30.645s
user 1m55.570s    user 1m56.350s
sys 0m0.030s      sys 0m0.010s
Command used to compile: rustc -C opt-level=3 -C target-feature=+v7 -C target-cpu=cortex-a5
Those were just a few examples off the top of my head, probably not the best ones. I'm sure I've also seen an example or two benefiting from the cortex-a5 target but can't remember which.