Cortex-A5 codegen suboptimal?

llvmbot commented 8 years ago


Bugzilla Link	26135
Resolution	WORKSFORME
Resolved on	Sep 02, 2016 13:44
Version	3.7
OS	Linux
Attachments	fannkuch benchmark, spectral-norm benchmark
Reporter	LLVM Bugzilla Contributor
CC	@jmolloy

Extended Description

It seems code generation for Cortex-A5 and armv7 doesn't provide much benefit, sometimes even coming out a little slower (compared to default and v6 respectively).

Firstly, our old friend from issue #26106, w/o NEON and unrolling:

default cpu - v7, v6:

test sum_deque ... bench: 5,219 ns/iter (+/- 56), 4,967 ns/iter (+/- 50)

test sum_deque_2 ... bench: 3,272 ns/iter (+/- 40), 3,112 ns/iter (+/- 22)

It seems v6 code is faster on this cpu.

Secondly, a few benchmarks where the cortex-a5 target is either equal or slower (4-core):

Spectral-norm benchmark, v7: $ time ./spectral 5500

        default cpu    cortex-a5
real    0m9.106s       0m9.051s
user    0m34.110s      0m34.240s
sys     0m0.040s       0m0.020s

Fannkuch benchmark: $ time ./fannkuch 12

        default cpu    cortex-a5
real    0m30.017s      0m30.645s
user    1m55.570s      1m56.350s
sys     0m0.030s       0m0.010s

Command used to compile: rustc -C opt-level=3 -C target-feature=+v7 -C target-cpu=cortex-a5

Those were just a few examples off the top of my head, probably not the best ones. I'm sure I've also seen an example or two benefiting from the cortex-a5 target but can't remember which.

llvmbot commented 8 years ago

Forgot to ask: is the current generation of Cortex-A8s still affected, or can this flag be enabled safely everywhere? TIA!

llvmbot commented 8 years ago

Indeed, that brings performance completely in line with the fastest version.

Thanks!

jmolloy commented 8 years ago

Ideally a patch, but give -mattr=-slowfpvmlx a whirl (note the minus sign instead of a plus).

llvmbot commented 8 years ago

Excellent! Is this feature settable from the command line or are we talking about a patch?

jmolloy commented 8 years ago

OK, the fault seems pretty obvious. We're not forming VMLA/VMLS when targeting Cortex-A5.

This appears to be because we've set the "FeatureHasSlowFPVMLx" feature flag, which is almost certainly because the A5 model was based directly on Cortex-A8, which had a slowdown when using VMLA/VMLS.

Fix is simply to remove this flag.

James
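For illustration only (not part of the original thread): a minimal Rust sketch of the kind of floating-point loop this affects. Each iteration multiplies two values and adds the product into an accumulator, which is the pattern a multiply-accumulate instruction such as VMLA can cover. The function name `dot` and the sample inputs are hypothetical. Whether a given compiler actually emits VMLA for this loop depends on the target features and LLVM version, and has not been verified here.

```rust
/// Dot product written as a plain multiply-accumulate loop.
/// Each `acc += x * y` is a multiply followed by an add into the
/// accumulator, i.e. a candidate for a multiply-accumulate instruction.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (x, y) in a.iter().zip(b.iter()) {
        acc += x * y;
    }
    acc
}

fn main() {
    // Hypothetical sample data, chosen only to exercise the loop.
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 5.0, 6.0];
    println!("dot = {}", dot(&a, &b));
}
```

Comparing the assembly generated for such a function with and without the feature flag is one way to check whether separate multiply and add instructions are replaced by a single multiply-accumulate.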

llvmbot commented 8 years ago

Oops, I must have started believing in your powers of clairvoyance ;)

The slower version produces a trace like this:

Profiling raytrace-a8fc715418690b23 with callgrind...

Total Instructions....342,773,768

141,645,505 (41.3%) model.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit

110,769,356 (32.3%) vec.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit

67,971,353 (19.8%) model.rs:mallocx'2

12,524,673 (3.7%) slice.rs:mallocx'2

6,248,154 (1.8%) ptr.rs:mallocx'2

1,279,177 (0.4%) vec.rs:mallocx'2

669,986 (0.2%) render.rs:mallocx'2

399,334 (0.1%) main.rs:mallocx'2

307,841 (0.1%) iterator.rs:mallocx'2

260,430 (0.1%) lib.rs:mallocx'2

244,210 (0.1%) lib.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter

238,745 (0.1%) vec.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter

215,004 (0.1%) f32.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit

and the faster version:

Total Instructions...281,513,893

117,094,561 (41.6%) model.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit

73,884,689 (26.2%) vec.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit

67,926,746 (24.1%) model.rs:mallocx'2

12,506,985 (4.4%) slice.rs:mallocx'2

6,246,225 (2.2%) ptr.rs:mallocx'2

997,208 (0.4%) vec.rs:mallocx'2

653,355 (0.2%) render.rs:mallocx'2

399,504 (0.1%) main.rs:mallocx'2

308,487 (0.1%) iterator.rs:mallocx'2

257,294 (0.1%) lib.rs:mallocx'2

243,601 (0.1%) lib.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter

215,004 (0.1%) f32.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit

213,231 (0.1%) vec.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter

153,709 (0.1%) materials.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter

153,084 (0.1%) wrapping.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter

130,364 (0.0%) materials.rs:_..raytrace..materials..Metal..as..raytrace..materials..Material..::scatter

129,846 (0.0%) dl-tls.c:__tls_get_addr

Cheers!

jmolloy commented 8 years ago

Hi Pete,

That disassembly is huge. Where's the hot part?

Cheers,

James

llvmbot commented 8 years ago

raytrace benchmark assembly + IR

llvmbot commented 8 years ago

Hi again,

I think I've finally found something worth looking at :) The following rust benchmark:

https://github.com/nikomatsakis/rust-runtime-benchmarks/tree/master/runtime-benchmarks/raytrace

produces the following results (using MIR translation via -Zorbit):

ARMv6 vfp2 test bench ... bench: 239,727,289 ns/iter (+/- 274,873)

ARMv7 cortex-a5 vfp4 test bench ... bench: 264,764,759 ns/iter (+/- 159,791)

Adding +neon degrades performance further a little bit but that's just noise probably. Hope you care to look at the files I'm going to attach. Thanks!
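To put the two figures above in relative terms, here is a small Rust sketch that computes how much slower the Cortex-A5 build is than the ARMv6 build. The helper name `slowdown_percent` is made up for this example; the inputs are the ns/iter values quoted above, and the result comes out at roughly 10%.

```rust
/// Percentage by which `slow` exceeds `fast`.
fn slowdown_percent(slow: f64, fast: f64) -> f64 {
    (slow / fast - 1.0) * 100.0
}

fn main() {
    // ns/iter figures from the raytrace benchmark results above.
    let armv6_vfp2 = 239_727_289.0;
    let armv7_a5_vfp4 = 264_764_759.0;
    println!(
        "cortex-a5 build is {:.1}% slower",
        slowdown_percent(armv7_a5_vfp4, armv6_vfp2)
    );
}
```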

llvmbot commented 8 years ago

Kmeans benchmark source

Running the benchmark requires the following file: https://raw.githubusercontent.com/andreaferretti/kmeans/master/points.json

llvmbot commented 8 years ago

Another case from kmeans benchmark (https://github.com/andreaferretti/kmeans) running on Cortex-A5.

10 runs' average time:

Cortex-A5 1027 ms
Cortex-A9 1012 ms

Compiled with rustc 1.8.0 (llvm 3.8) using the following flags:

-C opt-level=3 -C target-cpu=cortex-a5(9) -C target-feature=+vfp4,+neon,+v7
