rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Jemalloc performance on 64-bit ARM #34476

Closed MagaTailor closed 8 years ago

MagaTailor commented 8 years ago

I've just run the binary_trees benchmark on an ARMv8, Cortex-A53 processor, having converted an Android TV box to Linux.

I'd previously found that on a much weaker (but more power-efficient) armv7 Cortex-A5, the two allocators performed equally. On the new machine (using the latest official aarch64 rustc nightly), ./binary_trees 23 produces the following results:

sysalloc  1m28s  5m10s  0m10s
jemalloc  1m35s  5m10s  0m53s

which is palpably worse, even though the Cortex-A53 is a much stronger core.

I'm beginning to think jemalloc only makes sense on Intel processors with heaps of L1/L2 cache.

More benchmark ideas welcome, though.

added retroactively: To reproduce, unpack the attachment and run:

cargo build --release && time target/release/binary_trees 23

inside the binary_trees directory. Uncomment the first 2 lines in main.rs to produce a sysalloc version.
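The attachment itself is not preserved here, but the benchmark presumably follows the Benchmarks Game binary-trees shape: build a deep tree of heap-allocated nodes and walk it, which stresses the allocator. A minimal sketch under that assumption (names and structure are illustrative, not the actual attachment):

```rust
use std::env;

// A tree whose every node is a separate heap allocation,
// so build/drop cost is dominated by the allocator.
enum Tree {
    Leaf,
    Node(Box<Tree>, Box<Tree>),
}

fn build(depth: u32) -> Tree {
    if depth == 0 {
        Tree::Leaf
    } else {
        Tree::Node(Box::new(build(depth - 1)), Box::new(build(depth - 1)))
    }
}

// Walk the tree and count nodes, forcing every allocation to be touched.
fn check(t: &Tree) -> u32 {
    match t {
        Tree::Leaf => 1,
        Tree::Node(l, r) => 1 + check(l) + check(r),
    }
}

fn main() {
    // e.g. `./binary_trees 23` as in the repro steps above.
    let depth: u32 = env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(10);
    let t = build(depth);
    println!("depth {} nodes {}", depth, check(&t));
}
```

A tree of depth d has 2^(d+1) - 1 nodes, so depth 23 makes roughly 16.7 million allocations per tree, which is why the allocator dominates the timings.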

MagaTailor commented 8 years ago

So, what happens if we run well optimized armv7 binaries on that system?

sysalloc  1m9s   3m59s  0m19s
jemalloc  1m11s  3m58s  0m25s

Ouch!

EDIT: I did another comparison like this later, using armv7 binaries on aarch64, and for certain CPU-bound workloads native code was 2-3 times faster (even though, all else being equal, only about a 50% improvement would be expected, probably a 64-bit effect).

sorear commented 8 years ago

What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1 , but the output is not similar to yours.

(Regarding the armv7 case … it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)
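The pointer-width point can be made concrete. A small sketch (the node layout here is hypothetical, not taken from the benchmark):

```rust
use std::mem::size_of;

// A pointer-heavy node, like the ones binary_trees allocates by the million.
struct Node {
    left: *const Node,
    right: *const Node,
}

fn main() {
    // On armv7 a pointer is 4 bytes, so this node is 8 bytes;
    // on aarch64 a pointer is 8 bytes, so it is 16 bytes.
    // Twice the per-node footprint means half as many nodes per cache line.
    println!(
        "pointer: {} bytes, node: {} bytes",
        size_of::<*const Node>(),
        size_of::<Node>()
    );
}
```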

MagaTailor commented 8 years ago

On Sun, 26 Jun 2016 01:09:27 -0700 sorear notifications@github.com wrote:

> What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1 , but the output is not similar to yours.

Those were the timings.

> (Regarding the armv7 case … it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)

Yes, but as I mentioned in the opening comment, the relative difference on armv7 was very small, which means LLVM backend maturity is also a factor.

sorear commented 8 years ago

What do the aarch64 timings look like if you turn off memory return to the OS using MALLOC_CONF=lg_dirty_mult:-1 ? That helped last time I saw jemalloc using excessive sys time.
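For reference, that tweak is applied through jemalloc's MALLOC_CONF environment variable (jemalloc 3.x/4.x-era option; lg_dirty_mult:-1 disables purging of dirty pages back to the OS entirely):

```shell
# Tell jemalloc never to return dirty pages to the OS, trading resident
# memory for less madvise/munmap work (the excessive sys time above).
export MALLOC_CONF=lg_dirty_mult:-1

# Then re-run the benchmark as before, e.g.:
#   time target/release/binary_trees 23
echo "$MALLOC_CONF"
```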

MagaTailor commented 8 years ago

Nice trick! Now it's jemalloc 1m19s 4m57s 0m2s, but how does that compare to the default allocator's settings? And by changing those settings, the issue stops being about code generation comparisons at all.

Thanks to your tweak, the armv7 jemalloc binary, running on the Cortex-A53, was able to catch up with sysalloc :) (1m9s 3m55s 0m1s)

brson commented 8 years ago

I'd be in favor of turning jemalloc off everywhere except where it's already proven to be a win. Or everywhere period.

MagaTailor commented 8 years ago

@brson Now that I've built rust on two different ARM architectures with --disable-jemalloc, I'd like to propose a configure switch inverting the current allocator defaults. In other words, use alloc_system by default, but also build the jemalloc crate.

The current disable switch makes it impossible to use jemalloc on a per crate basis, like this:

#![feature(alloc_jemalloc)]
extern crate alloc_jemalloc;

Or, more simply, --disable-jemalloc could start meaning just that.

brson commented 8 years ago

> Or, more simply, --disable-jemalloc could start meaning just that.

sgtm

MagaTailor commented 8 years ago

The following news makes this issue much less interesting. Who knows what effect DVFS has under different loads.

http://www.cnx-software.com/2016/08/28/amlogic-s905-and-s912-processors-appear-to-be-limited-to-1-5-ghz-not-2-ghz-as-advertised/