So, what happens if we run well-optimized armv7 binaries on that system?

```
sysalloc: real 1m9s   user 3m59s  sys 0m19s
jemalloc: real 1m11s  user 3m58s  sys 0m25s
```

Ouch!
EDIT: I did another comparison like this later, running armv7 binaries on aarch64, and for certain CPU-bound workloads native code was 2-3 times faster (even though, all else being equal, only about a 50% improvement would be expected; probably a 64-bit effect).
What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1 , but the output is not similar to yours.
(Regarding the armv7 case … it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)
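To make the cache argument concrete, here's a tiny illustration; the `Node` type is hypothetical, just shaped like the per-node allocations a binary-trees benchmark makes:

```rust
use std::mem::size_of;

// A tree node holding two optional child pointers. Option<Box<T>> is
// pointer-sized thanks to the null-pointer optimization.
struct Node {
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn main() {
    // Prints 8 on a 32-bit target (armv7) and 16 on a 64-bit one (aarch64),
    // so roughly twice as many nodes fit in the same amount of cache.
    println!("Node is {} bytes", size_of::<Node>());
}
```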
On Sun, 26 Jun 2016 01:09:27 -0700, sorear (notifications@github.com) wrote:

> What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1 , but the output is not similar to yours.

Those were the timings (`time` output: real, user, sys).

> (Regarding the armv7 case … it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)

Yes, but the relative difference, as I mentioned in the opening comment, was very small, which suggests the maturity of LLVM's aarch64 backend is also a factor.
What do the aarch64 timings look like if you turn off memory return to the OS using `MALLOC_CONF=lg_dirty_mult:-1`? That helped the last time I saw jemalloc using excessive sys time.
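For example (assuming the `binary_trees` binary from this thread; `lg_dirty_mult:-1` is a jemalloc 4.x option that disables dirty-page purging, so freed pages are never handed back to the kernel):

```
MALLOC_CONF=lg_dirty_mult:-1 ./binary_trees 23
```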
Nice trick! Now jemalloc gives real 1m19s, user 4m57s, sys 0m2s. But how does that compare to the default allocator's settings? And by changing those, the issue stops being about code generation comparisons at all.
Thanks to your tweak, the armv7 jemalloc binary, running on the Cortex-A53, was able to catch up with sysalloc :) (real 1m9s, user 3m55s, sys 0m1s)
I'd be in favor of turning jemalloc off everywhere except where it's already proven to be a win. Or everywhere period.
@brson Now that I've built rust on two different ARM architectures with `--disable-jemalloc`, I'd like to propose a `configure` switch inverting the current allocator defaults. In other words, use alloc_system by default, but also build the jemalloc crate.

The current disable switch makes it impossible to use jemalloc on a per-crate basis, like this:

```rust
#![feature(alloc_jemalloc)]
extern crate alloc_jemalloc;
```
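For reference, the mirror-image opt-in already exists on nightly when jemalloc is the default allocator (and is presumably what "the first 2 lines in main.rs" mentioned at the end of this thread toggle):

```rust
#![feature(alloc_system)]
extern crate alloc_system;
```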
Or, more simply, `--disable-jemalloc` could start meaning just that.
> Or, more simply, `--disable-jemalloc` could start meaning just that.
sgtm
The following news makes this issue much less interesting. Who knows what effect DVFS has under different loads.
I've just run the binary_trees benchmark on an ARMv8 Cortex-A53 processor, having converted an Android TV box to Linux. I'd found previously, on a much weaker (but more power-efficient) armv7 Cortex-A5, that the results were equal. On the new machine (using the latest official aarch64 rustc nightly), `./binary_trees 23` produces the following results:

```
sysalloc: real 1m28s  user 5m10s  sys 0m10s
jemalloc: real 1m35s  user 5m10s  sys 0m53s
```

which is palpably worse, actually, even though the Cortex-A53 is a much stronger core.

I'm beginning to think jemalloc only makes sense on Intel processors with heaps of L1/L2 cache. More benchmark ideas welcome, though.
Added retroactively: to reproduce, unpack the attachment and run the benchmark inside the binary_trees directory. Uncomment the first 2 lines in main.rs to produce a sysalloc version.
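For readers without the attachment, here is a minimal sketch of what such a main.rs might look like. This is a guess modeled on the benchmarksgame binary-trees program, not the actual attachment; the two commented lines at the top are the sysalloc switch mentioned above and require a nightly rustc of that era:

```rust
// Hypothetical reconstruction of main.rs; the real attachment may differ.
// Uncomment the next two lines (on a 2016-era nightly) for the sysalloc build:
// #![feature(alloc_system)]
// extern crate alloc_system;

use std::env;

// A node with two owned child pointers, so every node is a separate
// heap allocation -- this is what stresses the allocator.
enum Tree {
    Leaf,
    Node(Box<Tree>, Box<Tree>),
}

fn build(depth: u32) -> Tree {
    if depth == 0 {
        Tree::Leaf
    } else {
        Tree::Node(Box::new(build(depth - 1)), Box::new(build(depth - 1)))
    }
}

fn count(t: &Tree) -> u64 {
    match *t {
        Tree::Leaf => 1,
        Tree::Node(ref l, ref r) => 1 + count(l) + count(r),
    }
}

fn main() {
    // e.g. `./binary_trees 23`
    let max_depth: u32 = env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(10);

    // Allocate and drop many short-lived trees of increasing depth.
    let mut depth = 4;
    while depth <= max_depth {
        let iterations = 1u64 << (max_depth - depth + 4);
        let mut checksum = 0u64;
        for _ in 0..iterations {
            checksum += count(&build(depth));
        }
        println!("depth {}: {} trees, checksum {}", depth, iterations, checksum);
        depth += 2;
    }
}
```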