Closed: hsivonen closed this issue 5 years ago
> `packed_simd` instead uses `vmov.32` four times and then ORs them together on the ALU.
Are you using `packed_simd` with the `core_arch` feature enabled? (EDIT: It appears that this is not the case: https://github.com/hsivonen/encoding_rs/blob/packed_simd/Cargo.toml#L17; only `into_bits` is enabled among the cargo features of `packed_simd`.)

You need to enable the `core_arch` feature (previously called `coresimd`) and compile with `-C target-feature=+neon` for that to work.
This is because baseline ARMv7 does not guarantee NEON support, so unconditionally emitting `vpmax.u8` would trigger undefined behavior.
I'm closing this; feel free to re-open if using `-C target-feature=+neon` and the `core_arch` cargo feature does not fix this issue for you.
I'm not using `-C target-feature=+neon` with `packed_simd`, because I'm using the `thumbv7neon-unknown-linux-gnueabihf` target, which was supposed to make a NEON-enabled `core::arch` available to `packed_simd`.
When using the `core_arch` feature, the code uses `vpmax.u8` four times per vector, and the result is still slower than `simd`:
packed_simd, inline(never), core_arch test bench_mem_copy_ascii_to_ascii_1000 ... bench: 146,646 ns/iter (+/- 971) = 3409 MB/s
6cb36: f961 0a0f vld1.8 {d16-d17}, [r1]
6cb3a: efc9 2070 vshr.s8 q9, q8, #7
6cb3e: ff42 2aa3 vpmax.u8 d18, d18, d19
6cb42: ff42 2aa0 vpmax.u8 d18, d18, d16
6cb46: ff42 2aa0 vpmax.u8 d18, d18, d16
6cb4a: ff42 2aa0 vpmax.u8 d18, d18, d16
6cb4e: eed2 1b90 vmov.u8 r1, d18[0]
6cb52: 2900 cmp r1, #0
(GitHub doesn't show me UI to reopen this.)
So I think there are two bugs:

1. `thumbv7neon-unknown-linux-gnueabihf` requires the `core_arch` feature to use NEON.
2. 4 times `vpmax.u8` followed by `vmov.u8` instead of 2 times `vpmax.u8` followed by `vmov.32`.

Let's keep this issue about item 2. Filed another issue about item 1.
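For reference, the folding that `vpmax.u8` performs can be modeled portably: each application replaces adjacent byte pairs with their maximum, so two applications shrink a 16-lane mask far enough that one 32-bit ALU read decides the reduction. This is a sketch with a hypothetical helper, not the crate's code:

```rust
// Portable model of the vpmax.u8 pairwise-max step (illustrative
// helper, not packed_simd's implementation): each application halves
// the number of meaningful lanes.
fn pairwise_max(v: &[u8]) -> Vec<u8> {
    v.chunks(2).map(|p| p[0].max(p[1])).collect()
}

fn main() {
    let mut v = [0u8; 16];
    v[11] = 0xFF; // one "true" lane in the mask

    // Two pairwise-max steps: 16 lanes -> 8 -> 4.
    let folded = pairwise_max(&pairwise_max(&v));
    assert_eq!(folded.len(), 4);

    // The remaining 4 bytes summarize all 16 original lanes, so a
    // single vmov.32-style read of one 32-bit word finishes `any`.
    let word = u32::from_ne_bytes([folded[0], folded[1], folded[2], folded[3]]);
    assert!(word != 0);
    println!("any = {}", word != 0);
}
```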
> 2. 4 times `vpmax.u8` followed by `vmov.u8` instead of 2 times `vpmax.u8` followed by `vmov.32`.
That looks bad, I'll investigate. Which vector type and which reductions are you using? `m8x16` and the `any` reduction?
@hsivonen is it this code here: https://github.com/hsivonen/encoding_rs/blob/54cf6eb37d26237c3d62268f2cb7f57609e800eb/src/simd_funcs.rs#L146?

It appears that the implementation in the simd crate (https://github.com/gnzlbg/simd-1/blob/master/src/arm/neon.rs#L589) does two `vpmax` instead of four, and then uses a bitcast.
> is it this code here: https://github.com/hsivonen/encoding_rs/blob/54cf6eb37d26237c3d62268f2cb7f57609e800eb/src/simd_funcs.rs#L146?
Yes, that's the code.
So I've submitted a PR to fix this (https://github.com/rust-lang-nursery/packed_simd/pull/218). It only performs a single vpmax instead of two or four (it's hard to know what performs best without a system to benchmark this). I'm not sure yet if that's correct. I think so, but CI will catch any errors otherwise. You might want to check whether that branch performs better for you.
I experimented a bit on godbolt, and it seems that bitcasting to a 64-bit integer is the sweet spot (godbolt):
define i1 @any_no_neon(<16 x i8>) {
%2 = bitcast <16 x i8> %0 to i128
%3 = icmp ne i128 %2, 0
ret i1 %3
}
define i1 @any_neon_64bitcast(<16 x i8>) {
%2 = shufflevector
<16 x i8> %0, <16 x i8> undef,
<8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%3 = shufflevector
<16 x i8> %0, <16 x i8> undef,
<8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%4 = tail call <8 x i8> @llvm.arm.neon.vpmaxu.v8i8(<8 x i8> %2, <8 x i8> %3)
%5 = bitcast <8 x i8> %4 to i64
%6 = icmp ne i64 %5, 0
ret i1 %6
}
define i1 @any_neon_32bitcast(<16 x i8>) {
%2 = shufflevector
<16 x i8> %0, <16 x i8> undef,
<8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%3 = shufflevector
<16 x i8> %0, <16 x i8> undef,
<8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%4 = tail call <8 x i8> @llvm.arm.neon.vpmaxu.v8i8(<8 x i8> %2, <8 x i8> %3)
%5 = tail call <8 x i8> @llvm.arm.neon.vpmaxu.v8i8(<8 x i8> %4, <8 x i8> undef)
%6 = bitcast <8 x i8> %5 to <2 x i32>
%7 = extractelement <2 x i32> %6, i32 0
%8 = icmp ne i32 %7, 0
ret i1 %8
}
declare <8 x i8> @llvm.arm.neon.vpmaxu.v8i8(<8 x i8>, <8 x i8>)
produces
any_no_neon:
vmov d17, r2, r3
vmov d16, r0, r1
vmov.32 r0, d17[1]
vmov.32 r2, d17[0]
vmov.32 r1, d16[1]
vmov.32 r3, d16[0]
orr r0, r1, r0
orr r1, r3, r2
orrs r0, r1, r0
movne r0, #1
mov pc, lr
any_neon_64bitcast:
vmov d16, r2, r3
vmov d17, r0, r1
vpmax.u8 d16, d17, d16
vmov r0, r1, d16
orrs r0, r0, r1
movne r0, #1
mov pc, lr
any_neon_32bitcast:
vmov d16, r2, r3
vmov d17, r0, r1
vpmax.u8 d16, d17, d16
vpmax.u8 d16, d16, d16
vmov.32 r0, d16[0]
cmp r0, #0
movne r0, #1
mov pc, lr
which in terms of instruction count hints that the 64-bit bitcast is the best option.
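As an aside, the `any_no_neon` baseline is just an integer bitcast; a portable Rust sketch of the same semantics (an illustration, not `packed_simd`'s implementation) is:

```rust
// Portable equivalent of the `any_no_neon` IR above: reinterpret the
// 16 mask bytes as one 128-bit integer and compare against zero.
fn any(mask: [u8; 16]) -> bool {
    u128::from_ne_bytes(mask) != 0
}

fn main() {
    assert!(!any([0u8; 16]));
    let mut m = [0u8; 16];
    m[7] = 0x80; // the sign bit a vshr.s8-produced mask would set
    assert!(any(m));
    println!("ok");
}
```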
So I ran this through LLVM MCA (godbolt). For 100 iterations on a cortex-a9, its predictions appear to confirm that the version with one `vpmax` and one bitcast to a 64-bit integer performs best, on a cortex-a9 at least.
This bug should be fixed in version 0.3.3.
Thank you. With this change, I get:
59346: f961 0a0f vld1.8 {d16-d17}, [r1]
5934a: efc9 2070 vshr.s8 q9, q8, #7
5934e: ff42 2aa3 vpmax.u8 d18, d18, d19
59352: ec54 1b32 vmov r1, r4, d18
59356: 4321 orrs r1, r4
59358: d147 bne.n 593ea <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h8e9e53db7143347dE+0xc2>
In the case of `copy_ascii_to_ascii`, it's still slower than the `simd` crate on Exynos 5:
simd, inline(never) test bench_mem_copy_ascii_to_ascii_1000 ... bench: 120,045 ns/iter (+/- 686) = 4165 MB/s
packed_simd, inline(never), 0.3.3 test bench_mem_copy_ascii_to_ascii_1000 ... bench: 141,979 ns/iter (+/- 10,279) = 3521 MB/s
I'll run a broader benchmark next.
Did you happen to write the `any_neon_32bitcast` version into Rust code that I could try?
> In the case of `copy_ascii_to_ascii`, it's still slower than the `simd` crate on Exynos 5:
What target CPU are you using? LLVM MCA trunk only lists exynos-m1, ..., exynos-m4.
> Did you happen to write the `any_neon_32bitcast` version into Rust code that I could try?
Not really, I wrote the LLVM IR directly, but it would be:
#![feature(
repr_simd, platform_intrinsics,
link_llvm_intrinsics, simd_ffi
)]
#![allow(non_camel_case_types, improper_ctypes)]
extern {
#[link_name = "llvm.arm.neon.vpmaxu.v8i8"]
fn vpmax_u8(a: m8x8, b: m8x8) -> m8x8;
}
#[repr(simd)]
#[derive(Copy, Clone)]
pub struct m8x16(
u8, u8, u8, u8, u8, u8, u8, u8,
u8, u8, u8, u8, u8, u8, u8, u8,
);
#[repr(simd)]
#[derive(Copy, Clone)]
pub struct m8x8(
u8, u8, u8, u8, u8, u8, u8, u8,
);
pub unsafe fn any(x: m8x16) -> bool {
union U { x: m8x16, y: (m8x8, m8x8) }
let (a, b) = U { x }.y;
let c = vpmax_u8(a, b);
let d = vpmax_u8(c, std::mem::uninitialized());
union V { d: m8x8, y: (u32, u32) }
let e = V { d }.y.0;
e != 0
}
You could add that somewhere in your project and replace the `.any()` call with it.
That generates on godbolt:
example::any:
vld1.64 {d0, d1}, [r0]
vpmax.u8 d0, d0, d1
vpmax.u8 d0, d0, d0
vmov.32 r0, d0[0]
cmp r0, #0
movwne r0, #1
bx lr
which is still slightly different from what you posted for the `simd` crate, but the LLVM IR should be identical in both cases, at which point I'd assume the difference is due to a different LLVM version being used.
> What target CPU are you using?
Whatever the Rust default for `thumbv7neon-unknown-linux-gnueabihf` is in the case of `packed_simd`, and whatever the Rust default is for `armv7-unknown-linux-gnueabihf` plus `RUSTFLAGS='-C target_feature=+neon,+thumb-mode,+thumb2'` in the case of `simd`.
> Whatever the Rust default for `thumbv7neon-unknown-linux-gnueabihf` is in the case of `packed_simd`
The default CPU of a target has nothing to do with which library you are compiling.
- `armv7-unknown-linux-gnueabihf` uses a `generic` CPU with `+v7,+vfp3,+d16,+thumb2,-neon`, and you are enabling `+neon,+thumb-mode` (enabling `+thumb2` does nothing since the target always enables it).
- `thumbv7neon-unknown-linux-gnueabihf` uses a `generic` CPU with `+v7,+thumb-mode,+thumb2,+vfp3,+neon` (for some reason without `+d16`).
Exynos 5 has cortex-a15 cores, so you probably want to specify `-C target-cpu=cortex-a15`. Running that through LLVM MCA using that CPU produces the same results as before (https://godbolt.org/z/qy0EiN), so while we might be making an error somewhere, it is also possible that LLVM's scheduling profile for that CPU isn't perfect either (assuming we aren't screwing things up in codegen, measurements, etc.).
You might want to file an LLVM bug for the ARM backend, showing the two asm snippets, your benchmark results, and the predictions of LLVM MCA.
Also please let me know if you manage to benchmark the "bitcast to 32-bit version".
I'm reopening this until we get to the bottom of it.
@hsivonen it would be great if you could extract the issue into its own minimal crate. We are comparing two different compiler versions using two different LLVM backends, and also two different libraries generating slightly different LLVM IR (`simd` and `packed_simd`). Worst of all, because `simd` doesn't compile with Rust nightly (1.34) anymore and `packed_simd` does not compile with Rust 1.32, we can't really separate these things.
> The default CPU of a target has nothing to do with which library you are compiling.
I meant that I tested `simd` and `packed_simd` with different targets, but, as you note, both targets have the same target-CPU scheduling model.
> Exynos 5 has cortex-a15 cores, so you probably want to specify `-C target-cpu=cortex-a15`
I'm not trying to target Exynos 5 specifically. I'm trying to write code for ARMv7 phones generally, so I'm compiling for the default/generic CPU. Exynos 5 just happens to be the only ARMv7 chip I have available that can run `cargo bench` reliably (no thermal throttling and such), so that benchmark runs are repeatable and mean something.

> Not really, I wrote the LLVM IR directly, but it would be:
Thank you.
> Also please let me know if you manage to benchmark the "bitcast to 32-bit version".
So far, I'm having trouble with this. For some reason, this implementation of `any` causes test failures without printing any details of the failure reason. Stepping doesn't work normally, and I have no idea why, so I've been unable to examine in gdb how the code behaves. I don't see anything wrong with the code you suggested, though.

Some time into the test run, there's a panic while panicking, and the process dies with an illegal instruction. When I ask gdb for an assembly view at that point, the instruction shown is in the panic implementation and seems to have nothing to do with SIMD.
Changing `let d = vpmax_u8(c, std::mem::uninitialized());` to `let d = vpmax_u8(c, c);` does not help, so it's not the `uninitialized` bit causing confusion in LLVM.
You might need to make `any` `#[target_feature(enable = "v7,neon")]` (maybe)?
> You might need to make `any` `#[target_feature(enable = "v7,neon")]` (maybe)?
I got the same result with that. :-(
> Stepping doesn't work normally, and I have no idea why, so I've been unable to examine in gdb how the code behaves. I don't see anything wrong with the code you suggested, though.
Can you try with `eprintln!`? E.g. get the implementation of `any` to print its input and output, and see if it does something weird. What you describe looks like a double panic for some reason, but I'm not sure what role `any` could play in that (I don't think it plays any, but who knows; if the code has UB, anything can happen).
I've now tried the `any` implementation in three different ways, among them:

- `std::arch` intrinsics instead of linking the LLVM intrinsics directly.
- The `simd` crate's approach ported to use `packed_simd` shuffles and LLVM intrinsics.

In all three cases, the generated code is the same:
39f62: f961 0a0f vld1.8 {d16-d17}, [r1]
39f66: efc9 2070 vshr.s8 q9, q8, #7
39f6a: fff0 25e2 vmvn q9, q9
39f6e: ff42 2aa3 vpmax.u8 d18, d18, d19
39f72: ff42 2aa0 vpmax.u8 d18, d18, d16
39f76: ee12 1b90 vmov.32 r1, d18[0]
39f7a: 2900 cmp r1, #0
Compared to the code generated by Rust 1.32 for the `simd` crate, the `vmvn` instruction is extra. AFAICT, it negates the bits of the vector for no good reason, which is likely why the code fails tests but runs benches.

Any ideas where the `vmvn` instruction could be coming from?
(I'll try `eprintln!` next.)
Hah. I found the bug. Bad operator placement relative to parentheses.
> Hah. I found the bug. Bad operator placement relative to parentheses.

What exactly was the issue? Is `packed_simd` affected?
> What exactly was the issue?
When I changed expressions of the form `!something.any()`, I accidentally wrote `any16x8(!something)` instead of `!any16x8(something)`.
> Is `packed_simd` affected?
No.
Ah I see. So what about the performance of the three approaches when compared to the old `simd` crate?
For the simplest case, I now get `simd`-like perf:
test bench_mem_copy_ascii_to_ascii_1000 ... bench: 120,451 ns/iter (+/- 14,009) = 4151 MB/s
I'll run the full suite in a few different ways.
> For the simplest case, I now get `simd`-like perf:
Is that using `packed_simd`, or your custom implementations?
> Is that using `packed_simd`, or your custom implementations?
`packed_simd`, except that for the boolean reduction I used the code you suggested earlier in this thread (the two-`vpmax` version).
Does this mean that we can close this? Or does the `packed_simd` implementation have a perf bug?
> Or does the `packed_simd` implementation have a perf bug?
It appears that `packed_simd` has a perf bug.
So far (more benches still running), it looks like the code from your earlier comment, which matches the `simd` crate for instructions generated in the `m8x16` case, is faster than what `packed_simd` 0.3.3 has.
The `simd` crate generates different code for reducing `m16x8`. I'm also benching whether that makes any difference.
I've sent a PR (https://github.com/rust-lang-nursery/packed_simd/pull/219) that should generate the same code as the `simd` crate. Please benchmark that and let me know the results.
> It appears that `packed_simd` has a perf bug.
After more benchmarking, this is quite a bit less clear. :-/
> The `simd` crate generates different code for reducing `m16x8`. I'm also benching whether that makes any difference.
Using `vpmax_u16` twice to reduce `m16x8` does not make a difference compared to using `vpmax_u8` twice to reduce `m16x8` (i.e. reducing as if bitcasting to `m8x16` and reducing that).
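The byte-wise view is valid for `m16x8` because, under the assumption that mask lanes are all-zeros or all-ones, viewing the 16-bit lanes as bytes preserves the `any` answer. A portable sketch (illustration only, not the crate's code):

```rust
// `any` over 16-bit mask lanes.
fn any_u16(lanes: [u16; 8]) -> bool {
    lanes.iter().any(|&l| l != 0)
}

// The same bits viewed as 16 bytes, as a vpmax.u8-based reduction
// would see them after a bitcast to m8x16.
fn any_as_bytes(lanes: [u16; 8]) -> bool {
    lanes.iter().flat_map(|l| l.to_ne_bytes()).any(|b| b != 0)
}

fn main() {
    let zero = [0u16; 8];
    let mut one_lane = [0u16; 8];
    one_lane[5] = 0xFFFF; // an all-ones mask lane

    assert_eq!(any_u16(zero), any_as_bytes(zero));
    assert_eq!(any_u16(one_lane), any_as_bytes(one_lane));
    println!("ok");
}
```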
> I've sent a PR #219 that should generate the same code as the `simd` crate. Please benchmark that and let me know the results.
The results are mixed. Your PR is an improvement over the 0.3.3 release for the function I initially focused on upthread: `copy_ascii_to_ascii` (among the simplest of the functions). Unfortunately, the PR regresses some other benchmarks.
The result files are:

- `packed_simd` 0.3.3 with the functions in `encoding_rs::mem` marked `inline(never)`
- the PR branch with the functions in `encoding_rs::mem` marked `inline(never)`
- `simd` with the functions in `encoding_rs::mem` marked `inline(never)`

(`inline(never)` to avoid distractions from the benchmarked functions getting inlined into the benchmarking harness in a way that confuses the results.)
Each result file was made by running the benchmark suite four times and choosing the best result for each bench from the four runs. (Best instead of average on the logic that stuff can't get accidentally too fast, so the runs that aren't the best have some interference from other processes that I don't know how to turn off.)
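The "best of four runs" selection described above can be sketched as follows (a hypothetical helper for illustration, not the tooling actually used):

```rust
use std::collections::HashMap;

// For each benchmark name, keep the minimum ns/iter across runs, on
// the logic that interference can only make a run slower, never
// accidentally faster.
fn best_of_runs(runs: &[HashMap<String, u64>]) -> HashMap<String, u64> {
    let mut best: HashMap<String, u64> = HashMap::new();
    for run in runs {
        for (name, &ns) in run {
            best.entry(name.clone())
                .and_modify(|b| *b = (*b).min(ns))
                .or_insert(ns);
        }
    }
    best
}

fn main() {
    let runs = [
        HashMap::from([("copy_ascii_1000".to_string(), 146_646u64)]),
        HashMap::from([("copy_ascii_1000".to_string(), 141_813u64)]),
    ];
    let best = best_of_runs(&runs);
    assert_eq!(best["copy_ascii_1000"], 141_813);
    println!("ok");
}
```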
The results are from a Samsung Chromebook 2 running Ubuntu 18.04 on Crouton on Chrome OS 71.
`/proc/cpuinfo` says:
processor : 0
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 48.00
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 idiva idivt
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
processor : 1
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 48.00
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 idiva idivt
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
processor : 2
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 48.00
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 idiva idivt
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
processor : 3
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 48.00
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 idiva idivt
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
Hardware : SAMSUNG EXYNOS5 (Flattened Device Tree)
Revision : 0000
Serial : 0000000000000000
`uname -a` says: `Linux localhost 3.8.11 #1 SMP Thu Jan 10 22:24:15 PST 2019 armv7l armv7l armv7l GNU/Linux`.
For the `bench_mem_` benches, the ones ending in something other than `_1000` are noise: optimizer decisions make them go either way without any pattern one could see from the code changes. Therefore, it helps to ignore those and keep only the `bench_mem_` items that end with `_1000`.
$ cargo benchcmp --threshold 3 033_never.txt arm32bred_never.txt | grep -v '_1[^0]' | grep -v _3
name 033_never.txt ns/iter arm32bred_never.txt ns/iter diff ns/iter diff % speedup
bench_decode_to_string_ar 553,235 (568 MB/s) 533,966 (589 MB/s) -19,269 -3.48% x 1.04
bench_decode_to_string_ko_euc_kr 433,733 (364 MB/s) 412,185 (383 MB/s) -21,548 -4.97% x 1.05
bench_decode_to_utf16_de 295,431 (1079 MB/s) 304,664 (1046 MB/s) 9,233 3.13% x 0.97
bench_decode_to_utf16_en 426,945 (1654 MB/s) 440,716 (1603 MB/s) 13,771 3.23% x 0.97
bench_decode_to_utf16_fr 1,027,770 (914 MB/s) 1,059,029 (887 MB/s) 31,259 3.04% x 0.97
bench_decode_to_utf16_ja_shift_jis 608,687 (377 MB/s) 628,079 (365 MB/s) 19,392 3.19% x 0.97
bench_decode_to_utf16_jquery 39,082 (2218 MB/s) 44,301 (1957 MB/s) 5,219 13.35% x 0.88
bench_decode_to_utf16_th_windows_874 1,053,734 (763 MB/s) 1,132,539 (710 MB/s) 78,805 7.48% x 0.93
bench_decode_to_utf8_ar 605,716 (519 MB/s) 583,191 (539 MB/s) -22,525 -3.72% x 1.04
bench_decode_to_utf8_he 509,095 (516 MB/s) 493,475 (532 MB/s) -15,620 -3.07% x 1.03
bench_decode_to_utf8_jquery 44,591 (1944 MB/s) 46,551 (1862 MB/s) 1,960 4.40% x 0.96
bench_decode_to_utf8_ko_euc_kr 433,883 (364 MB/s) 411,766 (383 MB/s) -22,117 -5.10% x 1.05
bench_decode_to_utf8_zh_cn_gb18030 821,212 (358 MB/s) 793,208 (371 MB/s) -28,004 -3.41% x 1.04
bench_encode_from_utf16_ar 111,534 (206 MB/s) 107,325 (214 MB/s) -4,209 -3.77% x 1.04
bench_encode_from_utf16_cs 199,170 (243 MB/s) 206,366 (234 MB/s) 7,196 3.61% x 0.97
bench_encode_from_utf16_el 158,954 (204 MB/s) 152,994 (211 MB/s) -5,960 -3.75% x 1.04
bench_encode_from_utf16_th_windows_874 805,283 (82 MB/s) 869,270 (76 MB/s) 63,987 7.95% x 0.93
bench_encode_from_utf16_tr 231,040 (242 MB/s) 241,263 (232 MB/s) 10,223 4.42% x 0.96
bench_encode_from_utf16_zh_tw_big5 23,132,804 24,908,525 1,775,721 7.68% x 0.93
bench_encode_from_utf8_ar 1,890 (12192 MB/s) 2,438 (9452 MB/s) 548 28.99% x 0.78
bench_encode_from_utf8_cs 7,093 (6829 MB/s) 5,851 (8278 MB/s) -1,242 -17.51% x 1.21
bench_encode_from_utf8_de 11,207 (6864 MB/s) 10,782 (7134 MB/s) -425 -3.79% x 1.04
bench_encode_from_utf8_el 2,827 (11472 MB/s) 3,498 (9272 MB/s) 671 23.74% x 0.81
bench_encode_from_utf8_en 9,436 (6858 MB/s) 8,687 (7449 MB/s) -749 -7.94% x 1.09
bench_encode_from_utf8_fr 20,583 (6939 MB/s) 19,933 (7166 MB/s) -650 -3.16% x 1.03
bench_encode_from_utf8_he 7,633 (6491 MB/s) 7,059 (7018 MB/s) -574 -7.52% x 1.08
bench_encode_from_utf8_ja 2,870 (11384 MB/s) 4,280 (7633 MB/s) 1,410 49.13% x 0.67
bench_encode_from_utf8_jquery 12,718 (6817 MB/s) 13,172 (6582 MB/s) 454 3.57% x 0.97
bench_encode_from_utf8_ko 1,664 (11996 MB/s) 2,252 (8864 MB/s) 588 35.34% x 0.74
bench_encode_from_utf8_ru 17,480 (6692 MB/s) 16,555 (7066 MB/s) -925 -5.29% x 1.06
bench_encode_from_utf8_th 26,540 (6996 MB/s) 25,689 (7228 MB/s) -851 -3.21% x 1.03
bench_encode_from_utf8_tr 8,381 (6691 MB/s) 7,414 (7564 MB/s) -967 -11.54% x 1.13
bench_encode_from_utf8_zh_cn 2,645 (11677 MB/s) 3,194 (9670 MB/s) 549 20.76% x 0.83
bench_encode_from_utf8_zh_tw 2,625 (11790 MB/s) 3,356 (9221 MB/s) 731 27.85% x 0.78
bench_label_rs_xseucpkdfmtjapanese 255 235 -20 -7.84% x 1.09
bench_mem_check_utf16_for_latin1_and_bidi_de_1000 203,206 (4921 MB/s) 231,389 (4321 MB/s) 28,183 13.87% x 0.88
bench_mem_check_utf16_for_latin1_and_bidi_ja_1000 675,737 (1479 MB/s) 715,689 (1397 MB/s) 39,952 5.91% x 0.94
bench_mem_convert_str_to_utf16_ascii_1000 179,811 (2780 MB/s) 189,200 (2642 MB/s) 9,389 5.22% x 0.95
bench_mem_convert_str_to_utf16_bmp_1000 2,313,058 (216 MB/s) 2,386,712 (209 MB/s) 73,654 3.18% x 0.97
bench_mem_convert_utf16_to_str_bmp_1000 4,302,345 (232 MB/s) 4,125,145 (242 MB/s) -177,200 -4.12% x 1.04
bench_mem_convert_utf16_to_utf8_ascii_1000 213,731 (4678 MB/s) 220,881 (4527 MB/s) 7,150 3.35% x 0.97
bench_mem_convert_utf16_to_utf8_bmp_1000 4,284,716 (233 MB/s) 4,072,866 (245 MB/s) -211,850 -4.94% x 1.05
bench_mem_convert_utf8_to_utf16_ascii_1000 183,529 (2724 MB/s) 190,768 (2620 MB/s) 7,239 3.94% x 0.96
bench_mem_copy_ascii_to_ascii_1000 141,813 (3525 MB/s) 120,232 (4158 MB/s) -21,581 -15.22% x 1.18
bench_mem_copy_ascii_to_basic_latin_1000 181,765 (2750 MB/s) 189,377 (2640 MB/s) 7,612 4.19% x 0.96
bench_mem_ensure_utf16_validity_ascii_1000 211,703 (4723 MB/s) 232,711 (4297 MB/s) 21,008 9.92% x 0.91
bench_mem_ensure_utf16_validity_bmp_1000 211,654 (4724 MB/s) 232,717 (4297 MB/s) 21,063 9.95% x 0.91
bench_mem_ensure_utf16_validity_latin1_1000 211,638 (4725 MB/s) 232,710 (4297 MB/s) 21,072 9.96% x 0.91
bench_mem_is_str_latin1_false_1000 3,274 (152871 MB/s) 3,529 (141824 MB/s) 255 7.79% x 0.93
bench_mem_is_str_latin1_true_1000 123,157 (4059 MB/s) 113,497 (4405 MB/s) -9,660 -7.84% x 1.09
bench_mem_is_utf16_bidi_de_1000 260,038 (3845 MB/s) 218,217 (4582 MB/s) -41,821 -16.08% x 1.19
bench_mem_is_utf16_bidi_ja_1000 683,179 (1463 MB/s) 716,270 (1396 MB/s) 33,091 4.84% x 0.95
bench_mem_is_utf16_bidi_ru_1000 401,691 (2489 MB/s) 374,750 (2668 MB/s) -26,941 -6.71% x 1.07
bench_mem_is_utf16_latin1_true_1000 118,743 (8421 MB/s) 123,255 (8113 MB/s) 4,512 3.80% x 0.96
bench_mem_is_utf8_bidi_de_1000 387,101 (1291 MB/s) 368,808 (1355 MB/s) -18,293 -4.73% x 1.05
bench_mem_is_utf8_latin1_false_1000 5,282 (94661 MB/s) 5,538 (90285 MB/s) 256 4.85% x 0.95
bench_mem_utf16_valid_up_to_ascii_1000 210,709 (4745 MB/s) 232,988 (4292 MB/s) 22,279 10.57% x 0.90
bench_mem_utf16_valid_up_to_bmp_1000 210,710 (4745 MB/s) 232,958 (4292 MB/s) 22,248 10.56% x 0.90
bench_mem_utf16_valid_up_to_latin1_1000 210,733 (4745 MB/s) 232,968 (4292 MB/s) 22,235 10.55% x 0.90
It's hard to make a call either way. Do you see a pattern that helps you call one clearly better than the other?
Curiously, when making changes like testing `vpmax_u8` vs. `vpmax_u16` for the `m16x8` reduction, the results were remarkably noiseless. (Benchmark runs on this computer are remarkably repeatable, so changes over 2% either way, after picking the per-bench best of 4 runs, generally do indicate a difference in code performance and not just randomness in the benchmarking runs.) Yet, comparing `simd` vs. your PR, there are rather large changes even though the instructions generated for the reductions are the same. This suggests that the overall structural differences between `simd` and `packed_simd` could cause semi-chaotic, hard-to-articulate differences in how the optimization passes happen to do stuff.
$ cargo benchcmp --threshold 3 simd_never.txt arm32bred_never.txt | grep -v '_1[^0]' | grep -v _3
name simd_never.txt ns/iter arm32bred_never.txt ns/iter diff ns/iter diff % speedup
bench_decode_to_string_ar 560,170 (561 MB/s) 533,966 (589 MB/s) -26,204 -4.68% x 1.05
bench_decode_to_string_el 365,064 (664 MB/s) 349,477 (693 MB/s) -15,587 -4.27% x 1.04
bench_decode_to_string_he 473,933 (554 MB/s) 452,095 (581 MB/s) -21,838 -4.61% x 1.05
bench_decode_to_string_jquerycat 523,681 (2649 MB/s) 565,655 (2452 MB/s) 41,974 8.02% x 0.93
bench_decode_to_string_ru 1,063,606 (590 MB/s) 1,020,616 (615 MB/s) -42,990 -4.04% x 1.04
bench_decode_to_utf16_ja_iso_2022_jp 1,592,987 (154 MB/s) 1,840,083 (134 MB/s) 247,096 15.51% x 0.87
bench_decode_to_utf16_jquery 38,801 (2234 MB/s) 44,301 (1957 MB/s) 5,500 14.17% x 0.88
bench_decode_to_utf8_ar 609,612 (516 MB/s) 583,191 (539 MB/s) -26,421 -4.33% x 1.05
bench_decode_to_utf8_el 403,713 (600 MB/s) 386,847 (626 MB/s) -16,866 -4.18% x 1.04
bench_decode_to_utf8_he 514,750 (510 MB/s) 493,475 (532 MB/s) -21,275 -4.13% x 1.04
bench_decode_to_utf8_jquery 43,646 (1986 MB/s) 46,551 (1862 MB/s) 2,905 6.66% x 0.94
bench_decode_to_utf8_ru 1,182,731 (531 MB/s) 1,136,625 (552 MB/s) -46,106 -3.90% x 1.04
bench_decode_to_utf8_zh_tw 401,191 (779 MB/s) 388,837 (803 MB/s) -12,354 -3.08% x 1.03
bench_encode_from_utf16_ar 110,713 (208 MB/s) 107,325 (214 MB/s) -3,388 -3.06% x 1.03
bench_encode_from_utf16_cs 199,214 (243 MB/s) 206,366 (234 MB/s) 7,152 3.59% x 0.97
bench_encode_from_utf16_en 53,005 (1220 MB/s) 55,295 (1170 MB/s) 2,290 4.32% x 0.96
bench_encode_from_utf16_ja_euc_jp 8,269,645 (2 MB/s) 9,796,333 (2 MB/s) 1,526,688 18.46% x 0.84
bench_encode_from_utf16_ja_iso_2022_jp 8,811,931 (2 MB/s) 10,315,491 (2 MB/s) 1,503,560 17.06% x 0.85
bench_encode_from_utf16_ja_shift_jis 8,964,035 (2 MB/s) 10,420,600 (2 MB/s) 1,456,565 16.25% x 0.86
bench_encode_from_utf16_jquery 53,468 (1621 MB/s) 46,969 (1846 MB/s) -6,499 -12.15% x 1.14
bench_encode_from_utf16_th_windows_874 813,704 (81 MB/s) 869,270 (76 MB/s) 55,566 6.83% x 0.94
bench_encode_from_utf16_zh_cn_gb18030 25,062,645 29,195,487 4,132,842 16.49% x 0.86
bench_encode_from_utf16_zh_tw_big5 21,373,295 24,908,525 3,535,230 16.54% x 0.86
bench_encode_from_utf8_ar 1,983 (11620 MB/s) 2,438 (9452 MB/s) 455 22.95% x 0.81
bench_encode_from_utf8_cs 6,984 (6935 MB/s) 5,851 (8278 MB/s) -1,133 -16.22% x 1.19
bench_encode_from_utf8_el 2,839 (11424 MB/s) 3,498 (9272 MB/s) 659 23.21% x 0.81
bench_encode_from_utf8_en 9,385 (6895 MB/s) 8,687 (7449 MB/s) -698 -7.44% x 1.08
bench_encode_from_utf8_fr 20,567 (6945 MB/s) 19,933 (7166 MB/s) -634 -3.08% x 1.03
bench_encode_from_utf8_ja 3,038 (10754 MB/s) 4,280 (7633 MB/s) 1,242 40.88% x 0.71
bench_encode_from_utf8_ja_euc_jp 8,298,254 (2 MB/s) 9,817,691 (2 MB/s) 1,519,437 18.31% x 0.85
bench_encode_from_utf8_ja_iso_2022_jp 9,551,929 (2 MB/s) 10,448,937 (2 MB/s) 897,008 9.39% x 0.91
bench_encode_from_utf8_ja_shift_jis 8,981,893 (2 MB/s) 10,428,791 (2 MB/s) 1,446,898 16.11% x 0.86
bench_encode_from_utf8_jquery 12,487 (6943 MB/s) 13,172 (6582 MB/s) 685 5.49% x 0.95
bench_encode_from_utf8_ko 1,639 (12179 MB/s) 2,252 (8864 MB/s) 613 37.40% x 0.73
bench_encode_from_utf8_ru 17,514 (6679 MB/s) 16,555 (7066 MB/s) -959 -5.48% x 1.06
bench_encode_from_utf8_th_windows_874 984,616 (67 MB/s) 1,016,300 (65 MB/s) 31,684 3.22% x 0.97
bench_encode_from_utf8_tr 8,371 (6699 MB/s) 7,414 (7564 MB/s) -957 -11.43% x 1.13
bench_encode_from_utf8_vi 12,106 (7164 MB/s) 12,602 (6882 MB/s) 496 4.10% x 0.96
bench_encode_from_utf8_zh_cn 2,673 (11554 MB/s) 3,194 (9670 MB/s) 521 19.49% x 0.84
bench_encode_from_utf8_zh_cn_gb18030 25,101,808 29,192,154 4,090,346 16.30% x 0.86
bench_encode_from_utf8_zh_tw 2,725 (11357 MB/s) 3,356 (9221 MB/s) 631 23.16% x 0.81
bench_encode_from_utf8_zh_tw_big5 21,374,658 24,891,266 3,516,608 16.45% x 0.86
bench_encode_to_vec_ja_euc_jp 8,299,658 (2 MB/s) 9,817,745 (2 MB/s) 1,518,087 18.29% x 0.85
bench_encode_to_vec_ja_iso_2022_jp 9,553,366 (2 MB/s) 10,448,916 (2 MB/s) 895,550 9.37% x 0.91
bench_encode_to_vec_ja_shift_jis 8,983,749 (2 MB/s) 10,431,337 (2 MB/s) 1,447,588 16.11% x 0.86
bench_encode_to_vec_th_windows_874 985,458 (67 MB/s) 1,016,600 (65 MB/s) 31,142 3.16% x 0.97
bench_encode_to_vec_zh_cn_gb18030 25,103,900 29,198,329 4,094,429 16.31% x 0.86
bench_encode_to_vec_zh_tw_big5 21,371,116 24,899,750 3,528,634 16.51% x 0.86
bench_label_rs_cseucpkdfmtjapanesx 119 130 11 9.24% x 0.92
bench_label_rs_utf_8 87 80 -7 -8.05% x 1.09
bench_label_rs_utf_8_upper 83 80 -3 -3.61% x 1.04
bench_label_rs_xseucpkdfmtjapanese 222 235 13 5.86% x 0.94
bench_mem_check_utf16_for_latin1_and_bidi_ja_1000 1,202,808 (831 MB/s) 715,689 (1397 MB/s) -487,119 -40.50% x 1.68
bench_mem_check_utf16_for_latin1_and_bidi_ru_1000 439,641 (2274 MB/s) 353,064 (2832 MB/s) -86,577 -19.69% x 1.25
bench_mem_check_utf16_for_latin1_and_bidi_th_1000 945,079 (1058 MB/s) 498,152 (2007 MB/s) -446,927 -47.29% x 1.90
bench_mem_convert_latin1_to_utf8_1000 610,233 (819 MB/s) 639,258 (782 MB/s) 29,025 4.76% x 0.95
bench_mem_convert_utf16_to_str_bmp_1000 4,311,470 (231 MB/s) 4,125,145 (242 MB/s) -186,325 -4.32% x 1.05
bench_mem_convert_utf16_to_utf8_bmp_1000 4,323,800 (231 MB/s) 4,072,866 (245 MB/s) -250,934 -5.80% x 1.06
bench_mem_copy_ascii_to_ascii_1000 126,927 (3939 MB/s) 120,232 (4158 MB/s) -6,695 -5.27% x 1.06
bench_mem_is_ascii_false_1000 7,284 (68643 MB/s) 7,039 (71032 MB/s) -245 -3.36% x 1.03
bench_mem_is_ascii_true_1000 76,802 (6510 MB/s) 70,546 (7087 MB/s) -6,256 -8.15% x 1.09
bench_mem_is_basic_latin_false_1000 7,213 (138638 MB/s) 6,409 (156030 MB/s) -804 -11.15% x 1.13
bench_mem_is_str_bidi_de_1000 369,654 (1352 MB/s) 383,643 (1303 MB/s) 13,989 3.78% x 0.96
bench_mem_is_str_bidi_ru_1000 2,017,245 (247 MB/s) 2,107,116 (237 MB/s) 89,871 4.46% x 0.96
bench_mem_is_str_latin1_false_1000 3,278 (152684 MB/s) 3,529 (141824 MB/s) 251 7.66% x 0.93
bench_mem_is_utf16_bidi_ja_1000 1,186,595 (842 MB/s) 716,270 (1396 MB/s) -470,325 -39.64% x 1.66
bench_mem_is_utf16_bidi_ru_1000 438,825 (2278 MB/s) 374,750 (2668 MB/s) -64,075 -14.60% x 1.17
bench_mem_is_utf16_bidi_th_1000 945,095 (1058 MB/s) 498,152 (2007 MB/s) -446,943 -47.29% x 1.90
bench_mem_is_utf8_bidi_de_1000 399,627 (1251 MB/s) 368,808 (1355 MB/s) -30,819 -7.71% x 1.08
bench_mem_is_utf8_bidi_th_1000 2,164,933 (230 MB/s) 1,677,325 (298 MB/s) -487,608 -22.52% x 1.29
bench_mem_is_utf8_latin1_false_1000 5,036 (99285 MB/s) 5,538 (90285 MB/s) 502 9.97% x 0.91
> Do you see a pattern that helps you call one clearly better than the other?
Not really. It seems, however, that the current 0.3.3 version and the PR aren't consistently worse than `simd`, just different. Why? Slightly different IR being generated, different versions of the LLVM backend, the compiler, etc.
If you want to squeeze the maximum out of this, the best thing you could do is, e.g., take the best results of the 0.3.3 branch and the best results of the `simd` crate, and merge them into a single benchmark file. Then take my PR branch and try to modify it to be either better or have only tiny regressions on your armv7+neon target, using, e.g., `cargo benchcmp` to visualize the results.
Modifying the generated code is not hard. See the PR for the diff; there is a macro that allows you to pattern-match on the mask vector types and implement the `all` and `any` operations however you see fit. I am on Discord and Zulip and can help you try to improve the performance on these targets, but without the ability to benchmark, this is not something I can do myself (I'm basically making changes blindly based on gut feeling and LLVM MCA, which does not appear to be very accurate).
Thanks.
I had hoped that two `vpmax` operations followed by an extraction to one ALU register would have been either clearly better or clearly worse than one `vpmax` operation followed by a simultaneous extraction to two ALU registers.
Now that it's a mix, considering other schedule constraints, I think I shouldn't go down the rathole of trying to see if there's a way to somehow have the best of both options. (That seems like a relatively open-ended exploration with the likely outcome there isn't a way for the library to do things so that it always works out optimally for the callers.)
So let's keep `packed_simd` in its current state, which is also supported by LLVM MCA (one `vpmax` followed by a simultaneous extraction to two ALU registers). I consider the initial report (four `vpmax` operations being inefficient) now fixed.
Thank you for the help, and sorry about my focus on a simple case, `bench_mem_copy_ascii_to_ascii_1000`, not generalizing and causing work that probably won't land.
If you are able to extract minimal examples that do not generate optimal code but should, we can file LLVM bugs. For that, we need to know what the optimal code looks like, though.
Steps to reproduce

1. On the `simd` branch, edit `src/mem.rs` to change `copy_ascii_to_ascii` to `inline(never)`, build, and `objdump -d` the result.
2. On the `packed_simd` branch, edit `src/mem.rs` to change `copy_ascii_to_ascii` to `inline(never)`, build, and `objdump -d` the result.

Actual results
With the old `simd` crate, boolean reductions on ARMv7 use `vpmax.u8` twice to fold the vector onto itself and then use `vmov.32` once. `packed_simd` instead uses `vmov.32` four times and then ORs them together on the ALU.

`simd`:

`packed_simd`:
:Expected results
Expected `packed_simd` to implement horizontal boolean reductions on ARMv7 the same way as `simd`.