rust-lang / packed_simd

Portable Packed SIMD Vectors for Rust standard library
https://rust-lang.github.io/packed_simd/packed_simd_2/
Apache License 2.0

Inefficient horizontal boolean reductions for ARMv7 in Thumb2 mode #215

Closed hsivonen closed 5 years ago

hsivonen commented 5 years ago

Steps to reproduce

  1. Clone https://github.com/hsivonen/encoding_rs
  2. Checkout the simd branch
  3. Edit src/mem.rs to change copy_ascii_to_ascii to inline(never)
  4. Compile in release mode
  5. objdump -d the result
  6. Checkout the packed_simd branch
  7. Edit src/mem.rs to change copy_ascii_to_ascii to inline(never)
  8. Compile in release mode
  9. objdump -d the result

Actual results

With the old simd crate, boolean reductions on ARMv7 use vpmax.u8 twice to fold the vector onto itself and then use vmov.32 once. packed_simd instead uses vmov.32 four times and then ORs the results together on the ALU.

simd:

   6c17e:   f921 0a0f   vld1.8  {d0-d1}, [r1]
   6c182:   ef89 2050   vshr.s8 q1, q0, #7
   6c186:   ff02 2a03   vpmax.u8    d2, d2, d3
   6c18a:   ff02 2a00   vpmax.u8    d2, d2, d0
   6c18e:   ee12 1b10   vmov.32 r1, d2[0]
   6c192:   2900        cmp r1, #0

packed_simd:

   56334:   f961 0a0f   vld1.8  {d16-d17}, [r1]
   56338:   efc9 2070   vshr.s8 q9, q8, #7
   5633c:   ee33 1b90   vmov.32 r1, d19[1]
   56340:   ee32 4b90   vmov.32 r4, d18[1]
   56344:   ee13 5b90   vmov.32 r5, d19[0]
   56348:   ee12 6b90   vmov.32 r6, d18[0]
   5634c:   4321        orrs    r1, r4
   5634e:   ea46 0405   orr.w   r4, r6, r5
   56352:   4321        orrs    r1, r4

Expected results

Expected packed_simd to implement horizontal boolean reductions on ARMv7 the same way as simd.
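The two reduction strategies above can be modeled in portable scalar Rust (a sketch of the semantics only, not the actual simd/packed_simd code; lane masks are assumed to be 0x00 or 0xFF per byte):

```rust
// Scalar model of the two ARMv7 `any` reduction strategies for a 16-lane
// byte mask. Portable sketch, not the real NEON code.

/// Model of the simd-crate approach: fold with pairwise max (vpmax.u8)
/// twice, then compare one 32-bit chunk against zero (vmov.32 + cmp).
fn any_vpmax(mask: [u8; 16]) -> bool {
    // first vpmax.u8: 16 lanes -> 8 lanes
    let mut half = [0u8; 8];
    for i in 0..8 {
        half[i] = mask[2 * i].max(mask[2 * i + 1]);
    }
    // second vpmax.u8: 8 lanes -> 4 meaningful lanes
    let mut quarter = [0u8; 4];
    for i in 0..4 {
        quarter[i] = half[2 * i].max(half[2 * i + 1]);
    }
    // vmov.32 + cmp: view the 4 remaining bytes as one 32-bit word
    u32::from_le_bytes(quarter) != 0
}

/// Model of the codegen reported for packed_simd: move four 32-bit words
/// to the ALU and OR them together.
fn any_or_words(mask: [u8; 16]) -> bool {
    let mut acc = 0u32;
    for chunk in mask.chunks_exact(4) {
        acc |= u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    acc != 0
}

fn main() {
    let mut m = [0u8; 16];
    assert!(!any_vpmax(m) && !any_or_words(m));
    m[13] = 0xFF;
    assert!(any_vpmax(m) && any_or_words(m));
}
```

Both compute the same predicate; the difference is purely in which execution units the real instructions occupy.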

gnzlbg commented 5 years ago

packed_simd instead uses vmov.32 four times and then ORs the results together on the ALU.

Are you using packed_simd with the core_arch feature enabled ? (EDIT: It appears that this is not the case: https://github.com/hsivonen/encoding_rs/blob/packed_simd/Cargo.toml#L17)

hsivonen commented 5 years ago

Only using into_bits from the cargo features of packed_simd.

gnzlbg commented 5 years ago

You need to enable the core_arch feature (previously called coresimd) and compile with -C target-feature=+neon for that to work.

This is because the baseline ARMv7 target does not guarantee NEON support, so emitting vpmax.u8 unconditionally would trigger undefined behavior.

gnzlbg commented 5 years ago

I'm closing this, feel free to re-open if using -C target-feature=+neon and the core_arch cargo feature do not fix this issue for you.

hsivonen commented 5 years ago

I'm not using -C target-feature=+neon with packed_simd because I'm using the thumbv7neon-unknown-linux-gnueabihf target, which was supposed to make a NEON-enabled core::arch available to packed_simd.

When using the core_arch feature, the code uses vpmax.u8 four times per vector and the result is still slower than simd:

packed_simd, inline(never), core_arch test bench_mem_copy_ascii_to_ascii_1000 ... bench: 146,646 ns/iter (+/- 971) = 3409 MB/s

   6cb36:   f961 0a0f   vld1.8  {d16-d17}, [r1]
   6cb3a:   efc9 2070   vshr.s8 q9, q8, #7
   6cb3e:   ff42 2aa3   vpmax.u8    d18, d18, d19
   6cb42:   ff42 2aa0   vpmax.u8    d18, d18, d16
   6cb46:   ff42 2aa0   vpmax.u8    d18, d18, d16
   6cb4a:   ff42 2aa0   vpmax.u8    d18, d18, d16
   6cb4e:   eed2 1b90   vmov.u8 r1, d18[0]
   6cb52:   2900        cmp r1, #0

hsivonen commented 5 years ago

(GitHub doesn't show me UI to reopen this.)

hsivonen commented 5 years ago

So I think there are two bugs:

  1. thumbv7neon-unknown-linux-gnueabihf requires the core_arch feature to use NEON.
  2. 4 times vpmax.u8 followed by vmov.u8 instead of 2 times vpmax.u8 followed by vmov.32.

Let's keep this issue about item 2.

hsivonen commented 5 years ago

Filed another issue about item 1.

gnzlbg commented 5 years ago

2. 4 times vpmax.u8 followed by vmov.u8 instead of 2 times vpmax.u8 followed by vmov.32.

That looks bad, I'll investigate.

gnzlbg commented 5 years ago

Which vector type and which reductions are you using ?

gnzlbg commented 5 years ago

m8x16 and using the any reduction ?

gnzlbg commented 5 years ago

@hsivonen is it this code here: https://github.com/hsivonen/encoding_rs/blob/54cf6eb37d26237c3d62268f2cb7f57609e800eb/src/simd_funcs.rs#L146 ?

gnzlbg commented 5 years ago

It appears that the implementation in the simd crate (https://github.com/gnzlbg/simd-1/blob/master/src/arm/neon.rs#L589) performs two vpmax instead of four, and then uses a bitcast.

hsivonen commented 5 years ago

is it this code here: https://github.com/hsivonen/encoding_rs/blob/54cf6eb37d26237c3d62268f2cb7f57609e800eb/src/simd_funcs.rs#L146 ?

Yes, that's the code.

gnzlbg commented 5 years ago

So I've submitted a PR to fix this (https://github.com/rust-lang-nursery/packed_simd/pull/218). It only performs a single vpmax instead of two or four (it's hard to know what performs best without a system to benchmark on). I'm not sure yet if that's correct; I think so, but if not, CI should catch it. You might want to check whether that branch performs better for you.

gnzlbg commented 5 years ago

I experimented a bit in godbolt, and it seems that bitcasting to a 64-bit integer is the sweet spot (godbolt):

define i1 @any_no_neon(<16 x i8>)  {
    %2 = bitcast <16 x i8> %0 to i128
    %3 = icmp ne i128 %2, 0
    ret i1 %3
}

define i1 @any_neon_64bitcast(<16 x i8>)  {
    %2 = shufflevector 
        <16 x i8> %0, <16 x i8> undef, 
        <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    %3 = shufflevector 
        <16 x i8> %0, <16 x i8> undef, 
        <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
    %4 = tail call <8 x i8> @llvm.arm.neon.vpmaxu.v8i8(<8 x i8> %2, <8 x i8> %3)                
    %5 = bitcast <8 x i8> %4 to i64
    %6 = icmp ne i64 %5, 0
    ret i1 %6
}

define i1 @any_neon_32bitcast(<16 x i8>)  {
    %2 = shufflevector 
        <16 x i8> %0, <16 x i8> undef, 
        <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    %3 = shufflevector 
        <16 x i8> %0, <16 x i8> undef, 
        <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
    %4 = tail call <8 x i8> @llvm.arm.neon.vpmaxu.v8i8(<8 x i8> %2, <8 x i8> %3)
    %5 = tail call <8 x i8> @llvm.arm.neon.vpmaxu.v8i8(<8 x i8> %4, <8 x i8> undef) 
    %6 = bitcast <8 x i8> %5 to <2 x i32>         
    %7 = extractelement <2 x i32> %6, i32 0
    %8 = icmp ne i32 %7, 0
    ret i1 %8
}

declare <8 x i8> @llvm.arm.neon.vpmaxu.v8i8(<8 x i8>, <8 x i8>)

produces

any_no_neon:
        vmov    d17, r2, r3
        vmov    d16, r0, r1
        vmov.32 r0, d17[1]
        vmov.32 r2, d17[0]
        vmov.32 r1, d16[1]
        vmov.32 r3, d16[0]
        orr     r0, r1, r0
        orr     r1, r3, r2
        orrs    r0, r1, r0
        movne   r0, #1
        mov     pc, lr
any_neon_64bitcast:
        vmov    d16, r2, r3
        vmov    d17, r0, r1
        vpmax.u8        d16, d17, d16
        vmov    r0, r1, d16
        orrs    r0, r0, r1
        movne   r0, #1
        mov     pc, lr
any_neon_32bitcast:
        vmov    d16, r2, r3
        vmov    d17, r0, r1
        vpmax.u8        d16, d17, d16
        vpmax.u8        d16, d16, d16
        vmov.32 r0, d16[0]
        cmp     r0, #0
        movne   r0, #1
        mov     pc, lr

which in terms of instruction count hints that the 64-bit bitcast is the best option.
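The winning any_neon_64bitcast strategy can be sketched in portable scalar Rust (a model of the semantics only; the real code performs one vpmax.u8 on NEON registers and bitcasts a d-register to a 64-bit integer):

```rust
// Portable scalar model of the "one vpmax, then bitcast to a 64-bit
// integer" reduction favored above. Lane masks assumed 0x00 or 0xFF.
fn any_u64_bitcast(mask: [u8; 16]) -> bool {
    // one pairwise max: 16 lanes -> 8 lanes (models vpmax.u8 d16, d17, d16)
    let mut half = [0u8; 8];
    for i in 0..8 {
        half[i] = mask[2 * i].max(mask[2 * i + 1]);
    }
    // models `vmov r0, r1, d16` + `orrs r0, r0, r1`: the 8 remaining
    // bytes viewed as a single 64-bit integer, compared against zero
    u64::from_le_bytes(half) != 0
}

fn main() {
    assert!(!any_u64_bitcast([0u8; 16]));
    let mut m = [0u8; 16];
    m[7] = 0xFF;
    assert!(any_u64_bitcast(m));
}
```

This saves the second vpmax of the 32-bit variant at the cost of ORing two GPRs, which is why the instruction count comes out lowest.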

gnzlbg commented 5 years ago

So I ran this through LLVM MCA (godbolt). For 100 iterations on a cortex-a9, its predictions appear to confirm that the version with one vpmax and one bitcast to a 64-bit integer performs best, on a cortex-a9 at least.

gnzlbg commented 5 years ago

This bug should be fixed in version 0.3.3.

hsivonen commented 5 years ago

Thank you. With this change, I get:

   59346:   f961 0a0f   vld1.8  {d16-d17}, [r1]
   5934a:   efc9 2070   vshr.s8 q9, q8, #7
   5934e:   ff42 2aa3   vpmax.u8    d18, d18, d19
   59352:   ec54 1b32   vmov    r1, r4, d18
   59356:   4321        orrs    r1, r4
   59358:   d147        bne.n   593ea <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h8e9e53db7143347dE+0xc2>

In the case of copy_ascii_to_ascii, it's still slower than the simd crate on Exynos 5:

simd, inline(never) test bench_mem_copy_ascii_to_ascii_1000 ... bench: 120,045 ns/iter (+/- 686) = 4165 MB/s

packed_simd, inline(never), 0.3.3 test bench_mem_copy_ascii_to_ascii_1000 ... bench: 141,979 ns/iter (+/- 10,279) = 3521 MB/s

I'll run a broader benchmark next.

Did you happen to write the any_neon_32bitcast version into Rust code that I could try?

gnzlbg commented 5 years ago

In the case of copy_ascii_to_ascii, it's still slower than the simd crate on Exynos 5:

What target CPU are you using ? LLVM MCA trunk only lists exynos-m1, ..., exynos-m4.

Did you happen to write the any_neon_32bitcast version into Rust code that I could try?

Not really, I wrote the LLVM IR directly, but it would be:

#![feature(
    repr_simd, platform_intrinsics, 
    link_llvm_intrinsics, simd_ffi
)]
#![allow(non_camel_case_types, improper_ctypes)]

extern {
    #[link_name = "llvm.arm.neon.vpmaxu.v8i8"]
    fn vpmax_u8(a: m8x8, b: m8x8) -> m8x8;
}

#[repr(simd)]
#[derive(Copy, Clone)]
pub struct m8x16(
    u8, u8, u8, u8, u8, u8, u8, u8,
    u8, u8, u8, u8, u8, u8, u8, u8,
);

#[repr(simd)]
#[derive(Copy, Clone)]
pub struct m8x8(
    u8, u8, u8, u8, u8, u8, u8, u8, 
);

pub unsafe fn any(x: m8x16) -> bool {
    union U { x: m8x16, y: (m8x8, m8x8) }
    let (a, b) = U { x }.y;
    let c = vpmax_u8(a, b);
    let d = vpmax_u8(c, std::mem::uninitialized());
    union V { d: m8x8, y: (u32, u32) }
    let e = V { d }.y.0;
    e != 0
}

You could add that somewhere to your project, and replace the .any() call with that.

That generates on godbolt:

example::any:
  vld1.64 {d0, d1}, [r0]
  vpmax.u8 d0, d0, d1
  vpmax.u8 d0, d0, d0
  vmov.32 r0, d0[0]
  cmp r0, #0
  movwne r0, #1
  bx lr

which is still slightly different from the code you posted for the simd crate, but the LLVM IR should be identical in both cases, so I'd assume the difference is due to a different LLVM version being used.

hsivonen commented 5 years ago

What target CPU are you using ?

Whatever the Rust default for thumbv7neon-unknown-linux-gnueabihf is in the case of packed_simd and whatever the Rust default is for armv7-unknown-linux-gnueabihf plus RUSTFLAGS='-C target_feature=+neon,+thumb-mode,+thumb2' in the case of simd.

gnzlbg commented 5 years ago

Whatever the Rust default for thumbv7neon-unknown-linux-gnueabihf is in the case of packed_simd

The default CPU of a target has nothing to do with which library you are compiling.

Exynos5 has cortex-a15 cores, so you probably want to specify -C target-cpu=cortex-a15. Running that through LLVM MCA using that CPU produces the same results as before: https://godbolt.org/z/qy0EiN , so while we might be making an error somewhere, it is also possible that LLVM's scheduling profile for that CPU isn't perfect either (assuming we aren't screwing things up in codegen, measurements, etc.).

You might want to file an LLVM bug for the ARM backend, showing the two asm snippets, your benchmark results, and the predictions of LLVM MCA.

gnzlbg commented 5 years ago

Also please let me know if you manage to benchmark the "bitcast to 32-bit version".

gnzlbg commented 5 years ago

I'm reopening this till we get to the bottom of this.

@hsivonen it would be great if you could extract the issue into its own minimal crate. We are comparing two different compiler versions using two different LLVM backends and also two different libraries generating slightly different LLVM-IR (simd and packed_simd). Worst of all, because simd doesn't compile with Rust nightly (1.34) anymore, and packed_simd does not compile with Rust 1.32, we can't really separate these things.

hsivonen commented 5 years ago

The default CPU of a target has nothing to do with which library are you compiling.

I meant that I tested simd and packed_simd with different targets, but, as you note, both targets have the same target CPU scheduling model.

Exynos5 has cortex-a15 cores, so you probably want to specify -C target-cpu=cortex-a15

I'm not trying to target Exynos 5 specifically. I'm trying to write code for ARMv7 phones generally, so I'm compiling for the default/generic CPU. Exynos 5 just happens to be the only ARMv7 chip that I have available for benchmarking.

Not really, I wrote the LLVM IR directly, but it would be

Thank you.

Also please let me know if you manage to benchmark the "bitcast to 32-bit version".

So far, I'm having trouble with this. For some reason, the implementation of any causes test failures without printing any details of the failure reason. Stepping doesn't work normally, and I have no idea why, so I've been unable to examine in gdb how the code behaves. I don't see anything wrong with the code you suggested, though.

Some time into the test run, there's a panic while panicking, and the process dies with illegal instruction. When asking gdb to give an assembly view of that, the instruction shown is in the panic implementation and seems to have nothing to do with SIMD.

hsivonen commented 5 years ago

Changing let d = vpmax_u8(c, std::mem::uninitialized()); to let d = vpmax_u8(c, c); does not help, so it's not about the uninitialized bit causing confusion in LLVM.

gnzlbg commented 5 years ago

You might need to make any #[target_feature(enable = "v7,neon")] (maybe) ?

hsivonen commented 5 years ago

You might need to make any #[target_feature(enable = "v7,neon")] (maybe) ?

I got the same result with that. :-(

gnzlbg commented 5 years ago

Stepping doesn't work normally, and I have no idea why, so I've been unable to examine in gdb how the code behaves. I don't see anything wrong with the code you suggested, though.

Can you try with eprintln! ? E.g. get the implementation of any to print the input and the output, and see if it does something weird. What you describe looks like you are getting a double panic for some reason, but I am not sure what role any could play in that (I don't think it plays any, but who knows; if the code has UB then anything can happen).

hsivonen commented 5 years ago

I've now tried the any implementation in three different ways:

  1. Your suggestion but with std::arch instead of linking LLVM intrinsics.
  2. Your suggestion exactly.
  3. Adapting the code from simd to use packed_simd shuffles and LLVM intrinsics.

In all three cases, the generated code is the same:

   39f62:       f961 0a0f       vld1.8  {d16-d17}, [r1]
   39f66:       efc9 2070       vshr.s8 q9, q8, #7
   39f6a:       fff0 25e2       vmvn    q9, q9
   39f6e:       ff42 2aa3       vpmax.u8        d18, d18, d19
   39f72:       ff42 2aa0       vpmax.u8        d18, d18, d16
   39f76:       ee12 1b90       vmov.32 r1, d18[0]
   39f7a:       2900            cmp     r1, #0

Compared to the code generated by Rust 1.32 for the simd crate, the vmvn instruction is extra. AFAICT, it negates the bits of the vector for no good reason, which is likely why the code fails tests but runs benches.

Any ideas where the vmvn instruction could be coming from?

(I'll try eprintln() next.)

hsivonen commented 5 years ago

Hah. I found the bug. Bad operator placement relative to parentheses.

gnzlbg commented 5 years ago

Hah. I found the bug. Bad operator placement relative to parentheses.

What was exactly the issue? Is packed_simd affected?

hsivonen commented 5 years ago

What was exactly the issue?

When I changed expressions of the form !something.any(), I accidentally did any16x8(!something) instead of !any16x8(something).
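The mix-up can be illustrated with a portable sketch (a plain bool-array model standing in for m16x8; any16x8 here is a hypothetical stand-in for the real reduction): for a lane mask, !any(x) means "no lane set" while any(!x) means "some lane clear", and they disagree whenever the mask is partially set.

```rust
// `!any(x)` vs `any(!x)` on an 8-lane mask, modeled with a bool array.
fn any(lanes: [bool; 8]) -> bool {
    lanes.iter().any(|&l| l)
}

fn not_lanes(lanes: [bool; 8]) -> [bool; 8] {
    lanes.map(|l| !l)
}

fn main() {
    // One lane set, the rest clear: the two expressions disagree.
    let mut x = [false; 8];
    x[3] = true;
    assert_eq!(!any(x), false);         // not "no lane set"
    assert_eq!(any(not_lanes(x)), true); // but "some lane clear"
}
```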

Is packed_simd affected?

No.

gnzlbg commented 5 years ago

Ah I see. So what about the performance of the three approaches when compared to the old simd crate ?

hsivonen commented 5 years ago

For the simplest case, I now get simd-like perf: test bench_mem_copy_ascii_to_ascii_1000 ... bench: 120,451 ns/iter (+/- 14,009) = 4151 MB/s

I'll run the full suite in a few different ways.

gnzlbg commented 5 years ago

For the simplest case, I now get simd-like perf:

Is that using packed_simd, or your custom implementations?

hsivonen commented 5 years ago

Is that using packed_simd, or your custom implementations?

packed_simd, except for the boolean reduction, where I used the code you suggested earlier in this thread (the two-vpmax version).

gnzlbg commented 5 years ago

Does this mean that we can close this? Or does the packed_simd implementation have a perf bug ?

hsivonen commented 5 years ago

Or does the packed_simd implementation have a perf bug ?

It appears that packed_simd has a perf bug.

So far (more benches still running), it looks like the code from your earlier comment, which matches the simd crate for instructions generated in the m8x16 case, is faster than what packed_simd 0.3.3 has.

The simd crate generates different code for reducing m16x8. I'm also benchmarking whether that makes any difference.

gnzlbg commented 5 years ago

I've sent a PR https://github.com/rust-lang-nursery/packed_simd/pull/219 that should generate the same code as the simd crate. Please benchmark that and let me know the results.

hsivonen commented 5 years ago

It appears that packed_simd has a perf bug.

After more benchmarking, this is quite a bit less clear. :-/

The simd crate generates different code for reducing m16x8. I'm also benching if that makes any difference.

Using vpmax_u16 twice to reduce m16x8 does not make a difference compared to using vpmax_u8 twice to reduce m16x8 (i.e. reducing as if bitcasting to m8x16 and reducing that).
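That equivalence holds because mask lanes are all-zeros or all-ones, so an any over the 8 u16 lanes is the same as an any over the same 16 bytes. A portable scalar sketch of why the bitcast is valid (assuming well-formed mask lanes):

```rust
// `any` over u16 mask lanes vs. `any` over the same storage viewed as
// bytes. For masks (each lane 0x0000 or 0xFFFF) the results always agree.
fn any_u16(lanes: [u16; 8]) -> bool {
    lanes.iter().any(|&l| l != 0)
}

fn any_as_bytes(lanes: [u16; 8]) -> bool {
    // models bitcasting m16x8 to m8x16 and reducing that
    lanes.iter().flat_map(|l| l.to_le_bytes()).any(|b| b != 0)
}

fn main() {
    let mut m = [0u16; 8]; // valid mask lanes: 0x0000 or 0xFFFF
    assert_eq!(any_u16(m), any_as_bytes(m));
    m[5] = 0xFFFF;
    assert_eq!(any_u16(m), any_as_bytes(m));
    assert!(any_u16(m));
}
```

Note the equivalence relies on the mask invariant: for arbitrary (non-mask) data, a lane like 0x0100 would still agree here, but the per-lane results of a u16 vs. u8 pairwise max would differ.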

I've sent a PR #219 that should generate the same code as the simd crate. Please benchmark that and let me know the results.

The results are mixed. Your PR is an improvement over the 0.3.3 release for the function I initially focused on upthread: copy_ascii_to_ascii (among the simplest of the functions). Unfortunately, the PR regresses some other benchmarks.

(inline(never) is used so that the functions being benchmarked are not inlined into the benchmarking harness in a way that confuses the results.)

Each result file was made by running the benchmark suite four times and choosing the best result for each bench from the four runs. (Best instead of average, on the logic that a run can't accidentally get too fast, so any run that isn't the best suffered interference from other processes that I don't know how to turn off.)
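The best-of-four selection can be sketched as follows (a hypothetical helper, not part of the benchmark harness; benchmark names and numbers are illustrative):

```rust
use std::collections::HashMap;

// For each benchmark, keep the minimum ns/iter across runs, on the logic
// that interference can only make a run slower, never faster.
fn best_of_runs<'a>(runs: &[HashMap<&'a str, u64>]) -> HashMap<&'a str, u64> {
    let mut best: HashMap<&str, u64> = HashMap::new();
    for run in runs {
        for (&name, &ns) in run {
            best.entry(name)
                .and_modify(|b| *b = (*b).min(ns))
                .or_insert(ns);
        }
    }
    best
}

fn main() {
    let runs = vec![
        HashMap::from([("copy_ascii", 141_813u64)]),
        HashMap::from([("copy_ascii", 120_232u64)]),
    ];
    assert_eq!(best_of_runs(&runs)["copy_ascii"], 120_232);
}
```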

The results are from a Samsung Chromebook 2 running Ubuntu 18.04 on Crouton on Chrome OS 71.

/proc/cpuinfo says:

processor       : 0
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 48.00
Features        : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 idiva idivt 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc0f
CPU revision    : 3

processor       : 1
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 48.00
Features        : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 idiva idivt 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc0f
CPU revision    : 3

processor       : 2
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 48.00
Features        : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 idiva idivt 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc0f
CPU revision    : 3

processor       : 3
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 48.00
Features        : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 idiva idivt 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc0f
CPU revision    : 3

Hardware        : SAMSUNG EXYNOS5 (Flattened Device Tree)
Revision        : 0000
Serial          : 0000000000000000

uname -a says: Linux localhost 3.8.11 #1 SMP Thu Jan 10 22:24:15 PST 2019 armv7l armv7l armv7l GNU/Linux.

For the bench_mem_ benches, the ones that end with something other than _1000 are noise: optimizer decisions make them go either way without any pattern that could be traced to the code changes. Therefore, it helps to ignore those and keep only the bench_mem_ items that end with _1000.

$ cargo benchcmp --threshold 3 033_never.txt arm32bred_never.txt | grep -v '_1[^0]' | grep -v _3
 name                                               033_never.txt ns/iter  arm32bred_never.txt ns/iter  diff ns/iter   diff %  speedup 
 bench_decode_to_string_ar                          553,235 (568 MB/s)     533,966 (589 MB/s)                -19,269   -3.48%   x 1.04 
 bench_decode_to_string_ko_euc_kr                   433,733 (364 MB/s)     412,185 (383 MB/s)                -21,548   -4.97%   x 1.05 
 bench_decode_to_utf16_de                           295,431 (1079 MB/s)    304,664 (1046 MB/s)                 9,233    3.13%   x 0.97 
 bench_decode_to_utf16_en                           426,945 (1654 MB/s)    440,716 (1603 MB/s)                13,771    3.23%   x 0.97 
 bench_decode_to_utf16_fr                           1,027,770 (914 MB/s)   1,059,029 (887 MB/s)               31,259    3.04%   x 0.97 
 bench_decode_to_utf16_ja_shift_jis                 608,687 (377 MB/s)     628,079 (365 MB/s)                 19,392    3.19%   x 0.97 
 bench_decode_to_utf16_jquery                       39,082 (2218 MB/s)     44,301 (1957 MB/s)                  5,219   13.35%   x 0.88 
 bench_decode_to_utf16_th_windows_874               1,053,734 (763 MB/s)   1,132,539 (710 MB/s)               78,805    7.48%   x 0.93 
 bench_decode_to_utf8_ar                            605,716 (519 MB/s)     583,191 (539 MB/s)                -22,525   -3.72%   x 1.04 
 bench_decode_to_utf8_he                            509,095 (516 MB/s)     493,475 (532 MB/s)                -15,620   -3.07%   x 1.03 
 bench_decode_to_utf8_jquery                        44,591 (1944 MB/s)     46,551 (1862 MB/s)                  1,960    4.40%   x 0.96 
 bench_decode_to_utf8_ko_euc_kr                     433,883 (364 MB/s)     411,766 (383 MB/s)                -22,117   -5.10%   x 1.05 
 bench_decode_to_utf8_zh_cn_gb18030                 821,212 (358 MB/s)     793,208 (371 MB/s)                -28,004   -3.41%   x 1.04 
 bench_encode_from_utf16_ar                         111,534 (206 MB/s)     107,325 (214 MB/s)                 -4,209   -3.77%   x 1.04 
 bench_encode_from_utf16_cs                         199,170 (243 MB/s)     206,366 (234 MB/s)                  7,196    3.61%   x 0.97 
 bench_encode_from_utf16_el                         158,954 (204 MB/s)     152,994 (211 MB/s)                 -5,960   -3.75%   x 1.04 
 bench_encode_from_utf16_th_windows_874             805,283 (82 MB/s)      869,270 (76 MB/s)                  63,987    7.95%   x 0.93 
 bench_encode_from_utf16_tr                         231,040 (242 MB/s)     241,263 (232 MB/s)                 10,223    4.42%   x 0.96 
 bench_encode_from_utf16_zh_tw_big5                 23,132,804             24,908,525                      1,775,721    7.68%   x 0.93 
 bench_encode_from_utf8_ar                          1,890 (12192 MB/s)     2,438 (9452 MB/s)                     548   28.99%   x 0.78 
 bench_encode_from_utf8_cs                          7,093 (6829 MB/s)      5,851 (8278 MB/s)                  -1,242  -17.51%   x 1.21 
 bench_encode_from_utf8_de                          11,207 (6864 MB/s)     10,782 (7134 MB/s)                   -425   -3.79%   x 1.04 
 bench_encode_from_utf8_el                          2,827 (11472 MB/s)     3,498 (9272 MB/s)                     671   23.74%   x 0.81 
 bench_encode_from_utf8_en                          9,436 (6858 MB/s)      8,687 (7449 MB/s)                    -749   -7.94%   x 1.09 
 bench_encode_from_utf8_fr                          20,583 (6939 MB/s)     19,933 (7166 MB/s)                   -650   -3.16%   x 1.03 
 bench_encode_from_utf8_he                          7,633 (6491 MB/s)      7,059 (7018 MB/s)                    -574   -7.52%   x 1.08 
 bench_encode_from_utf8_ja                          2,870 (11384 MB/s)     4,280 (7633 MB/s)                   1,410   49.13%   x 0.67 
 bench_encode_from_utf8_jquery                      12,718 (6817 MB/s)     13,172 (6582 MB/s)                    454    3.57%   x 0.97 
 bench_encode_from_utf8_ko                          1,664 (11996 MB/s)     2,252 (8864 MB/s)                     588   35.34%   x 0.74 
 bench_encode_from_utf8_ru                          17,480 (6692 MB/s)     16,555 (7066 MB/s)                   -925   -5.29%   x 1.06 
 bench_encode_from_utf8_th                          26,540 (6996 MB/s)     25,689 (7228 MB/s)                   -851   -3.21%   x 1.03 
 bench_encode_from_utf8_tr                          8,381 (6691 MB/s)      7,414 (7564 MB/s)                    -967  -11.54%   x 1.13 
 bench_encode_from_utf8_zh_cn                       2,645 (11677 MB/s)     3,194 (9670 MB/s)                     549   20.76%   x 0.83 
 bench_encode_from_utf8_zh_tw                       2,625 (11790 MB/s)     3,356 (9221 MB/s)                     731   27.85%   x 0.78 
 bench_label_rs_xseucpkdfmtjapanese                 255                    235                                   -20   -7.84%   x 1.09 
 bench_mem_check_utf16_for_latin1_and_bidi_de_1000  203,206 (4921 MB/s)    231,389 (4321 MB/s)                28,183   13.87%   x 0.88 
 bench_mem_check_utf16_for_latin1_and_bidi_ja_1000  675,737 (1479 MB/s)    715,689 (1397 MB/s)                39,952    5.91%   x 0.94 
 bench_mem_convert_str_to_utf16_ascii_1000          179,811 (2780 MB/s)    189,200 (2642 MB/s)                 9,389    5.22%   x 0.95 
 bench_mem_convert_str_to_utf16_bmp_1000            2,313,058 (216 MB/s)   2,386,712 (209 MB/s)               73,654    3.18%   x 0.97 
 bench_mem_convert_utf16_to_str_bmp_1000            4,302,345 (232 MB/s)   4,125,145 (242 MB/s)             -177,200   -4.12%   x 1.04 
 bench_mem_convert_utf16_to_utf8_ascii_1000         213,731 (4678 MB/s)    220,881 (4527 MB/s)                 7,150    3.35%   x 0.97 
 bench_mem_convert_utf16_to_utf8_bmp_1000           4,284,716 (233 MB/s)   4,072,866 (245 MB/s)             -211,850   -4.94%   x 1.05 
 bench_mem_convert_utf8_to_utf16_ascii_1000         183,529 (2724 MB/s)    190,768 (2620 MB/s)                 7,239    3.94%   x 0.96 
 bench_mem_copy_ascii_to_ascii_1000                 141,813 (3525 MB/s)    120,232 (4158 MB/s)               -21,581  -15.22%   x 1.18 
 bench_mem_copy_ascii_to_basic_latin_1000           181,765 (2750 MB/s)    189,377 (2640 MB/s)                 7,612    4.19%   x 0.96 
 bench_mem_ensure_utf16_validity_ascii_1000         211,703 (4723 MB/s)    232,711 (4297 MB/s)                21,008    9.92%   x 0.91 
 bench_mem_ensure_utf16_validity_bmp_1000           211,654 (4724 MB/s)    232,717 (4297 MB/s)                21,063    9.95%   x 0.91 
 bench_mem_ensure_utf16_validity_latin1_1000        211,638 (4725 MB/s)    232,710 (4297 MB/s)                21,072    9.96%   x 0.91 
 bench_mem_is_str_latin1_false_1000                 3,274 (152871 MB/s)    3,529 (141824 MB/s)                   255    7.79%   x 0.93 
 bench_mem_is_str_latin1_true_1000                  123,157 (4059 MB/s)    113,497 (4405 MB/s)                -9,660   -7.84%   x 1.09 
 bench_mem_is_utf16_bidi_de_1000                    260,038 (3845 MB/s)    218,217 (4582 MB/s)               -41,821  -16.08%   x 1.19 
 bench_mem_is_utf16_bidi_ja_1000                    683,179 (1463 MB/s)    716,270 (1396 MB/s)                33,091    4.84%   x 0.95 
 bench_mem_is_utf16_bidi_ru_1000                    401,691 (2489 MB/s)    374,750 (2668 MB/s)               -26,941   -6.71%   x 1.07 
 bench_mem_is_utf16_latin1_true_1000                118,743 (8421 MB/s)    123,255 (8113 MB/s)                 4,512    3.80%   x 0.96 
 bench_mem_is_utf8_bidi_de_1000                     387,101 (1291 MB/s)    368,808 (1355 MB/s)               -18,293   -4.73%   x 1.05 
 bench_mem_is_utf8_latin1_false_1000                5,282 (94661 MB/s)     5,538 (90285 MB/s)                    256    4.85%   x 0.95 
 bench_mem_utf16_valid_up_to_ascii_1000             210,709 (4745 MB/s)    232,988 (4292 MB/s)                22,279   10.57%   x 0.90 
 bench_mem_utf16_valid_up_to_bmp_1000               210,710 (4745 MB/s)    232,958 (4292 MB/s)                22,248   10.56%   x 0.90 
 bench_mem_utf16_valid_up_to_latin1_1000            210,733 (4745 MB/s)    232,968 (4292 MB/s)                22,235   10.55%   x 0.90 

It's hard to make a call either way. Do you see a pattern that helps you call one clearly better than the other?

Curiously, when testing changes like vpmax_u8 vs. vpmax_u16 for m16x8 reduction, the results were remarkably noiseless. (Benchmark runs on this computer are remarkably repeatable, so changes over 2% either way, after picking the per-bench best of 4 runs, do generally indicate a difference in code performance and not just randomness in the benchmarking runs.) Yet, comparing simd vs. your PR, there are rather large changes even though the instructions generated for the reductions are the same. This suggests that the overall structural differences between simd and packed_simd could cause semi-chaotic, hard-to-articulate differences in how the optimization passes happen to behave.

$ cargo benchcmp --threshold 3 simd_never.txt arm32bred_never.txt | grep -v '_1[^0]' | grep -v _3
 name                                               simd_never.txt ns/iter  arm32bred_never.txt ns/iter  diff ns/iter   diff %  speedup 
 bench_decode_to_string_ar                          560,170 (561 MB/s)      533,966 (589 MB/s)                -26,204   -4.68%   x 1.05 
 bench_decode_to_string_el                          365,064 (664 MB/s)      349,477 (693 MB/s)                -15,587   -4.27%   x 1.04 
 bench_decode_to_string_he                          473,933 (554 MB/s)      452,095 (581 MB/s)                -21,838   -4.61%   x 1.05 
 bench_decode_to_string_jquerycat                   523,681 (2649 MB/s)     565,655 (2452 MB/s)                41,974    8.02%   x 0.93 
 bench_decode_to_string_ru                          1,063,606 (590 MB/s)    1,020,616 (615 MB/s)              -42,990   -4.04%   x 1.04 
 bench_decode_to_utf16_ja_iso_2022_jp               1,592,987 (154 MB/s)    1,840,083 (134 MB/s)              247,096   15.51%   x 0.87 
 bench_decode_to_utf16_jquery                       38,801 (2234 MB/s)      44,301 (1957 MB/s)                  5,500   14.17%   x 0.88 
 bench_decode_to_utf8_ar                            609,612 (516 MB/s)      583,191 (539 MB/s)                -26,421   -4.33%   x 1.05 
 bench_decode_to_utf8_el                            403,713 (600 MB/s)      386,847 (626 MB/s)                -16,866   -4.18%   x 1.04 
 bench_decode_to_utf8_he                            514,750 (510 MB/s)      493,475 (532 MB/s)                -21,275   -4.13%   x 1.04 
 bench_decode_to_utf8_jquery                        43,646 (1986 MB/s)      46,551 (1862 MB/s)                  2,905    6.66%   x 0.94 
 bench_decode_to_utf8_ru                            1,182,731 (531 MB/s)    1,136,625 (552 MB/s)              -46,106   -3.90%   x 1.04 
 bench_decode_to_utf8_zh_tw                         401,191 (779 MB/s)      388,837 (803 MB/s)                -12,354   -3.08%   x 1.03 
 bench_encode_from_utf16_ar                         110,713 (208 MB/s)      107,325 (214 MB/s)                 -3,388   -3.06%   x 1.03 
 bench_encode_from_utf16_cs                         199,214 (243 MB/s)      206,366 (234 MB/s)                  7,152    3.59%   x 0.97 
 bench_encode_from_utf16_en                         53,005 (1220 MB/s)      55,295 (1170 MB/s)                  2,290    4.32%   x 0.96 
 bench_encode_from_utf16_ja_euc_jp                  8,269,645 (2 MB/s)      9,796,333 (2 MB/s)              1,526,688   18.46%   x 0.84 
 bench_encode_from_utf16_ja_iso_2022_jp             8,811,931 (2 MB/s)      10,315,491 (2 MB/s)             1,503,560   17.06%   x 0.85 
 bench_encode_from_utf16_ja_shift_jis               8,964,035 (2 MB/s)      10,420,600 (2 MB/s)             1,456,565   16.25%   x 0.86 
 bench_encode_from_utf16_jquery                     53,468 (1621 MB/s)      46,969 (1846 MB/s)                 -6,499  -12.15%   x 1.14 
 bench_encode_from_utf16_th_windows_874             813,704 (81 MB/s)       869,270 (76 MB/s)                  55,566    6.83%   x 0.94 
 bench_encode_from_utf16_zh_cn_gb18030              25,062,645              29,195,487                      4,132,842   16.49%   x 0.86 
 bench_encode_from_utf16_zh_tw_big5                 21,373,295              24,908,525                      3,535,230   16.54%   x 0.86 
 bench_encode_from_utf8_ar                          1,983 (11620 MB/s)      2,438 (9452 MB/s)                     455   22.95%   x 0.81 
 bench_encode_from_utf8_cs                          6,984 (6935 MB/s)       5,851 (8278 MB/s)                  -1,133  -16.22%   x 1.19 
 bench_encode_from_utf8_el                          2,839 (11424 MB/s)      3,498 (9272 MB/s)                     659   23.21%   x 0.81 
 bench_encode_from_utf8_en                          9,385 (6895 MB/s)       8,687 (7449 MB/s)                    -698   -7.44%   x 1.08 
 bench_encode_from_utf8_fr                          20,567 (6945 MB/s)      19,933 (7166 MB/s)                   -634   -3.08%   x 1.03 
 bench_encode_from_utf8_ja                          3,038 (10754 MB/s)      4,280 (7633 MB/s)                   1,242   40.88%   x 0.71 
 bench_encode_from_utf8_ja_euc_jp                   8,298,254 (2 MB/s)      9,817,691 (2 MB/s)              1,519,437   18.31%   x 0.85 
 bench_encode_from_utf8_ja_iso_2022_jp              9,551,929 (2 MB/s)      10,448,937 (2 MB/s)               897,008    9.39%   x 0.91 
 bench_encode_from_utf8_ja_shift_jis                8,981,893 (2 MB/s)      10,428,791 (2 MB/s)             1,446,898   16.11%   x 0.86 
 bench_encode_from_utf8_jquery                      12,487 (6943 MB/s)      13,172 (6582 MB/s)                    685    5.49%   x 0.95 
 bench_encode_from_utf8_ko                          1,639 (12179 MB/s)      2,252 (8864 MB/s)                     613   37.40%   x 0.73 
 bench_encode_from_utf8_ru                          17,514 (6679 MB/s)      16,555 (7066 MB/s)                   -959   -5.48%   x 1.06 
 bench_encode_from_utf8_th_windows_874              984,616 (67 MB/s)       1,016,300 (65 MB/s)                31,684    3.22%   x 0.97 
 bench_encode_from_utf8_tr                          8,371 (6699 MB/s)       7,414 (7564 MB/s)                    -957  -11.43%   x 1.13 
 bench_encode_from_utf8_vi                          12,106 (7164 MB/s)      12,602 (6882 MB/s)                    496    4.10%   x 0.96 
 bench_encode_from_utf8_zh_cn                       2,673 (11554 MB/s)      3,194 (9670 MB/s)                     521   19.49%   x 0.84 
 bench_encode_from_utf8_zh_cn_gb18030               25,101,808              29,192,154                      4,090,346   16.30%   x 0.86 
 bench_encode_from_utf8_zh_tw                       2,725 (11357 MB/s)      3,356 (9221 MB/s)                     631   23.16%   x 0.81 
 bench_encode_from_utf8_zh_tw_big5                  21,374,658              24,891,266                      3,516,608   16.45%   x 0.86 
 bench_encode_to_vec_ja_euc_jp                      8,299,658 (2 MB/s)      9,817,745 (2 MB/s)              1,518,087   18.29%   x 0.85 
 bench_encode_to_vec_ja_iso_2022_jp                 9,553,366 (2 MB/s)      10,448,916 (2 MB/s)               895,550    9.37%   x 0.91 
 bench_encode_to_vec_ja_shift_jis                   8,983,749 (2 MB/s)      10,431,337 (2 MB/s)             1,447,588   16.11%   x 0.86 
 bench_encode_to_vec_th_windows_874                 985,458 (67 MB/s)       1,016,600 (65 MB/s)                31,142    3.16%   x 0.97 
 bench_encode_to_vec_zh_cn_gb18030                  25,103,900              29,198,329                      4,094,429   16.31%   x 0.86 
 bench_encode_to_vec_zh_tw_big5                     21,371,116              24,899,750                      3,528,634   16.51%   x 0.86 
 bench_label_rs_cseucpkdfmtjapanesx                 119                     130                                    11    9.24%   x 0.92 
 bench_label_rs_utf_8                               87                      80                                     -7   -8.05%   x 1.09 
 bench_label_rs_utf_8_upper                         83                      80                                     -3   -3.61%   x 1.04 
 bench_label_rs_xseucpkdfmtjapanese                 222                     235                                    13    5.86%   x 0.94 
 bench_mem_check_utf16_for_latin1_and_bidi_ja_1000  1,202,808 (831 MB/s)    715,689 (1397 MB/s)              -487,119  -40.50%   x 1.68 
 bench_mem_check_utf16_for_latin1_and_bidi_ru_1000  439,641 (2274 MB/s)     353,064 (2832 MB/s)               -86,577  -19.69%   x 1.25 
 bench_mem_check_utf16_for_latin1_and_bidi_th_1000  945,079 (1058 MB/s)     498,152 (2007 MB/s)              -446,927  -47.29%   x 1.90 
 bench_mem_convert_latin1_to_utf8_1000              610,233 (819 MB/s)      639,258 (782 MB/s)                 29,025    4.76%   x 0.95 
 bench_mem_convert_utf16_to_str_bmp_1000            4,311,470 (231 MB/s)    4,125,145 (242 MB/s)             -186,325   -4.32%   x 1.05 
 bench_mem_convert_utf16_to_utf8_bmp_1000           4,323,800 (231 MB/s)    4,072,866 (245 MB/s)             -250,934   -5.80%   x 1.06 
 bench_mem_copy_ascii_to_ascii_1000                 126,927 (3939 MB/s)     120,232 (4158 MB/s)                -6,695   -5.27%   x 1.06 
 bench_mem_is_ascii_false_1000                      7,284 (68643 MB/s)      7,039 (71032 MB/s)                   -245   -3.36%   x 1.03 
 bench_mem_is_ascii_true_1000                       76,802 (6510 MB/s)      70,546 (7087 MB/s)                 -6,256   -8.15%   x 1.09 
 bench_mem_is_basic_latin_false_1000                7,213 (138638 MB/s)     6,409 (156030 MB/s)                  -804  -11.15%   x 1.13 
 bench_mem_is_str_bidi_de_1000                      369,654 (1352 MB/s)     383,643 (1303 MB/s)                13,989    3.78%   x 0.96 
 bench_mem_is_str_bidi_ru_1000                      2,017,245 (247 MB/s)    2,107,116 (237 MB/s)               89,871    4.46%   x 0.96 
 bench_mem_is_str_latin1_false_1000                 3,278 (152684 MB/s)     3,529 (141824 MB/s)                   251    7.66%   x 0.93 
 bench_mem_is_utf16_bidi_ja_1000                    1,186,595 (842 MB/s)    716,270 (1396 MB/s)              -470,325  -39.64%   x 1.66 
 bench_mem_is_utf16_bidi_ru_1000                    438,825 (2278 MB/s)     374,750 (2668 MB/s)               -64,075  -14.60%   x 1.17 
 bench_mem_is_utf16_bidi_th_1000                    945,095 (1058 MB/s)     498,152 (2007 MB/s)              -446,943  -47.29%   x 1.90 
 bench_mem_is_utf8_bidi_de_1000                     399,627 (1251 MB/s)     368,808 (1355 MB/s)               -30,819   -7.71%   x 1.08 
 bench_mem_is_utf8_bidi_th_1000                     2,164,933 (230 MB/s)    1,677,325 (298 MB/s)             -487,608  -22.52%   x 1.29 
 bench_mem_is_utf8_latin1_false_1000                5,036 (99285 MB/s)      5,538 (90285 MB/s)                    502    9.97%   x 0.91 
gnzlbg commented 5 years ago

Do you see a pattern that helps you call one clearly better than the other?

Not really. It seems, however, that the current 0.3.3 version and the PR aren't consistently worse than the simd crate, just different. Why? Slightly different IR being generated, different versions of the LLVM backend, different compilers, etc.

If you want to squeeze the maximum out of this, the best thing you could do is, e.g., take the best results of the 0.3.3 branch and the best results of the simd crate, and merge them into a single benchmark file. Then take my PR branch and try to modify it so that it is either better, or has only tiny regressions, on your armv7+neon target, e.g., using bench-cmp to visualize the results.

Modifying the generated code is not hard. See the diff in the PR: there is a macro that lets you pattern-match on the mask vector types and implement the all and any operations however you see fit. I am on Discord and Zulip and can help you try to improve the performance on these targets, but without the ability to benchmark this is not something I can do myself (I'm basically making changes blindly, based on gut feeling and LLVM MCA, which does not appear to be very accurate).
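As a scalar model of what the old simd crate's codegen does (this is an illustration, not code from either crate): each `vpmax.u8` takes the pairwise max of adjacent lanes, halving the lane count, and after two folds the four surviving lanes fit in one 32-bit `vmov` for a single scalar compare.

```rust
/// Scalar sketch of the vpmax.u8-based `any` reduction for a 16-lane
/// byte mask, where a lane is 0xFF (true) or 0x00 (false).
pub fn any_u8x16(mask: [u8; 16]) -> bool {
    // Model of vpmax.u8: pairwise max of adjacent lanes, halving the count.
    let fold = |v: &[u8]| -> Vec<u8> {
        v.chunks(2).map(|p| p[0].max(p[1])).collect()
    };
    let eight = fold(&mask); // first vpmax: 16 -> 8 lanes
    let four = fold(&eight); // second vpmax: 8 -> 4 lanes
    // Model of vmov.32 + cmp: read the 4 surviving lanes as one word
    // and test it against zero.
    u32::from_ne_bytes([four[0], four[1], four[2], four[3]]) != 0
}
```

The matching `all` reduction would fold with pairwise min (`vpmin.u8`) instead and compare the final word against `0xFFFF_FFFF`.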

hsivonen commented 5 years ago

Thanks.

I had hoped that two vpmax operations followed by extraction to one ALU register would have been either clearly better or clearly worse than one vpmax operation followed by a simultaneous extraction to two ALU registers.

Now that it's a mix, and considering other schedule constraints, I think I shouldn't go down the rathole of trying to see if there's a way to somehow get the best of both options. (That seems like a relatively open-ended exploration, with the likely outcome being that there isn't a way for the library to do things such that it always works out optimally for the callers.)

So let's keep packed_simd in its current state, which is also supported by LLVM MCA (one vpmax followed by a simultaneous extraction to two ALU registers). I consider the initial report (four vpmax operations being inefficient) now fixed.

Thank you for the help, and sorry that my focus on a simple case, bench_mem_copy_ascii_to_ascii_1000, didn't generalize and caused work that probably won't land.

gnzlbg commented 5 years ago

If you are able to extract minimal examples that should generate optimal code but don't, we can file LLVM bugs. For that, though, we need to know what the optimal code looks like.
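For example, a minimal reproducer might look like the following (a dependency-free sketch, not code from encoding_rs; the function name and the exact invocation are illustrative). The autovectorizer turns this into a NEON mask-and-reduce, so the emitted reduction can be inspected in isolation.

```rust
// Compile for inspection with something like:
//   rustc -O --crate-type lib --emit asm \
//       --target thumbv7neon-unknown-linux-gnueabihf repro.rs
#[inline(never)]
pub fn is_ascii16(bytes: &[u8; 16]) -> bool {
    // High bit clear in every lane <=> the whole block is ASCII;
    // this is an `all`-style horizontal boolean reduction.
    bytes.iter().all(|&b| b < 0x80)
}
```

Diffing the assembly of such a function against the hand-expected sequence is usually enough context for an LLVM codegen bug report.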