jrmuizel closed this issue 5 years ago
Discovered so far: the presence of Thumb trampolines is the very first thing that stands out in the assembly. I need to go back and run the simd crate baseline in Thumb mode.
From which types (integers, floats, etc.) are the masks created?
From u8x16 and u16x8.
Unrelated update: some stdsimd refactorings have landed in nightly, and packed_simd should start to build properly again soon.
It seems to be in the present nightly already. Thanks!
The presence of thumb trampolines is the very first thing that stands out in the assembly.
Once we get past the trampolines on the crate boundary, inlining from core::arch and packed_simd appears to have worked.
Thumb-to-Thumb comparison still shows a regression.
With the simd crate, building encoding_rs with --release and --emit asm emits one .s file. With packed_simd, 31 .rcgu.s files are emitted. https://doc.rust-lang.org/rustc/codegen-options/index.html suggests that multiple codegen-units can lead to slower code. RUSTFLAGS='-C codegen-units=1' does not appear to change things.
@hsivonen can you file a rust-lang/rust issue about the multiple codegen-units issue? cc @mw
can you file a rust-lang/rust issue about the multiple codegen-units issue?
encoding_rs::mem::copy_ascii_to_ascii regresses significantly. To start with, the inlining situation differs. With manual always/never choices, the results are counter-intuitive (never is faster than always with simd), but simd is still faster:
simd, inline(never) test bench_mem_copy_ascii_to_ascii_1000 ... bench: 120,045 ns/iter (+/- 686) = 4165 MB/s
simd, inline(always) test bench_mem_copy_ascii_to_ascii_1000 ... bench: 129,785 ns/iter (+/- 5,024) = 3852 MB/s
packed_simd, inline(never) test bench_mem_copy_ascii_to_ascii_1000 ... bench: 164,637 ns/iter (+/- 3,623) = 3036 MB/s
packed_simd, inline(always) test bench_mem_copy_ascii_to_ascii_1000 ... bench: 160,739 ns/iter (+/- 9,820) = 3110 MB/s
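As a rough check on the figures above: the reported MB/s values are consistent with each iteration processing 500,000 bytes. That byte count is inferred from the numbers and is not stated in the thread. A minimal sketch of the arithmetic, with function names made up for illustration:

```rust
/// Throughput in MB/s, assuming 1 MB = 1,000,000 bytes.
/// bytes / ns is GB/s, so multiplying by 1000 gives MB/s. Integer division
/// truncates, which matches the rounding seen in the results above.
fn throughput_mb_per_s(bytes_per_iter: u64, ns_per_iter: u64) -> u64 {
    bytes_per_iter * 1000 / ns_per_iter
}

/// Extra time per iteration of `candidate_ns` relative to `baseline_ns`,
/// expressed in percent.
fn slowdown_percent(baseline_ns: u64, candidate_ns: u64) -> f64 {
    (candidate_ns as f64 / baseline_ns as f64 - 1.0) * 100.0
}
```

With the inline(never) numbers, slowdown_percent(120_045, 164_637) comes out at roughly 37, i.e. packed_simd needs about 37% more time per iteration than simd in this benchmark.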
For the never cases, here's the assembly from objdump of the benching binary.
simd:
0006c160 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE>:
6c160: b580 push {r7, lr}
6c162: 428b cmp r3, r1
6c164: d36a bcc.n 6c23c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xdc>
6c166: 468e mov lr, r1
6c168: 2910 cmp r1, #16
6c16a: d31b bcc.n 6c1a4 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x44>
6c16c: f002 030f and.w r3, r2, #15
6c170: f1ae 0c10 sub.w ip, lr, #16
6c174: 0701 lsls r1, r0, #28
6c176: d01a beq.n 6c1ae <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x4e>
6c178: b373 cbz r3, 6c1d8 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x78>
6c17a: 2300 movs r3, #0
6c17c: 18c1 adds r1, r0, r3
6c17e: f921 0a0f vld1.8 {d0-d1}, [r1]
6c182: ef89 2050 vshr.s8 q1, q0, #7
6c186: ff02 2a03 vpmax.u8 d2, d2, d3
6c18a: ff02 2a00 vpmax.u8 d2, d2, d0
6c18e: ee12 1b10 vmov.32 r1, d2[0]
6c192: 2900 cmp r1, #0
6c194: d14a bne.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c196: 18d1 adds r1, r2, r3
6c198: 3310 adds r3, #16
6c19a: 4563 cmp r3, ip
6c19c: f901 0a0f vst1.8 {d0-d1}, [r1]
6c1a0: d9ec bls.n 6c17c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x1c>
6c1a2: e043 b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c1a4: 2300 movs r3, #0
6c1a6: 4573 cmp r3, lr
6c1a8: d342 bcc.n 6c230 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xd0>
6c1aa: 4670 mov r0, lr
6c1ac: bd80 pop {r7, pc}
6c1ae: b33b cbz r3, 6c200 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xa0>
6c1b0: 2300 movs r3, #0
6c1b2: 18c1 adds r1, r0, r3
6c1b4: f921 0acf vld1.64 {d0-d1}, [r1]
6c1b8: ef89 2050 vshr.s8 q1, q0, #7
6c1bc: ff02 2a03 vpmax.u8 d2, d2, d3
6c1c0: ff02 2a00 vpmax.u8 d2, d2, d0
6c1c4: ee12 1b10 vmov.32 r1, d2[0]
6c1c8: bb81 cbnz r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c1ca: 18d1 adds r1, r2, r3
6c1cc: 3310 adds r3, #16
6c1ce: 4563 cmp r3, ip
6c1d0: f901 0a0f vst1.8 {d0-d1}, [r1]
6c1d4: d9ed bls.n 6c1b2 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x52>
6c1d6: e029 b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c1d8: 2300 movs r3, #0
6c1da: 18c1 adds r1, r0, r3
6c1dc: f921 0a0f vld1.8 {d0-d1}, [r1]
6c1e0: ef89 2050 vshr.s8 q1, q0, #7
6c1e4: ff02 2a03 vpmax.u8 d2, d2, d3
6c1e8: ff02 2a00 vpmax.u8 d2, d2, d0
6c1ec: ee12 1b10 vmov.32 r1, d2[0]
6c1f0: b9e1 cbnz r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c1f2: 18d1 adds r1, r2, r3
6c1f4: 3310 adds r3, #16
6c1f6: 4563 cmp r3, ip
6c1f8: f901 0acf vst1.64 {d0-d1}, [r1]
6c1fc: d9ed bls.n 6c1da <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x7a>
6c1fe: e015 b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c200: 2300 movs r3, #0
6c202: 18c1 adds r1, r0, r3
6c204: f921 0acf vld1.64 {d0-d1}, [r1]
6c208: ef89 2050 vshr.s8 q1, q0, #7
6c20c: ff02 2a03 vpmax.u8 d2, d2, d3
6c210: ff02 2a00 vpmax.u8 d2, d2, d0
6c214: ee12 1b10 vmov.32 r1, d2[0]
6c218: b941 cbnz r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c21a: 18d1 adds r1, r2, r3
6c21c: 3310 adds r3, #16
6c21e: 4563 cmp r3, ip
6c220: f901 0acf vst1.64 {d0-d1}, [r1]
6c224: d9ed bls.n 6c202 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xa2>
6c226: e001 b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c228: 54d1 strb r1, [r2, r3]
6c22a: 3301 adds r3, #1
6c22c: 4573 cmp r3, lr
6c22e: d2bc bcs.n 6c1aa <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x4a>
6c230: 56c1 ldrsb r1, [r0, r3]
6c232: 2900 cmp r1, #0
6c234: daf8 bge.n 6c228 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xc8>
6c236: 469e mov lr, r3
6c238: 4670 mov r0, lr
6c23a: bd80 pop {r7, pc}
6c23c: 4803 ldr r0, [pc, #12] ; (6c24c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xec>)
6c23e: 2130 movs r1, #48 ; 0x30
6c240: 4a03 ldr r2, [pc, #12] ; (6c250 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xf0>)
6c242: 4478 add r0, pc
6c244: 447a add r2, pc
6c246: f7ff fefb bl 6c040 <_ZN3std9panicking11begin_panic17hb6db914fa10d35c1E>
6c24a: defe udf #254 ; 0xfe
6c24c: 009c918c .word 0x009c918c
6c250: 009f3988 .word 0x009f3988
packed_simd:
00056314 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E>:
56314: b570 push {r4, r5, r6, lr}
56316: 428b cmp r3, r1
56318: f0c0 8082 bcc.w 56420 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x10c>
5631c: 468e mov lr, r1
5631e: 2910 cmp r1, #16
56320: d320 bcc.n 56364 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x50>
56322: f002 030f and.w r3, r2, #15
56326: f1ae 0c10 sub.w ip, lr, #16
5632a: 0701 lsls r1, r0, #28
5632c: d01f beq.n 5636e <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x5a>
5632e: b3cb cbz r3, 563a4 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x90>
56330: 2300 movs r3, #0
56332: 18c1 adds r1, r0, r3
56334: f961 0a0f vld1.8 {d16-d17}, [r1]
56338: efc9 2070 vshr.s8 q9, q8, #7
5633c: ee33 1b90 vmov.32 r1, d19[1]
56340: ee32 4b90 vmov.32 r4, d18[1]
56344: ee13 5b90 vmov.32 r5, d19[0]
56348: ee12 6b90 vmov.32 r6, d18[0]
5634c: 4321 orrs r1, r4
5634e: ea46 0405 orr.w r4, r6, r5
56352: 4321 orrs r1, r4
56354: d15c bne.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
56356: 18d1 adds r1, r2, r3
56358: 3310 adds r3, #16
5635a: 4563 cmp r3, ip
5635c: f941 0a0f vst1.8 {d16-d17}, [r1]
56360: d9e7 bls.n 56332 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x1e>
56362: e055 b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
56364: 2300 movs r3, #0
56366: 4573 cmp r3, lr
56368: d354 bcc.n 56414 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x100>
5636a: 4670 mov r0, lr
5636c: bd70 pop {r4, r5, r6, pc}
5636e: b39b cbz r3, 563d8 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xc4>
56370: 2300 movs r3, #0
56372: 18c1 adds r1, r0, r3
56374: f961 0acf vld1.64 {d16-d17}, [r1]
56378: efc9 2070 vshr.s8 q9, q8, #7
5637c: ee33 1b90 vmov.32 r1, d19[1]
56380: ee32 4b90 vmov.32 r4, d18[1]
56384: ee13 5b90 vmov.32 r5, d19[0]
56388: ee12 6b90 vmov.32 r6, d18[0]
5638c: 4321 orrs r1, r4
5638e: ea46 0405 orr.w r4, r6, r5
56392: 4321 orrs r1, r4
56394: d13c bne.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
56396: 18d1 adds r1, r2, r3
56398: 3310 adds r3, #16
5639a: 4563 cmp r3, ip
5639c: f941 0a0f vst1.8 {d16-d17}, [r1]
563a0: d9e7 bls.n 56372 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x5e>
563a2: e035 b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
563a4: 2300 movs r3, #0
563a6: 18c1 adds r1, r0, r3
563a8: f961 0a0f vld1.8 {d16-d17}, [r1]
563ac: efc9 2070 vshr.s8 q9, q8, #7
563b0: ee33 1b90 vmov.32 r1, d19[1]
563b4: ee32 4b90 vmov.32 r4, d18[1]
563b8: ee13 5b90 vmov.32 r5, d19[0]
563bc: ee12 6b90 vmov.32 r6, d18[0]
563c0: 4321 orrs r1, r4
563c2: ea46 0405 orr.w r4, r6, r5
563c6: 4321 orrs r1, r4
563c8: d122 bne.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
563ca: 18d1 adds r1, r2, r3
563cc: 3310 adds r3, #16
563ce: 4563 cmp r3, ip
563d0: f941 0acf vst1.64 {d16-d17}, [r1]
563d4: d9e7 bls.n 563a6 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x92>
563d6: e01b b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
563d8: 2300 movs r3, #0
563da: 18c1 adds r1, r0, r3
563dc: f961 0acf vld1.64 {d16-d17}, [r1]
563e0: efc9 2070 vshr.s8 q9, q8, #7
563e4: ee33 1b90 vmov.32 r1, d19[1]
563e8: ee32 4b90 vmov.32 r4, d18[1]
563ec: ee13 5b90 vmov.32 r5, d19[0]
563f0: ee12 6b90 vmov.32 r6, d18[0]
563f4: 4321 orrs r1, r4
563f6: ea46 0405 orr.w r4, r6, r5
563fa: 4321 orrs r1, r4
563fc: d108 bne.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
563fe: 18d1 adds r1, r2, r3
56400: 3310 adds r3, #16
56402: 4563 cmp r3, ip
56404: f941 0acf vst1.64 {d16-d17}, [r1]
56408: d9e7 bls.n 563da <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xc6>
5640a: e001 b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
5640c: 54d1 strb r1, [r2, r3]
5640e: 3301 adds r3, #1
56410: 4573 cmp r3, lr
56412: d2aa bcs.n 5636a <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x56>
56414: 56c1 ldrsb r1, [r0, r3]
56416: 2900 cmp r1, #0
56418: daf8 bge.n 5640c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xf8>
5641a: 469e mov lr, r3
5641c: 4670 mov r0, lr
5641e: bd70 pop {r4, r5, r6, pc}
56420: 4803 ldr r0, [pc, #12] ; (56430 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x11c>)
56422: 2130 movs r1, #48 ; 0x30
56424: 4a03 ldr r2, [pc, #12] ; (56434 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x120>)
56426: 4478 add r0, pc
56428: 447a add r2, pc
5642a: f000 f9bb bl 567a4 <_ZN3std9panicking11begin_panic17hd61fceca69156f6cE>
5642e: defe udf #254 ; 0xfe
56430: 009a3b71 .word 0x009a3b71
56434: 009ebd14 .word 0x009ebd14
The very first observation: packed_simd generates more instructions.
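For orientation when reading the two listings, here is a scalar model of what the function appears to compute: check that the destination is at least as long as the source, copy 16 bytes at a time while a block is entirely ASCII, then finish byte by byte and return the number of bytes copied. This is a sketch reconstructed from the disassembly and the documented behavior of copy_ascii_to_ascii, not the encoding_rs source; the name copy_ascii_to_ascii_model is hypothetical. The four near-identical vector loops in each listing seem to differ only in whether vld1.8/vst1.8 or vld1.64/vst1.64 is used, chosen by the alignment checks on the two pointers at the top of the function, so the model collapses them into one loop.

```rust
/// Scalar reference model (NOT the encoding_rs implementation): copies bytes
/// from `src` to `dst` up to the first non-ASCII byte and returns the number
/// of bytes copied.
#[inline(never)] // swap for #[inline(always)] to mirror the benchmark variants
pub fn copy_ascii_to_ascii_model(src: &[u8], dst: &mut [u8]) -> usize {
    // Corresponds to the length comparison and panic path in the listings.
    assert!(
        dst.len() >= src.len(),
        "destination must not be shorter than the source"
    );
    let mut i = 0;
    // Stride loop: one 16-byte block per iteration while a full block remains.
    while i + 16 <= src.len() {
        let block = &src[i..i + 16];
        if block.iter().any(|&b| b >= 0x80) {
            break; // the tail loop below finds the exact position
        }
        dst[i..i + 16].copy_from_slice(block);
        i += 16;
    }
    // Tail loop: byte by byte, stopping at the first non-ASCII byte.
    while i < src.len() {
        let b = src[i];
        if b >= 0x80 {
            break;
        }
        dst[i] = b;
        i += 1;
    }
    i
}
```

The only part of the stride loop that differs between the two listings is the "is any byte non-ASCII" test on each block.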
OK, so the horizontal reductions generate worse code under packed_simd.
simd:
6c17e: f921 0a0f vld1.8 {d0-d1}, [r1]
6c182: ef89 2050 vshr.s8 q1, q0, #7
6c186: ff02 2a03 vpmax.u8 d2, d2, d3
6c18a: ff02 2a00 vpmax.u8 d2, d2, d0
6c18e: ee12 1b10 vmov.32 r1, d2[0]
6c192: 2900 cmp r1, #0
packed_simd:
56334: f961 0a0f vld1.8 {d16-d17}, [r1]
56338: efc9 2070 vshr.s8 q9, q8, #7
5633c: ee33 1b90 vmov.32 r1, d19[1]
56340: ee32 4b90 vmov.32 r4, d18[1]
56344: ee13 5b90 vmov.32 r5, d19[0]
56348: ee12 6b90 vmov.32 r6, d18[0]
5634c: 4321 orrs r1, r4
5634e: ea46 0405 orr.w r4, r6, r5
56352: 4321 orrs r1, r4
Filed as a packed_simd issue.
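To make the difference between the two sequences concrete, here is a portable scalar model of both reductions. It is a sketch for illustration, not NEON intrinsics and not code from either crate, and all function names are hypothetical. Both variants start from the same mask (vshr.s8 by 7). The simd sequence then folds it with two pairwise maxima and a single 32-bit move to a general-purpose register, while the packed_simd sequence moves all four 32-bit lanes out and ORs them there.

```rust
/// Models `vshr.s8 q, q, #7`: the arithmetic shift turns every byte with the
/// high bit set into 0xFF and every ASCII byte into 0x00.
fn high_bit_mask(bytes: [u8; 16]) -> [u8; 16] {
    bytes.map(|b| ((b as i8) >> 7) as u8)
}

/// Models `vpmax.u8 d, a, b`: maxima of adjacent pairs of `a` fill the low
/// four lanes, maxima of adjacent pairs of `b` fill the high four lanes.
fn pairwise_max(a: [u8; 8], b: [u8; 8]) -> [u8; 8] {
    let mut out = [0u8; 8];
    for i in 0..4 {
        out[i] = a[2 * i].max(a[2 * i + 1]);
        out[i + 4] = b[2 * i].max(b[2 * i + 1]);
    }
    out
}

/// simd-style reduction: two pairwise maxima, then one 32-bit lane move.
fn any_non_ascii_pairwise(bytes: [u8; 16]) -> bool {
    let m = high_bit_mask(bytes);
    let mut lo = [0u8; 8];
    let mut hi = [0u8; 8];
    lo.copy_from_slice(&m[..8]);
    hi.copy_from_slice(&m[8..]);
    let step1 = pairwise_max(lo, hi);
    // Only the low half of the second result is read afterwards, so the
    // second operand is a don't-care (the listing happens to pass d0).
    let step2 = pairwise_max(step1, [0u8; 8]);
    u32::from_le_bytes([step2[0], step2[1], step2[2], step2[3]]) != 0
}

/// packed_simd-style reduction: four 32-bit lane moves ORed together.
fn any_non_ascii_lane_or(bytes: [u8; 16]) -> bool {
    let m = high_bit_mask(bytes);
    let mut acc = 0u32;
    for lane in m.chunks_exact(4) {
        acc |= u32::from_le_bytes([lane[0], lane[1], lane[2], lane[3]]);
    }
    acc != 0
}
```

Both functions return the same answer for every input. The difference in the listings is the number of transfers from NEON registers to general-purpose registers per 16-byte block: one in the simd sequence versus four in the packed_simd sequence.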
Are you using the exact same rustc version for the comparisons?
Are you using the exact same rustc version for the comparisons?
No, because there isn't a single Rust version that both 1) compiles simd and 2) has a NEON-enabled stdlib.
This is now fixed. Thank you for your help and patience.
stdsimd is the replacement for simd