hsivonen / encoding_rs

A Gecko-oriented implementation of the Encoding Standard in Rust
https://docs.rs/encoding_rs/

Switch from simd to packed_simd #23

Closed jrmuizel closed 5 years ago

jrmuizel commented 6 years ago

stdsimd is the replacement for simd

hsivonen commented 5 years ago

Discovered so far: The presence of thumb trampolines is the very first thing that stands out in the assembly. I need to go back and run the simd crate baseline in thumb mode.

From which types (integers, floats, etc.) are the masks created?

From u8x16 and u16x8.

Unrelated update, some stdsimd refactorings have landed in nightly, and packed_simd should start to build again properly soon.

It seems to be in the present nightly already. Thanks!

hsivonen commented 5 years ago

The presence of thumb trampolines is the very first thing that stands out in the assembly.

Once we get past the trampolines on the crate boundary, inlining from core::arch and packed_simd appears to have worked.

hsivonen commented 5 years ago

Thumb-to-Thumb comparison still shows a regression.

hsivonen commented 5 years ago

With the simd crate, building encoding_rs with --release and --emit asm emits one .s file. With packed_simd, 31 .rcgu.s files are emitted. https://doc.rust-lang.org/rustc/codegen-options/index.html suggests that multiple codegen units can lead to slower code. However, setting RUSTFLAGS='-C codegen-units=1' does not appear to change the result.

gnzlbg commented 5 years ago

@hsivonen can you file a rust-lang/rust issue about the multiple codegen-units problem? cc @mw

hsivonen commented 5 years ago

can you file a rust-lang/rust issue about the multiple codegen-units problem?

Filed

hsivonen commented 5 years ago

encoding_rs::mem::copy_ascii_to_ascii regresses significantly. To start with, the inlining situation differs. With manual inline(always)/inline(never) choices, the results are counter-intuitive (with simd, inline(never) is faster than inline(always)), but simd is still faster than packed_simd in both cases:

simd, inline(never):
test bench_mem_copy_ascii_to_ascii_1000 ... bench: 120,045 ns/iter (+/- 686) = 4165 MB/s

simd, inline(always):
test bench_mem_copy_ascii_to_ascii_1000 ... bench: 129,785 ns/iter (+/- 5,024) = 3852 MB/s

packed_simd, inline(never):
test bench_mem_copy_ascii_to_ascii_1000 ... bench: 164,637 ns/iter (+/- 3,623) = 3036 MB/s

packed_simd, inline(always):
test bench_mem_copy_ascii_to_ascii_1000 ... bench: 160,739 ns/iter (+/- 9,820) = 3110 MB/s
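
For readers who do not have the function in their head, here is a minimal scalar sketch of the contract being benchmarked: copy bytes until the first non-ASCII byte, return how many were copied, and panic if the destination is shorter than the source. The name `copy_ascii_to_ascii_scalar` is hypothetical, and this is not the SIMD implementation in encoding_rs. It only illustrates the behavior and where an inlining attribute such as `#[inline(never)]` would be placed.

```rust
/// Scalar model of an ASCII-to-ASCII copy (hypothetical name, not the crate's code).
///
/// Copies bytes from `src` to `dst` up to the first byte with the high bit set
/// and returns the number of bytes copied.
#[inline(never)] // the "never" case; swap for #[inline(always)] to model the other case
pub fn copy_ascii_to_ascii_scalar(src: &[u8], dst: &mut [u8]) -> usize {
    // The destination has to be able to hold the whole source.
    assert!(
        dst.len() >= src.len(),
        "Destination must not be shorter than the source."
    );
    for (i, &byte) in src.iter().enumerate() {
        if byte >= 0x80 {
            // First non-ASCII byte: stop and report how far we got.
            return i;
        }
        dst[i] = byte;
    }
    src.len()
}
```

The SIMD versions discussed in this thread do the same check 16 bytes at a time and fall back to a byte-by-byte loop like this one for the tail.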

For the never cases, here's the assembly from objdump of the benching binary.

simd:

0006c160 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE>:
   6c160:   b580        push    {r7, lr}
   6c162:   428b        cmp r3, r1
   6c164:   d36a        bcc.n   6c23c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xdc>
   6c166:   468e        mov lr, r1
   6c168:   2910        cmp r1, #16
   6c16a:   d31b        bcc.n   6c1a4 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x44>
   6c16c:   f002 030f   and.w   r3, r2, #15
   6c170:   f1ae 0c10   sub.w   ip, lr, #16
   6c174:   0701        lsls    r1, r0, #28
   6c176:   d01a        beq.n   6c1ae <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x4e>
   6c178:   b373        cbz r3, 6c1d8 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x78>
   6c17a:   2300        movs    r3, #0
   6c17c:   18c1        adds    r1, r0, r3
   6c17e:   f921 0a0f   vld1.8  {d0-d1}, [r1]
   6c182:   ef89 2050   vshr.s8 q1, q0, #7
   6c186:   ff02 2a03   vpmax.u8    d2, d2, d3
   6c18a:   ff02 2a00   vpmax.u8    d2, d2, d0
   6c18e:   ee12 1b10   vmov.32 r1, d2[0]
   6c192:   2900        cmp r1, #0
   6c194:   d14a        bne.n   6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c196:   18d1        adds    r1, r2, r3
   6c198:   3310        adds    r3, #16
   6c19a:   4563        cmp r3, ip
   6c19c:   f901 0a0f   vst1.8  {d0-d1}, [r1]
   6c1a0:   d9ec        bls.n   6c17c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x1c>
   6c1a2:   e043        b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c1a4:   2300        movs    r3, #0
   6c1a6:   4573        cmp r3, lr
   6c1a8:   d342        bcc.n   6c230 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xd0>
   6c1aa:   4670        mov r0, lr
   6c1ac:   bd80        pop {r7, pc}
   6c1ae:   b33b        cbz r3, 6c200 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xa0>
   6c1b0:   2300        movs    r3, #0
   6c1b2:   18c1        adds    r1, r0, r3
   6c1b4:   f921 0acf   vld1.64 {d0-d1}, [r1]
   6c1b8:   ef89 2050   vshr.s8 q1, q0, #7
   6c1bc:   ff02 2a03   vpmax.u8    d2, d2, d3
   6c1c0:   ff02 2a00   vpmax.u8    d2, d2, d0
   6c1c4:   ee12 1b10   vmov.32 r1, d2[0]
   6c1c8:   bb81        cbnz    r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c1ca:   18d1        adds    r1, r2, r3
   6c1cc:   3310        adds    r3, #16
   6c1ce:   4563        cmp r3, ip
   6c1d0:   f901 0a0f   vst1.8  {d0-d1}, [r1]
   6c1d4:   d9ed        bls.n   6c1b2 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x52>
   6c1d6:   e029        b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c1d8:   2300        movs    r3, #0
   6c1da:   18c1        adds    r1, r0, r3
   6c1dc:   f921 0a0f   vld1.8  {d0-d1}, [r1]
   6c1e0:   ef89 2050   vshr.s8 q1, q0, #7
   6c1e4:   ff02 2a03   vpmax.u8    d2, d2, d3
   6c1e8:   ff02 2a00   vpmax.u8    d2, d2, d0
   6c1ec:   ee12 1b10   vmov.32 r1, d2[0]
   6c1f0:   b9e1        cbnz    r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c1f2:   18d1        adds    r1, r2, r3
   6c1f4:   3310        adds    r3, #16
   6c1f6:   4563        cmp r3, ip
   6c1f8:   f901 0acf   vst1.64 {d0-d1}, [r1]
   6c1fc:   d9ed        bls.n   6c1da <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x7a>
   6c1fe:   e015        b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c200:   2300        movs    r3, #0
   6c202:   18c1        adds    r1, r0, r3
   6c204:   f921 0acf   vld1.64 {d0-d1}, [r1]
   6c208:   ef89 2050   vshr.s8 q1, q0, #7
   6c20c:   ff02 2a03   vpmax.u8    d2, d2, d3
   6c210:   ff02 2a00   vpmax.u8    d2, d2, d0
   6c214:   ee12 1b10   vmov.32 r1, d2[0]
   6c218:   b941        cbnz    r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c21a:   18d1        adds    r1, r2, r3
   6c21c:   3310        adds    r3, #16
   6c21e:   4563        cmp r3, ip
   6c220:   f901 0acf   vst1.64 {d0-d1}, [r1]
   6c224:   d9ed        bls.n   6c202 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xa2>
   6c226:   e001        b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c228:   54d1        strb    r1, [r2, r3]
   6c22a:   3301        adds    r3, #1
   6c22c:   4573        cmp r3, lr
   6c22e:   d2bc        bcs.n   6c1aa <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x4a>
   6c230:   56c1        ldrsb   r1, [r0, r3]
   6c232:   2900        cmp r1, #0
   6c234:   daf8        bge.n   6c228 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xc8>
   6c236:   469e        mov lr, r3
   6c238:   4670        mov r0, lr
   6c23a:   bd80        pop {r7, pc}
   6c23c:   4803        ldr r0, [pc, #12]   ; (6c24c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xec>)
   6c23e:   2130        movs    r1, #48 ; 0x30
   6c240:   4a03        ldr r2, [pc, #12]   ; (6c250 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xf0>)
   6c242:   4478        add r0, pc
   6c244:   447a        add r2, pc
   6c246:   f7ff fefb   bl  6c040 <_ZN3std9panicking11begin_panic17hb6db914fa10d35c1E>
   6c24a:   defe        udf #254    ; 0xfe
   6c24c:   009c918c    .word   0x009c918c
   6c250:   009f3988    .word   0x009f3988

packed_simd:

00056314 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E>:
   56314:   b570        push    {r4, r5, r6, lr}
   56316:   428b        cmp r3, r1
   56318:   f0c0 8082   bcc.w   56420 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x10c>
   5631c:   468e        mov lr, r1
   5631e:   2910        cmp r1, #16
   56320:   d320        bcc.n   56364 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x50>
   56322:   f002 030f   and.w   r3, r2, #15
   56326:   f1ae 0c10   sub.w   ip, lr, #16
   5632a:   0701        lsls    r1, r0, #28
   5632c:   d01f        beq.n   5636e <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x5a>
   5632e:   b3cb        cbz r3, 563a4 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x90>
   56330:   2300        movs    r3, #0
   56332:   18c1        adds    r1, r0, r3
   56334:   f961 0a0f   vld1.8  {d16-d17}, [r1]
   56338:   efc9 2070   vshr.s8 q9, q8, #7
   5633c:   ee33 1b90   vmov.32 r1, d19[1]
   56340:   ee32 4b90   vmov.32 r4, d18[1]
   56344:   ee13 5b90   vmov.32 r5, d19[0]
   56348:   ee12 6b90   vmov.32 r6, d18[0]
   5634c:   4321        orrs    r1, r4
   5634e:   ea46 0405   orr.w   r4, r6, r5
   56352:   4321        orrs    r1, r4
   56354:   d15c        bne.n   56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   56356:   18d1        adds    r1, r2, r3
   56358:   3310        adds    r3, #16
   5635a:   4563        cmp r3, ip
   5635c:   f941 0a0f   vst1.8  {d16-d17}, [r1]
   56360:   d9e7        bls.n   56332 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x1e>
   56362:   e055        b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   56364:   2300        movs    r3, #0
   56366:   4573        cmp r3, lr
   56368:   d354        bcc.n   56414 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x100>
   5636a:   4670        mov r0, lr
   5636c:   bd70        pop {r4, r5, r6, pc}
   5636e:   b39b        cbz r3, 563d8 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xc4>
   56370:   2300        movs    r3, #0
   56372:   18c1        adds    r1, r0, r3
   56374:   f961 0acf   vld1.64 {d16-d17}, [r1]
   56378:   efc9 2070   vshr.s8 q9, q8, #7
   5637c:   ee33 1b90   vmov.32 r1, d19[1]
   56380:   ee32 4b90   vmov.32 r4, d18[1]
   56384:   ee13 5b90   vmov.32 r5, d19[0]
   56388:   ee12 6b90   vmov.32 r6, d18[0]
   5638c:   4321        orrs    r1, r4
   5638e:   ea46 0405   orr.w   r4, r6, r5
   56392:   4321        orrs    r1, r4
   56394:   d13c        bne.n   56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   56396:   18d1        adds    r1, r2, r3
   56398:   3310        adds    r3, #16
   5639a:   4563        cmp r3, ip
   5639c:   f941 0a0f   vst1.8  {d16-d17}, [r1]
   563a0:   d9e7        bls.n   56372 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x5e>
   563a2:   e035        b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   563a4:   2300        movs    r3, #0
   563a6:   18c1        adds    r1, r0, r3
   563a8:   f961 0a0f   vld1.8  {d16-d17}, [r1]
   563ac:   efc9 2070   vshr.s8 q9, q8, #7
   563b0:   ee33 1b90   vmov.32 r1, d19[1]
   563b4:   ee32 4b90   vmov.32 r4, d18[1]
   563b8:   ee13 5b90   vmov.32 r5, d19[0]
   563bc:   ee12 6b90   vmov.32 r6, d18[0]
   563c0:   4321        orrs    r1, r4
   563c2:   ea46 0405   orr.w   r4, r6, r5
   563c6:   4321        orrs    r1, r4
   563c8:   d122        bne.n   56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   563ca:   18d1        adds    r1, r2, r3
   563cc:   3310        adds    r3, #16
   563ce:   4563        cmp r3, ip
   563d0:   f941 0acf   vst1.64 {d16-d17}, [r1]
   563d4:   d9e7        bls.n   563a6 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x92>
   563d6:   e01b        b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   563d8:   2300        movs    r3, #0
   563da:   18c1        adds    r1, r0, r3
   563dc:   f961 0acf   vld1.64 {d16-d17}, [r1]
   563e0:   efc9 2070   vshr.s8 q9, q8, #7
   563e4:   ee33 1b90   vmov.32 r1, d19[1]
   563e8:   ee32 4b90   vmov.32 r4, d18[1]
   563ec:   ee13 5b90   vmov.32 r5, d19[0]
   563f0:   ee12 6b90   vmov.32 r6, d18[0]
   563f4:   4321        orrs    r1, r4
   563f6:   ea46 0405   orr.w   r4, r6, r5
   563fa:   4321        orrs    r1, r4
   563fc:   d108        bne.n   56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   563fe:   18d1        adds    r1, r2, r3
   56400:   3310        adds    r3, #16
   56402:   4563        cmp r3, ip
   56404:   f941 0acf   vst1.64 {d16-d17}, [r1]
   56408:   d9e7        bls.n   563da <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xc6>
   5640a:   e001        b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   5640c:   54d1        strb    r1, [r2, r3]
   5640e:   3301        adds    r3, #1
   56410:   4573        cmp r3, lr
   56412:   d2aa        bcs.n   5636a <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x56>
   56414:   56c1        ldrsb   r1, [r0, r3]
   56416:   2900        cmp r1, #0
   56418:   daf8        bge.n   5640c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xf8>
   5641a:   469e        mov lr, r3
   5641c:   4670        mov r0, lr
   5641e:   bd70        pop {r4, r5, r6, pc}
   56420:   4803        ldr r0, [pc, #12]   ; (56430 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x11c>)
   56422:   2130        movs    r1, #48 ; 0x30
   56424:   4a03        ldr r2, [pc, #12]   ; (56434 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x120>)
   56426:   4478        add r0, pc
   56428:   447a        add r2, pc
   5642a:   f000 f9bb   bl  567a4 <_ZN3std9panicking11begin_panic17hd61fceca69156f6cE>
   5642e:   defe        udf #254    ; 0xfe
   56430:   009a3b71    .word   0x009a3b71
   56434:   009ebd14    .word   0x009ebd14

The very first observation: packed_simd generates more instructions.

hsivonen commented 5 years ago

OK, so the horizontal reductions generate worse code under packed_simd.

simd:

   6c17e:   f921 0a0f   vld1.8  {d0-d1}, [r1]
   6c182:   ef89 2050   vshr.s8 q1, q0, #7
   6c186:   ff02 2a03   vpmax.u8    d2, d2, d3
   6c18a:   ff02 2a00   vpmax.u8    d2, d2, d0
   6c18e:   ee12 1b10   vmov.32 r1, d2[0]
   6c192:   2900        cmp r1, #0

packed_simd:

   56334:   f961 0a0f   vld1.8  {d16-d17}, [r1]
   56338:   efc9 2070   vshr.s8 q9, q8, #7
   5633c:   ee33 1b90   vmov.32 r1, d19[1]
   56340:   ee32 4b90   vmov.32 r4, d18[1]
   56344:   ee13 5b90   vmov.32 r5, d19[0]
   56348:   ee12 6b90   vmov.32 r6, d18[0]
   5634c:   4321        orrs    r1, r4
   5634e:   ea46 0405   orr.w   r4, r6, r5
   56352:   4321        orrs    r1, r4
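
A scalar model of the two reduction strategies in the excerpts above may make the difference easier to see. This is only a sketch: the function names are hypothetical, plain arrays stand in for NEON registers, and neither function is code from simd or packed_simd. The first function follows the vpmax.u8 sequence (two pairwise-max steps, then one 32-bit move to a general-purpose register). The second follows the sequence of four vmov.32 moves combined with scalar ORs.

```rust
use std::convert::TryInto;

/// Models `vshr.s8 q, q, #7`: each byte becomes 0xFF if its high bit was set, else 0x00.
fn sign_mask(block: [u8; 16]) -> [u8; 16] {
    block.map(|b| ((b as i8) >> 7) as u8)
}

/// Models `vpmax.u8 d, n, m`: the low 4 result bytes are maxima of adjacent
/// pairs of `n`, the high 4 result bytes are maxima of adjacent pairs of `m`.
fn pairwise_max(n: [u8; 8], m: [u8; 8]) -> [u8; 8] {
    let mut out = [0u8; 8];
    for i in 0..4 {
        out[i] = n[2 * i].max(n[2 * i + 1]);
        out[i + 4] = m[2 * i].max(m[2 * i + 1]);
    }
    out
}

/// Reduction in the style of the simd excerpt: two pairwise-max steps fold the
/// 16 mask bytes into the low 4 bytes, which are then read as one u32.
pub fn any_non_ascii_pairwise_max(block: [u8; 16]) -> bool {
    let mask = sign_mask(block);
    let lo: [u8; 8] = mask[..8].try_into().unwrap();
    let hi: [u8; 8] = mask[8..].try_into().unwrap();
    let step1 = pairwise_max(lo, hi);
    // The second operand only fills the upper half of the result, which is
    // never read; the excerpt passes d0 (unshifted input) there.
    let ignored: [u8; 8] = block[..8].try_into().unwrap();
    let step2 = pairwise_max(step1, ignored);
    u32::from_le_bytes(step2[..4].try_into().unwrap()) != 0
}

/// Reduction in the style of the packed_simd excerpt: move all four 32-bit
/// lanes of the mask to scalar registers and OR them together.
pub fn any_non_ascii_lane_or(block: [u8; 16]) -> bool {
    let mask = sign_mask(block);
    let mut acc = 0u32;
    for lane in mask.chunks_exact(4) {
        acc |= u32::from_le_bytes(lane.try_into().unwrap());
    }
    acc != 0
}
```

Both functions answer the same question (does the block contain any non-ASCII byte?). The second needs four register transfers per 16-byte block where the first needs one, which matches the extra instructions visible in the packed_simd loop body.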

hsivonen commented 5 years ago

Filed as a packed_simd issue.

gnzlbg commented 5 years ago

Are you using the exact same rustc version for the comparisons?

hsivonen commented 5 years ago

Are you using the exact same rustc version for the comparisons?

No, because there isn't a single Rust version that both 1) compiles simd and 2) has a NEON-enabled stdlib.

hsivonen commented 5 years ago

This is now fixed. Thank you for your help and patience.