The way the loop ORs the bits for the is-ascii check into a single usize doesn't seem ideal for vectorization. Keeping the lanes independent should yield better autovectorization results.
@the8472 Thanks for the suggestion. I had already started on a simd implementation, but checked again whether the autovectorization could be improved. With the following main loop:
while slice.len() >= N {
    let chunk = &slice[..N];
    let mut is_ascii = true;
    for j in 0..N {
        is_ascii &= chunk[j] <= 127;
    }
    if !is_ascii {
        break;
    }
    for j in 0..N {
        out_slice[j] = MaybeUninit::new(convert(&chunk[j]));
    }
    i += N;
    slice = &slice[N..];
    out_slice = &mut out_slice[N..];
}
The assembly and performance are indeed better, but there is still some weird shuffling and shifting going on. The pshufd/por/psrl sequence below is LLVM OR-reducing the per-lane results into a single byte before testing its sign bit:
0,08 │ 80:┌─→movdqu xmm3,XMMWORD PTR [rbx+rax*1]
0,07 │ │ pshufd xmm4,xmm3,0xee
0,37 │ │ por xmm4,xmm3
6,22 │ │ pshufd xmm5,xmm4,0x55
0,16 │ │ por xmm5,xmm4
     │ │ movdqa xmm4,xmm5
2,02 │ │ psrld xmm4,0x10
4,48 │ │ por xmm4,xmm5
0,07 │ │ movdqa xmm5,xmm4
0,08 │ │ psrlw xmm5,0x8
0,60 │ │ por xmm5,xmm4
6,58 │ │ movd ecx,xmm5
     │ │ test cl,cl
     │ │↓ js ef
     │ │ movdqa xmm4,xmm3
1,80 │ │ paddb xmm4,xmm0
4,70 │ │ movdqa xmm5,xmm4
     │ │ pminub xmm5,xmm1
     │ │ pcmpeqb xmm5,xmm4
0,37 │ │ pand xmm5,xmm2
8,50 │ │ por xmm5,xmm3
     │ │ movdqu XMMWORD PTR [r13+rax*1+0x0],xmm5
     │ │ add rax,0x10
1,94 │ │ add r12,0xfffffffffffffff0
1,05 │ ├──cmp r12,0xf
4,19 │ └──ja 80
The explicit simd version looks a bit better, mostly because the ascii check directly translates to a pmovmskb:
const LANES: usize = 16;
let simd_range_start = Simd::splat(range_start);
let simd_range_end = Simd::splat(range_end);
let simd_xor_value = Simd::splat(xor_value);
while slice.len() >= LANES {
    let chunk = Simd::<u8, LANES>::from_slice(slice);
    let is_ascii = chunk.cast::<i8>().simd_ge(Simd::splat(0));
    if is_ascii.all() {
        let is_in_range = chunk.simd_ge(simd_range_start) & chunk.simd_le(simd_range_end);
        let converted = is_in_range.select(chunk ^ simd_xor_value, chunk);
        // SAFETY: output has enough capacity and we never read the uninitialized slice
        unsafe {
            let out_slice = core::slice::from_raw_parts_mut(out_ptr, LANES);
            converted.copy_to_slice(out_slice);
            out_ptr = out_ptr.add(LANES);
        }
        i += LANES;
        slice = &slice[LANES..];
    } else {
        break;
    }
}
│ 80:┌─→movdqu xmm3,XMMWORD PTR [r14+rcx*1]
0,12 │ │ pmovmskb edx,xmm3
0,12 │ │ test edx,edx
     │ │↓ jne d4
5,82 │ │ movdqa xmm4,xmm3
0,04 │ │ paddb xmm4,xmm0
     │ │ movdqa xmm5,xmm4
0,08 │ │ pminub xmm5,xmm1
10,05 │ │ pcmpeqb xmm5,xmm4
     │ │ movdqa xmm4,xmm5
     │ │ pandn xmm4,xmm3
     │ │ pxor xmm3,xmm2
7,02 │ │ pand xmm3,xmm5
0,04 │ │ por xmm3,xmm4
     │ │ movdqu XMMWORD PTR [rax+rcx*1],xmm3
     │ │ add rcx,0x10
9,69 │ │ mov rdx,rsi
     │ │ add rdx,0xfffffffffffffff0
     │ │ mov rsi,rdx
     │ ├──cmp rdx,0xf
5,03 │ └──ja 80
Do you have a preference here between autovectorization and explicit simd?
If the SIMD impl results in reasonable code across architectures, including some without SIMD, then that should be fine. I think it would be the first time we use portable simd unconditionally, so it could use some additional scrutiny.
If it's only meant to target x86 with SSE2 then that'd mean having multiple paths and in that case we'd still have to tweak the generic path anyway. At that point we might as well just optimize the latter further.
The assembly and performance are indeed better, but there is still some weird shuffling and shifting going on:
You can probably get rid of those too by keeping multiple bools in a small array. That way the optimizer will more easily see that it can shove them into independent simd lanes. Increasing the unroll count might help too to fit better into simd registers.
We do get pretty decent autovectorization in other places in the standard library by massaging things into a state that basically looks like arrays-instead-of-SIMD-types. E.g.
You can probably get rid of those too by keeping multiple bools in a small array. That way the optimizer will more easily see that it can shove them into independent simd lanes.
It seems llvm is too "smart" for this trick; on its own it generates the exact same code. The sum trick does work, although it seems a bit fragile. The following creates a pmovmskb instruction just like the simd version, but a slightly different alternative with a non_ascii array and comparing against 0 does not (a reconstruction of that alternative is sketched after the snippet below).
let mut is_ascii = [false; N];
for j in 0..N {
    is_ascii[j] = chunk[j] <= 127;
}
if is_ascii.into_iter().map(|x| x as u8).sum::<u8>() as usize != N {
    break;
}
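For contrast, the non-vectorizing alternative would look roughly like this; this is a reconstruction from the description above, not the exact code that was tested:

// Reconstruction (an assumption) of the described alternative: track the
// non-ascii lanes and compare the sum against 0 instead of summing the
// ascii lanes and comparing against N. Per the observation above, this
// variant is not lowered to a pmovmskb.
let mut non_ascii = [false; N];
for j in 0..N {
    non_ascii[j] = chunk[j] > 127;
}
if non_ascii.into_iter().map(|x| x as u8).sum::<u8>() != 0 {
    break;
}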
If the SIMD impl results in reasonable code across architectures, including some without SIMD, then that should be fine.
It only needs to be better than the autovectorized version on those platforms ;)
If it's only meant to target x86 with SSE2 then that'd mean having multiple paths and in that case we'd still have to tweak the generic path anyway. At that point we might as well just optimize the latter further.
My initial idea was to gate the simd code on either SSE2 or Neon (possibly also Altivec and RiscV). I'd also add a scalar loop for the non-multiple-of-N remainder, so all-ascii strings are fully handled by specialized code. Currently this remainder goes through the generic char_indices and conversions::to_lower code path. On other platforms, only the scalar loop would be used, since I assume the autovectorization would also generate suboptimal code.
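As a rough sketch, such a scalar remainder loop could look like the following; the names slice, out_ptr, i and convert are taken from the snippets above, and the loop itself is an assumption rather than the final code:

// Scalar tail for the last (len % LANES) bytes: keep converting while the
// input stays ascii, so all-ascii strings never reach the generic
// char_indices / conversions::to_lower fallback.
while let Some((&byte, rest)) = slice.split_first() {
    if byte > 127 {
        // First non-ascii byte: stop and let the generic path handle the rest.
        break;
    }
    // SAFETY: as above, the output has capacity for every input byte.
    unsafe {
        out_ptr.write(convert(&byte));
        out_ptr = out_ptr.add(1);
    }
    slice = rest;
    i += 1;
}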
But I agree that if similar code quality can be achieved with autovectorization, that would be preferable. I'll open a PR after polishing the code a bit more.
I'm looking into the performance of to_lowercase/to_uppercase on mostly ascii strings, using a small microbenchmark added to library/alloc/benches/string.rs. Using linux perf tooling I see that the hot part of the code is the following large loop, which despite heavy use of sse2 instructions only seems to process 32 bytes per iteration.

I don't see an easy way to improve the autovectorization of this code, but it should be relatively easy to explicitly vectorize it using portable_simd, and I would like to prepare such a PR if there are no objections. As far as I know, portable_simd is already in use inside core, for example by #103779.
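As a rough sketch, such a microbenchmark could look like the following; the benchmark name and input string are illustrative assumptions, not the exact code that was added:

use test::Bencher;

#[bench]
fn bench_to_lowercase_mostly_ascii(b: &mut Bencher) {
    // Mostly-ascii input with a few multi-byte characters, so the ascii fast
    // path dominates but the generic fallback is still exercised.
    let s = "Hello, World! Grüße aus München. ".repeat(32);
    b.bytes = s.len() as u64;
    b.iter(|| s.to_lowercase());
}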