rust-bakery / nom

Rust parser combinator framework

Optimisation of `impl Compare<&[u8]> for &[u8]` is not optimal #1676

Open MTCoster opened 11 months ago

MTCoster commented 11 months ago

Back in 2018, a commit described as "various optimisations" (8a977eea083487cea36c495192a1239682926ace) landed. One of its changes switched `impl Compare<&[u8]> for &[u8]` (lifetimes omitted) from a straight byte-slice comparison to an iterator-based approach.

However, I'm not convinced that this was actually an optimisation (or at least if it was at one point, I don't think it is anymore).

An analysis

Godbolt link for reference.

Let's look at two functions: `compare_current`, a straight copy of the existing code, and `compare_simpler`, an almost exact copy of the original ("unoptimised") code, now using `std::cmp::min()`. I've replaced `&self` with a second parameter to make these standalone functions.
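Both functions return nom's `CompareResult`, which (roughly, with the derives needed for the checks below) looks like:

// nom's three-way comparison result: a successful match, a mismatch,
// or "ran out of input before the needle was exhausted".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompareResult {
    Ok,
    Incomplete,
    Error,
}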

pub fn compare_current<'a, 'b>(a: &'a [u8], b: &'b [u8]) -> CompareResult {
    let pos = a.iter().zip(b.iter()).position(|(a, b)| a != b);

    match pos {
        Some(_) => CompareResult::Error,
        None => {
            if a.len() >= b.len() {
                CompareResult::Ok
            } else {
                CompareResult::Incomplete
            }
        }
    }
}
pub fn compare_simpler<'a, 'b>(a: &'a [u8], b: &'b [u8]) -> CompareResult {
    let len = ::std::cmp::min(a.len(), b.len());

    if &a[..len] != &b[..len] {
        CompareResult::Error
    } else if len < b.len() {
        CompareResult::Incomplete
    } else {
        CompareResult::Ok
    }
}
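
For completeness, here's a quick check (not part of the original analysis) that the two versions agree on all three outcomes:

// Exercise the Ok, Incomplete, and Error paths of both implementations.
fn main() {
    for (a, b, expected) in [
        (&b"nom"[..], &b"nom"[..], CompareResult::Ok),        // exact match
        (&b"no"[..], &b"nom"[..], CompareResult::Incomplete), // input too short
        (&b"nXm"[..], &b"nom"[..], CompareResult::Error),     // mismatch
    ] {
        assert_eq!(compare_current(a, b), expected);
        assert_eq!(compare_simpler(a, b), expected);
    }
}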

On x86_64, these are compiled (at `-C opt-level=3`) to:

example::compare_current:
        cmp     rsi, rcx
        mov     rax, rcx
        cmovb   rax, rsi
        xor     r8d, r8d
.LBB0_1:
        cmp     rax, r8
        je      .LBB0_4
        movzx   r9d, byte ptr [rdi + r8]
        lea     r10, [r8 + 1]
        cmp     r9b, byte ptr [rdx + r8]
        mov     r8, r10
        je      .LBB0_1
        mov     al, 2
        ret
.LBB0_4:
        cmp     rsi, rcx
        setb    al
        ret
example::compare_simpler:
        push    r14
        push    rbx
        push    rax
        mov     rbx, rcx
        mov     r14, rsi
        cmp     rsi, rcx
        mov     rax, rcx
        cmovb   rax, rsi
        mov     rsi, rdx
        mov     rdx, rax
        call    qword ptr [rip + bcmp@GOTPCREL]
        xor     ecx, ecx
        cmp     rbx, r14
        seta    cl
        test    eax, eax
        mov     eax, 2
        cmove   eax, ecx
        add     rsp, 8
        pop     rbx
        pop     r14
        ret

The gist here is that the current implementation compiles to a tight fully-rolled loop, while the simpler implementation uses a single bcmp (an LLVM builtin, lowered from memcmp since LLVM 9).

I have not done my own micro-benchmarking of these two functions, but it seems unlikely that a byte-at-a-time loop can beat the bcmp call the compiler emits for the simpler version.
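
For anyone who wants to measure, a minimal Criterion harness (my own sketch, assuming the two functions above are in scope and `criterion` is a dev-dependency) could look like:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Compare both variants on identical 4 KiB inputs: the worst case for the
// loop, since it must walk the whole slice before it can report Ok.
fn bench_compare(c: &mut Criterion) {
    let a = vec![0u8; 4096];
    let b = vec![0u8; 4096];
    c.bench_function("compare_current", |bench| {
        bench.iter(|| compare_current(black_box(&a[..]), black_box(&b[..])))
    });
    c.bench_function("compare_simpler", |bench| {
        bench.iter(|| compare_simpler(black_box(&a[..]), black_box(&b[..])))
    });
}

criterion_group!(benches, bench_compare);
criterion_main!(benches);

Short needles (a handful of bytes, as in a typical tag) would be worth benchmarking separately, since that's the common case in a parser.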

Another (perhaps more interesting) comparison can be made when `b` has a fixed size. I also had to give `a` a fixed size before the compiler would change anything, though, so this may be irrelevant for the use case of `Compare`. I suspect this is because any attempt to optimise the comparison for a variable-sized input would duplicate a significant part of the bcmp implementation, and is thus pointless, since bcmp itself could be inlined by LTO.

pub fn compare_current_short<'a, 'b>(a: &'a [u8; 1024 * 1024], b: &'b [u8; 8]) -> CompareResult {
    compare_current(a, b)
}

pub fn compare_simpler_short<'a, 'b>(a: &'a [u8; 1024 * 1024], b: &'b [u8; 8]) -> CompareResult {
    compare_simpler(a, b)
}

Given these two wrapper functions, rustc always inlines the body:

example::compare_current_short:
        movzx   ecx, byte ptr [rdi]
        mov     al, 2
        cmp     cl, byte ptr [rsi]
        jne     .LBB2_8
        movzx   ecx, byte ptr [rdi + 1]
        cmp     cl, byte ptr [rsi + 1]
        jne     .LBB2_8
        movzx   ecx, byte ptr [rdi + 2]
        cmp     cl, byte ptr [rsi + 2]
        jne     .LBB2_8
        movzx   ecx, byte ptr [rdi + 3]
        cmp     cl, byte ptr [rsi + 3]
        jne     .LBB2_8
        movzx   ecx, byte ptr [rdi + 4]
        cmp     cl, byte ptr [rsi + 4]
        jne     .LBB2_8
        movzx   ecx, byte ptr [rdi + 5]
        cmp     cl, byte ptr [rsi + 5]
        jne     .LBB2_8
        movzx   ecx, byte ptr [rdi + 6]
        cmp     cl, byte ptr [rsi + 6]
        jne     .LBB2_8
        movzx   eax, byte ptr [rdi + 7]
        cmp     al, byte ptr [rsi + 7]
        setne   al
        add     al, al
.LBB2_8:
        ret
example::compare_simpler_short:
        mov     rax, qword ptr [rdi]
        cmp     rax, qword ptr [rsi]
        setne   al
        add     al, al
        ret

Uhh...

It seems as though the loop made LLVM unable to reason about the intent, and the best it could do was unroll it rather than collapse it into a single wide comparison the way it did for the simpler implementation.


For the curious, the story is basically the same on aarch64, except the memcmp isn't lowered to the (potentially faster) bcmp for some reason.

example::compare_current:
        cmp     x1, x3
        csel    x8, x1, x3, lo
LBB0_1:
        cbz     x8, LBB0_4
        ldrb    w9, [x0], #1
        ldrb    w10, [x2], #1
        sub     x8, x8, #1
        cmp     w9, w10
        b.eq    LBB0_1
        mov     w0, #2
        ret
LBB0_4:
        cmp     x1, x3
        cset    w0, lo
        ret

example::compare_simpler:
        stp     x20, x19, [sp, #-32]!
        stp     x29, x30, [sp, #16]
        add     x29, sp, #16
        mov     x19, x3
        mov     x8, x2
        mov     x20, x1
        cmp     x1, x3
        csel    x2, x1, x3, lo
        mov     x1, x8
        bl      _memcmp
        cmp     x19, x20
        cset    w8, hi
        mov     w9, #2
        cmp     w0, #0
        csel    w0, w8, w9, eq
        ldp     x29, x30, [sp, #16]
        ldp     x20, x19, [sp], #32
        ret

example::compare_current_short:
        ldrb    w8, [x0]
        ldrb    w9, [x1]
        cmp     w8, w9
        b.ne    LBB2_8
        ldrb    w8, [x0, #1]
        ldrb    w9, [x1, #1]
        cmp     w8, w9
        b.ne    LBB2_8
        ldrb    w8, [x0, #2]
        ldrb    w9, [x1, #2]
        cmp     w8, w9
        b.ne    LBB2_8
        ldrb    w8, [x0, #3]
        ldrb    w9, [x1, #3]
        cmp     w8, w9
        b.ne    LBB2_8
        ldrb    w8, [x0, #4]
        ldrb    w9, [x1, #4]
        cmp     w8, w9
        b.ne    LBB2_8
        ldrb    w8, [x0, #5]
        ldrb    w9, [x1, #5]
        cmp     w8, w9
        b.ne    LBB2_8
        ldrb    w8, [x0, #6]
        ldrb    w9, [x1, #6]
        cmp     w8, w9
        b.ne    LBB2_8
        ldrb    w8, [x0, #7]
        ldrb    w9, [x1, #7]
        cmp     w8, w9
        cset    w8, ne
        lsl     w0, w8, #1
        ret
LBB2_8:
        mov     w0, #2
        ret

example::compare_simpler_short:
        ldr     x8, [x0]
        ldr     x9, [x1]
        cmp     x8, x9
        cset    w8, ne
        lsl     w0, w8, #1
        ret
epage commented 11 months ago

I tried this change on a related parser combinator library where the bench is parsing a very large json file:

  • Iterator: 9.5 ms

  • Equality: 11.6 ms

MTCoster commented 11 months ago

> I tried this change on a related parser combinator library where the bench is parsing a very large json file:
>
>   • Iterator: 9.5 ms
>
>   • Equality: 11.6 ms

Would you mind sharing the project and JSON file? I'd love to poke around at the binaries to see what's going on

epage commented 11 months ago

> Would you mind sharing the project and JSON file? I'd love to poke around at the binaries to see what's going on

I ran `cargo bench --bench json -- basic/canada`

epage commented 11 months ago

Something I suspect is that whatever your code gets inlined into has as much effect on performance as the code itself.

A possible alternative experiment is to not inline the compare call. The `#[inline]` attributes are a mix, some inherited from nom and some added later. In a lot of cases they dramatically helped performance, but I've also found cases where they hurt.
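
A minimal version of that experiment (a hypothetical wrapper, not the library's actual code) just forces the comparison out-of-line so the surrounding parser can't affect its codegen:

// Hypothetical: #[inline(never)] isolates the comparison's codegen from
// whatever call site it would otherwise be inlined into.
#[inline(never)]
pub fn compare_not_inlined(a: &[u8], b: &[u8]) -> CompareResult {
    compare_simpler(a, b)
}

Benchmarking that against an equally out-of-line `compare_current` would separate the cost of the comparison itself from inlining effects.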

Geal commented 11 months ago

This is interesting, and it's always a good idea to revisit old optimisations to see if they still hold up. I'll look into it a bit. I suspect that if the simpler version is slower here, it is due to the overhead of calling into bcmp. `Compare` in nom is mainly used for very short strings, so for those a small loop might be faster.
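
For illustration, a hybrid along those lines (a hypothetical sketch, not nom's code) could keep the loop for short needles and defer to the slice comparison, and hence bcmp, for longer ones:

// Hypothetical hybrid: a plain loop for the short tags Compare usually
// sees, the bcmp-backed slice comparison for anything longer.
pub fn compare_hybrid(a: &[u8], b: &[u8]) -> CompareResult {
    let len = a.len().min(b.len());
    let equal = if len <= 16 {
        a[..len].iter().zip(&b[..len]).all(|(x, y)| x == y)
    } else {
        a[..len] == b[..len]
    };
    if !equal {
        CompareResult::Error
    } else if a.len() < b.len() {
        CompareResult::Incomplete
    } else {
        CompareResult::Ok
    }
}

Whether the cutoff (16 here, chosen arbitrarily) or the extra branch pays off is exactly the kind of thing the benchmarks above would have to decide.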