rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.36k stars 12.59k forks source link

Inefficient codegen collecting slice iterator into array #126000

Open orlp opened 4 months ago

orlp commented 4 months ago

While working on Itertools::collect_array for itertools I wanted to compare the efficiency of

pub fn try_into(sl: &[u32]) -> Option<[u32; N]> {
    Some(*<&[u32; N]>::try_from(sl.get(0..N)?).unwrap())
}

pub fn collect_array(sl: &[u32]) -> Option<[u32; N]> {
    sl.iter().copied().collect_array()
}

You can see my comparison here: https://rust.godbolt.org/z/qPW3K8aTx.

I was rather shocked, both look good for N = 4, but for N = 16 we see the following nice implementation for try_into:

try_into:
        mov     rax, rdi
        xor     ecx, ecx
        cmp     rdx, 15
        jbe     .LBB0_2
        vmovups ymm0, ymmword ptr [rsi + 32]
        vmovups ymmword ptr [rax + 36], ymm0
        vmovups ymm0, ymmword ptr [rsi]
        vmovups ymmword ptr [rax + 4], ymm0
        mov     ecx, 1
.LBB0_2:
        mov     dword ptr [rax], ecx
        vzeroupper
        ret

But the following abomination for collect_array:

collect_array:
        mov     rax, rdi
        xor     ecx, ecx
        test    rdx, rdx
        je      .LBB1_14
        cmp     rdx, 1
        je      .LBB1_14
        cmp     rdx, 2
        je      .LBB1_14
        cmp     rdx, 3
        je      .LBB1_14
        cmp     rdx, 4
        je      .LBB1_14
        cmp     rdx, 5
        je      .LBB1_14
        cmp     rdx, 6
        je      .LBB1_14
        cmp     rdx, 7
        je      .LBB1_14
        cmp     rdx, 8
        je      .LBB1_14
        cmp     rdx, 9
        je      .LBB1_14
        cmp     rdx, 10
        je      .LBB1_14
        cmp     rdx, 11
        je      .LBB1_14
        and     rdx, -4
        cmp     rdx, 12
        je      .LBB1_14
        push    rbp
        push    r15
        push    r14
        push    r12
        push    rbx
        mov     ecx, dword ptr [rsi]
        mov     edx, dword ptr [rsi + 4]
        mov     edi, dword ptr [rsi + 8]
        mov     r8d, dword ptr [rsi + 12]
        mov     r9d, dword ptr [rsi + 16]
        mov     r10d, dword ptr [rsi + 20]
        mov     r11d, dword ptr [rsi + 24]
        mov     ebx, dword ptr [rsi + 28]
        mov     ebp, dword ptr [rsi + 32]
        mov     r14d, dword ptr [rsi + 36]
        mov     r15d, dword ptr [rsi + 40]
        mov     r12d, dword ptr [rsi + 44]
        mov     dword ptr [rax + 4], ecx
        mov     dword ptr [rax + 8], edx
        mov     dword ptr [rax + 12], edi
        mov     dword ptr [rax + 16], r8d
        mov     dword ptr [rax + 20], r9d
        mov     dword ptr [rax + 24], r10d
        mov     dword ptr [rax + 28], r11d
        mov     dword ptr [rax + 32], ebx
        mov     dword ptr [rax + 36], ebp
        mov     dword ptr [rax + 40], r14d
        mov     dword ptr [rax + 44], r15d
        mov     dword ptr [rax + 48], r12d
        vmovups xmm0, xmmword ptr [rsi + 48]
        vmovups xmmword ptr [rax + 52], xmm0
        mov     ecx, 1
        pop     rbx
        pop     r12
        pop     r14
        pop     r15
        pop     rbp
.LBB1_14:
        mov     dword ptr [rax], ecx
        ret

While it not folding the consecutive address movs into efficient SIMD movs is disappointing, I would argue that there's probably a bug somewhere since it compares rdx THIRTEEN TIMES IN A ROW to ultimately just check if it is < 16.

This doesn't appear to be a recent regression, the same happens in 1.60 through nightly, and it happens on both x86 as well as ARM.

orlp commented 4 months ago

And in case anyone wonders, the exact same bad code is generated if one uses arrayvec::ArrayVec rather than my ArrayBuilder:

pub fn next_array<I, T, const N: usize>(it: &mut I) -> Option<[T; N]>
where
    I: Iterator<Item = T>,
{
    let mut builder = arrayvec::ArrayVec::new();
    for _ in 0..N {
        builder.push(it.next()?);
    }
    builder.into_inner().ok()
}
the8472 commented 4 months ago

You could look at what next_chunk does in std, it uses MaybeUninit internally. And Copied has some specialization on top. Though itertool's job actually should be easier here since you're just returning an Option instead of a Result with a remainder.

DianQK commented 3 months ago

I reduced to the following code, but this is another issue:

#[no_mangle]
pub fn src(sl: &[u32]) -> Option<u32> {
    let mut r = 0;
    let it = &mut sl.iter();
    for _ in 0..2 {
        r += it.next()?;
    }
    Some(r)
}

#[no_mangle]
pub fn tgt(sl: &[u32]) -> Option<u32> {
    let mut r = 0;
    let it = &mut sl.iter();
    r += it.next()?;
    r += it.next()?;
    Some(r)
}

https://rust.godbolt.org/z/596qb413n