rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
96.7k stars 12.5k forks source link

Missed autovectorization for slice.iter.fold, works for slice.iter.copied.fold #113789

Open jhorstmann opened 1 year ago

jhorstmann commented 1 year ago

I was looking into optimizing a function that checks that all values in a slice are in range. It is not that surprising that the version with all does not get optimized because returning early (although in theory rust should be allowed to read more elements from the slice before breaking), but it is surprising that adding copied before folding makes a difference in autovectorization.

Sample code (https://rust.godbolt.org/z/5eznWbMcf):

pub fn check_range_all(keys: &[u32], max: u32) -> bool {
    keys.iter().all(|x| *x < max)
}

pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().fold(true, |a, x| a && *x < max)
}

pub fn check_range_copied_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().copied().fold(true, |a, x| a && x < max)
}
the8472 commented 1 year ago
pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().fold(true, |a, x| a & (*x < max))
}

does vectorize. Then the question is why that's not necessary for the copied() iterator.

Jules-Bertholet commented 1 year ago

@rustbot label A-codegen I-heavy I-slow

jhorstmann commented 1 year ago

@the8472 thanks, that's a very good data point that I did not consider. That would be consistent with llvm not eagerly de-referencing. In fact this also gets vectorized:

pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().fold(true, |a, x| (*x < max) && a)
}

I could live with that behavior and explanation and would be ok with closing the issue.

the8472 commented 1 year ago

The IR for the closure does mark the pointer as dereferencable, which means it could be hoisted out of the comparison.

; example::check_range_fold::{{closure}}
; Function Attrs: inlinehint nonlazybind uwtable
define internal noundef zeroext i1 @"_ZN7example16check_range_fold28_$u7b$$u7b$closure$u7d$$u7d$17hcf54473b332a3b17E"(ptr noalias noundef align 8 dereferenceable(8) %_1, i1 noundef zeroext %a, ptr noalias noundef readonly align 4 dereferenceable(4) %x) unnamed_addr #0 !dbg !121 {
start:
  %_0 = alloca i8, align 1
  br i1 %a, label %bb2, label %bb1, !dbg !123

bb1:                                              ; preds = %start
  store i8 0, ptr %_0, align 1, !dbg !123
  br label %bb3, !dbg !123

bb2:                                              ; preds = %start
  %_5 = load i32, ptr %x, align 4, !dbg !124
  %_7 = load ptr, ptr %_1, align 8, !dbg !125
  %_6 = load i32, ptr %_7, align 4, !dbg !125
  %_4 = icmp ult i32 %_5, %_6, !dbg !124
  %0 = zext i1 %_4 to i8, !dbg !123
  store i8 %0, ptr %_0, align 1, !dbg !123
  br label %bb3, !dbg !123

bb3:                                              ; preds = %bb2, %bb1
  %1 = load i8, ptr %_0, align 1, !dbg !126
  %2 = trunc i8 %1 to i1, !dbg !126
  ret i1 %2, !dbg !126
}

After optimizations the unrolled loop loads the values unconditionally, so it's not like it found it beneficial to skip the memory accesses and bail out early

.LBB0_5:
        cmp     dword ptr [rdi + 4*r8], edx
        setb    r9b
        and     r9b, al
        cmp     dword ptr [rdi + 4*r8 + 4], edx
        setb    al
        cmp     dword ptr [rdi + 4*r8 + 8], edx
        setb    r10b
        and     r10b, al
        and     r10b, r9b
        cmp     dword ptr [rdi + 4*r8 + 12], edx
        setb    al
        cmp     dword ptr [rdi + 4*r8 + 16], edx
        setb    r9b
        and     r9b, al
        cmp     dword ptr [rdi + 4*r8 + 20], edx
        setb    r11b
        and     r11b, r9b
        and     r11b, r10b
        cmp     dword ptr [rdi + 4*r8 + 24], edx
        setb    r9b
        cmp     dword ptr [rdi + 4*r8 + 28], edx
        setb    al
        and     al, r9b
        and     al, r11b
        add     r8, 8
        cmp     rsi, r8
        jne     .LBB0_5

CC @nikic

jhorstmann commented 11 months ago

This issue looks similar to #105259, with slightly simpler code to reproduce it