Missing CSE caused by for loop on slice

Kobzol commented 10 months ago

While examining a weird performance regression on Zulip, we noticed a peculiar thing caused by a simple for loop.

Consider this code:

#[inline(never)]
fn add_index(bools: &[bool]) -> usize {
    let mut count: usize = 0;
    for i in 0..bools.len() {
        if bools[i] {
            count += 1;
        }
    }
    count
}

#[inline(never)]
fn add_iter(bools: &[bool]) -> usize {
    let mut count: usize = 0;
    for b in bools {
        if *b {
            count += 1;
        }
    }
    count
}

These two functions do the same thing, but the first one uses slice indexing, while the second one uses a more idiomatic for loop over the slice. One would expect the second version to optimize at least as well as the first.

However, when each function is called twice with the same argument, only the calls to the first function are merged by CSE (common subexpression elimination):

pub fn foo1(bools: &[bool]) {
    // CSE applied
    println!("a: {}", add_index(bools));
    println!("b: {}", add_index(bools));
}

pub fn foo2(bools: &[bool]) {
    // CSE not applied
    println!("a: {}", add_iter(bools));
    println!("b: {}", add_iter(bools));
}

Link to Godbolt.
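For reference, the result one would hope for in foo2 is equivalent to calling the function once and reusing the value. A minimal hand-written sketch follows; foo2_manual_cse is a hypothetical name that does not appear in the issue, and the rewrite is only valid because add_iter has no side effects and the slice sits behind a shared reference, so it cannot change between the two calls.

```rust
/// Same body as `add_iter` from the issue.
#[inline(never)]
pub fn add_iter(bools: &[bool]) -> usize {
    let mut count: usize = 0;
    for b in bools {
        if *b {
            count += 1;
        }
    }
    count
}

/// Hypothetical hand-written equivalent of `foo2` after CSE:
/// the call is evaluated once and its result is printed twice.
pub fn foo2_manual_cse(bools: &[bool]) {
    let count = add_iter(bools);
    println!("a: {}", count);
    println!("b: {}", count);
}
```

This is what the optimizer already produces for foo1; the issue is that it does not do the same for foo2.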

When examining these two functions, I noticed that the indexed version uses nocapture for its argument, while the for loop version doesn't:

define noundef i64 @add_index(ptr noalias nocapture noundef nonnull readonly align 1 %bools.0, i64 noundef %bools.1) unnamed_addr #0 !dbg !7 {

define noundef i64 @add_iter(ptr noalias noundef nonnull readonly align 1 %bools.0, i64 noundef %bools.1) unnamed_addr #1 !dbg !57 {

which might be the reason why CSE isn't applied.

It seems weird that the canonical way of iterating over a slice produces worse code (or at least function attributes) than manual indexing.

the8472 commented 10 months ago

Maybe this is due to https://github.com/rust-lang/rust/issues/111603#issuecomment-1566549634 and https://github.com/llvm/llvm-project/pull/74228 ?

krtab commented 10 months ago

Hi!

I've investigated a bit and I am sharing what I gathered so far.

Disclaimer: I have mostly investigated the fact that the iterator-based version is not marked nocapture, but I don't know whether it is the cause of the missed CSE opportunity.

The main difference between the index-based version and the iterator-based one is that the iterator-based one increments pointers directly. It is basically equivalent to the following C code (which LLVM doesn't manage to mark as nocapture either):

size_t __attribute__ ((noinline)) add_iter(const bool *begin, const bool *end) {
    size_t count = 0;
    for (const bool *ptr = begin; ptr < end; ++ptr) {
        if (*ptr) {
            count += 1;
        }
    }
    return count;
}
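In Rust terms, the same pointer-walking loop can be sketched roughly as below. The name add_ptr is hypothetical, and this is only an approximation of the shape the slice iterator takes after inlining, not the standard library's actual implementation.

```rust
/// Hypothetical pointer-based counterpart of `add_iter`:
/// walks from the start pointer to the end pointer of the slice.
#[inline(never)]
pub fn add_ptr(bools: &[bool]) -> usize {
    let range = bools.as_ptr_range();
    let mut ptr = range.start;
    let mut count: usize = 0;
    while ptr < range.end {
        // SAFETY: `ptr` is in [start, end) of a valid slice, so it is in
        // bounds and points to an initialized `bool`. Advancing by one
        // yields at most the one-past-the-end pointer, which is allowed.
        unsafe {
            if *ptr {
                count += 1;
            }
            ptr = ptr.add(1);
        }
    }
    count
}
```

The loop variable here is a pointer derived from the argument, rather than an integer index that is added to the argument on each access.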

The nocapture attribute is given to add_index by the LLVM pass [PostOrderFunctionAttrsPass](https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:rust,selection:(endColumn:1,endLineNumber:33,positionColumn:1,positionLineNumber:33,selectionStartColumn:1,selectionStartLineNumber:33,startColumn:1,startLineNumber:33),source:'%23%5Binline(never)%5D%0Afn+add_index(bools:+%26%5Bbool%5D)+-%3E+usize+%7B%0A++++let+mut+count:+usize+%3D+0%3B%0A++++for+i+in+0..bools.len()+%7B%0A++++++++if+bools%5Bi%5D+%7B%0A++++++++++++count+%2B%3D+1%3B%0A++++++++%7D%0A++++%7D%0A++++count%0A%7D%0A%23%5Binline(never)%5D%0Afn+add_iter(bools:+%26%5Bbool%5D)+-%3E+usize+%7B%0A++++let+mut+count:+usize+%3D+0%3B%0A++++for+b+in+bools+%7B%0A++++++++if+*b+%7B%0A++++++++++++count+%2B%3D+1%3B%0A++++++++%7D%0A++++%7D%0A++++count%0A%7D%0A%0Apub+fn+foo1(bools:+%26%5Bbool%5D)+%7B%0A++++//+CSE%0A++++println!!(%22a:+%7B%7D%22,+add_index(bools))%3B%0A++++println!!(%22b:+%7B%7D%22,+add_index(bools))%3B%0A%7D%0A%0Apub+fn+foo2(bools:+%26%5Bbool%5D)+%7B%0A++++//+No+CSE%0A++++println!!(%22a:+%7B%7D%22,+add_iter(bools))%3B%0A++++println!!(%22b:+%7B%7D%22,+add_iter(bools))%3B%0A%7D%0A'),l:'5',n:'0',o:'Rust+source+%231',t:'0')),k:33.37427523425665,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:nightly,filters:(b:'1',binary:'1',binaryObject:'1',commentOnly:'1',debugCalls:'1',demangle:'0',directives:'1',execute:'1',intel:'0',libraryCode:'1',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:2,lang:rust,libs:!(),options:'-O',overrides:!((name:edition,value:'2021')),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+rustc+nightly+(Editor+%231)',t:'0')),k:33.292391432410035,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:optPipelineView,i:('-fno-discard-value-names':'0',compilerName:'rustc+nightly',demangle-symbols:'0',dump-full-module:'1',editorid:1,filter-debug-info:'0',filter-inconsequential-passes:'0',filter-instruction-metadata:'0',fontScale:14,fontUsePx:'0',j:2,selectedGroup:'example::add_index',selectedIndex:70,sidebarWidth:250,treeid:0),l:'5',n:'0',o:'Opt+Pipeline+Viewer+rustc+nightly+(Editor+%231,+Compiler+%232)',t:'0')),j:__glMaximised,k:33.33333333333333,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4). However, this pass does not manage to infer it for add_iter. This may be because %bools.0 appears in a phi node in add_iter, whereas the phi nodes in add_index only carry integers:

add_index

  %count.07 = phi i64 [ %spec.select, %bb5 ], [ 0, %start ]
  %iter.sroa.0.06 = phi i64 [ %_0.i, %bb5 ], [ 0, %start ]

add_iter

  %count.05 = phi i64 [ %spec.select, %bb3 ], [ 0, %start ]
  %iter.sroa.0.04 = phi ptr [ %_30.i, %bb3 ], [ %bools.0, %start ]

krtab commented 10 months ago

I have investigated this further.

Disclaimer: I have mostly investigated the fact that the iterator-based version is not marked nocapture, but I don't know whether it is the cause of the missed CSE opportunity.

So it turns out that the CSE happens before functions are marked nocapture. Both could share a common cause, but the missed capturing analysis is not the cause of the missed CSE.

I have not figured out why the CSE does not happen. However, I came up with a slightly different MRE, including a variant in which the function that should be CSE-able is based on memchr.

Godbolt

#[inline(never)]
fn has_zero_index(xs: &[u8]) -> bool {
    for i in 0..xs.len() {
        if xs[i] == 0 {
            return true
        }
    }
    false
}

#[inline(never)]
pub fn has_zero_memchr(xs: &[u8]) -> bool {
    xs.contains(&0)
}

#[inline(never)]
pub fn has_zero_iter(xs: &[u8]) -> bool {
    xs.iter().any(|&x| x == 0)
}

#[inline(never)]
fn has_zero_ptr(xs: &[u8]) -> bool {
    let range = xs.as_ptr_range();
    let mut start = range.start;
    let end = range.end;
    while start < end {
        unsafe {
            if *start == 0 {
                return true
            }
            start = start.add(1);
        }
    }
    false
}

pub fn foo_index(xs: &[u8]) {
    // CSE
    println!("a: {}", has_zero_index(xs));
    println!("b: {}", has_zero_index(xs));
}

pub fn foo_memchr(xs: &[u8]) {
    // No CSE
    println!("a: {}", has_zero_memchr(xs));
    println!("b: {}", has_zero_memchr(xs));
}

pub fn foo_iter(xs: &[u8]) {
    // No CSE
    println!("a: {}", has_zero_iter(xs));
    println!("b: {}", has_zero_iter(xs));
}

pub fn foo_ptr(xs: &[u8]) {
    // No CSE
    println!("a: {}", has_zero_ptr(xs));
    println!("b: {}", has_zero_ptr(xs));
}
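All four variants compute the same predicate, so the difference lies only in what LLVM manages to infer about them. A quick sanity check of that equivalence could look like the sketch below; agreed_answer is a hypothetical helper that is not part of the MRE, and the four functions are repeated (as pub, without the inline attributes) only to keep the snippet self-contained.

```rust
pub fn has_zero_index(xs: &[u8]) -> bool {
    for i in 0..xs.len() {
        if xs[i] == 0 {
            return true;
        }
    }
    false
}

pub fn has_zero_memchr(xs: &[u8]) -> bool {
    xs.contains(&0)
}

pub fn has_zero_iter(xs: &[u8]) -> bool {
    xs.iter().any(|&x| x == 0)
}

pub fn has_zero_ptr(xs: &[u8]) -> bool {
    let range = xs.as_ptr_range();
    let mut start = range.start;
    while start < range.end {
        // SAFETY: `start` is in [range.start, range.end) of a valid slice.
        unsafe {
            if *start == 0 {
                return true;
            }
            start = start.add(1);
        }
    }
    false
}

/// Hypothetical helper: `Some(answer)` if all four variants agree,
/// `None` if any of them disagrees.
pub fn agreed_answer(xs: &[u8]) -> Option<bool> {
    let results = [
        has_zero_index(xs),
        has_zero_memchr(xs),
        has_zero_iter(xs),
        has_zero_ptr(xs),
    ];
    if results.iter().all(|&r| r == results[0]) {
        Some(results[0])
    } else {
        None
    }
}
```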

DianQK commented 10 months ago

> So it turns out that the CSE happens before functions are marked nocapture. Both could share a common cause, but the missed capturing analysis is not the cause of the missed CSE.
>
> I have not figured why the CSE does not happen. However, I came up with a slightly different MRE that includes the CSE-able function being memchr.

The reason should be the lack of ~~nounwind~~ and memory(argmem: read).

DianQK commented 10 months ago

> Maybe this is due to #111603 (comment) and llvm/llvm-project#74228 ?

It's a similar issue. Based on https://github.com/llvm/llvm-project/pull/74228, the following changes can solve this issue.

--- a/llvm/lib/Transforms/IPO/FunctionAttrs.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionAttrs.cpp
@@ -118,7 +118,7 @@ static void addLocAccess(MemoryEffects &ME, const MemoryLocation &Loc,
   if (isNoModRef(MR))
     return;

-  const Value *UO = getUnderlyingObject(Loc.Ptr);
+  const Value *UO = getUnderlyingObjectLookThrough(Loc.Ptr);
   assert(!isa<AllocaInst>(UO) &&
          "Should have been handled by getModRefInfoMask()");
   if (isa<Argument>(UO)) {
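To illustrate why looking through phi nodes matters here, the following is a toy model in Rust, not LLVM's implementation. The Value enum, the function names and the value numbering are all hypothetical, and treating %_30.i as a GEP on the phi is an assumption based on the IR excerpt quoted earlier. The plain walk strips pointer arithmetic and stops at the loop phi, so it never reaches the argument; the look-through walk also follows phi inputs and finds that every path leads back to %bools.0.

```rust
use std::collections::HashSet;

/// Toy pointer values; each value is identified by its index in a slice.
#[derive(Debug)]
pub enum Value {
    Argument,                     // a pointer function argument
    Gep { base: usize },          // pointer arithmetic on `base`
    Phi { incoming: Vec<usize> }, // merge of several incoming pointers
}

/// Strips pointer arithmetic only; stops at the first non-GEP value.
pub fn underlying_object(values: &[Value], mut id: usize) -> usize {
    while let Value::Gep { base } = &values[id] {
        id = *base;
    }
    id
}

/// Also follows phi inputs. Returns the single underlying object if all
/// paths lead to the same one, and `None` otherwise.
pub fn underlying_object_look_through(values: &[Value], id: usize) -> Option<usize> {
    let mut visited = HashSet::new();
    let mut worklist = vec![id];
    let mut found: Option<usize> = None;
    while let Some(next) = worklist.pop() {
        let cur = underlying_object(values, next);
        if !visited.insert(cur) {
            continue; // already seen, e.g. the loop's back edge
        }
        match &values[cur] {
            Value::Phi { incoming } => worklist.extend(incoming.iter().copied()),
            _ => {
                if found.is_some() {
                    return None; // two distinct underlying objects
                }
                found = Some(cur);
            }
        }
    }
    found
}

/// The pointer values of `add_iter`, modelled on the IR excerpt above.
pub fn add_iter_pointer_values() -> Vec<Value> {
    vec![
        Value::Argument,                     // 0: %bools.0
        Value::Phi { incoming: vec![2, 0] }, // 1: %iter.sroa.0.04
        Value::Gep { base: 1 },              // 2: %_30.i (assumed GEP)
    ]
}
```

In this model, underlying_object on value 2 ends at the phi (value 1), while underlying_object_look_through resolves it to the argument (value 0), which is the case the patched check then recognizes as an Argument.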

Maybe @caojoshua will submit a new PR after this one. cc @caojoshua.

DianQK commented 4 months ago

Fixed by https://github.com/llvm/llvm-project/pull/100102 in LLVM 20. @rustbot label +llvm-fixed-upstream

rust-lang / rust

Missing CSE caused by for loop on slice #119573