Open Kobzol opened 10 months ago
Hi!
I've investigated a bit and I am sharing what I gathered so far.
Disclaimer: I have mostly investigated he fact that the iterator based version is not marked nocapture
, but I don't know whether it is the cause for the missed CSE opportunity.
The main difference between the index based version and the iterator one is that the iterator ones increment pointers directly. It is basically equivalent to the following C code (which LLVM doesn't manage to mark as nocapture
either):
size_t __attribute__ ((noinline)) add_iter(const bool *begin, const bool *end) {
size_t count = 0;
for (const bool *ptr = begin; ptr < end; ++ptr) {
if (*ptr) {
count += 1;
}
}
return count;
}
add_index
%count.07 = phi i64 [ %spec.select, %bb5 ], [ 0, %start ]
%iter.sroa.0.06 = phi i64 [ %_0.i, %bb5 ], [ 0, %start ]
add_iter
%count.05 = phi i64 [ %spec.select, %bb3 ], [ 0, %start ]
%iter.sroa.0.04 = phi ptr [ %_30.i, %bb3 ], [ %bools.0, %start ]
I have investigated this further.
Disclaimer: I have mostly investigated he fact that the iterator based version is not marked
nocapture
, but I don't know whether it is the cause for the missed CSE opportunity.
So it turns out that the CSE happens before functions are marked nocapture
. Both could share a common cause, but the missed capturing analysis is not the cause of the missed CSE.
I have not figured why the CSE does not happen. However, I came up with a slightly different MRE that includes the CSE-able function being memchr.
#[inline(never)]
fn has_zero_index(xs: &[u8]) -> bool {
for i in 0..xs.len() {
if xs[i] == 0 {
return true
}
}
false
}
#[inline(never)]
pub fn has_zero_memchr(xs: &[u8]) -> bool {
xs.contains(&0)
}
#[inline(never)]
pub fn has_zero_iter(xs: &[u8]) -> bool {
xs.iter().any(|&x| x == 0)
}
#[inline(never)]
fn has_zero_ptr(xs: &[u8]) -> bool {
let range = xs.as_ptr_range();
let mut start = range.start;
let end = range.end;
while start < end {
unsafe {
if *start == 0 {
return true
}
start = start.add(1);
}
}
false
}
pub fn foo_index(xs: &[u8]) {
// CSE
println!("a: {}", has_zero_index(xs));
println!("b: {}", has_zero_index(xs));
}
pub fn foo_memchr(xs: &[u8]) {
// No CSE
println!("a: {}", has_zero_memchr(xs));
println!("b: {}", has_zero_memchr(xs));
}
pub fn foo_iter(xs: &[u8]) {
// No CSE
println!("a: {}", has_zero_iter(xs));
println!("b: {}", has_zero_iter(xs));
}
pub fn foo_ptr(xs: &[u8]) {
// No CSE
println!("a: {}", has_zero_ptr(xs));
println!("b: {}", has_zero_ptr(xs));
}
So it turns out that the CSE happens before functions are marked
nocapture
. Both could share a common cause, but the missed capturing analysis is not the cause of the missed CSE.I have not figured why the CSE does not happen. However, I came up with a slightly different MRE that includes the CSE-able function being memchr.
The reason should be the lack of and nounwind
memory(argmem: read)
.
Maybe this is due to #111603 (comment) and llvm/llvm-project#74228 ?
It's a similar issue. Based on https://github.com/llvm/llvm-project/pull/74228, the following changes can solve this issue.
--- a/llvm/lib/Transforms/IPO/FunctionAttrs.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionAttrs.cpp
@@ -118,7 +118,7 @@ static void addLocAccess(MemoryEffects &ME, const MemoryLocation &Loc,
if (isNoModRef(MR))
return;
- const Value *UO = getUnderlyingObject(Loc.Ptr);
+ const Value *UO = getUnderlyingObjectLookThrough(Loc.Ptr);
assert(!isa<AllocaInst>(UO) &&
"Should have been handled by getModRefInfoMask()");
if (isa<Argument>(UO)) {
Maybe @caojoshua will submit a new PR after this one. cc @caojoshua.
Fixed by https://github.com/llvm/llvm-project/pull/100102 at LLVM 20. @rustbot label +llvm-fixed-upstream
While examining a weird performance regression on Zulip, we noticed a peculiar thing caused by a simple for loop.
Consider this code:
These two functions do the same thing, but the first one uses slice indexing, while the other one uses a more natural for loop. One would expect that the second version will optimize better (or same).
However, when these functions are called multiple times, only the first function is eligible for CSE (common subexpression elimination):
Link to Godbolt.
When examining these two functions, I noticed that the indexed version uses
nocapture
for its argument, while the for loop version doesn't:which might be the reason of why CSE isn't applied.
It seems weird that the canonical way of iterating over a slice produces worse code (or at least function attributes) than manual indexing.