flat_map fails to autovectorize internal iterators

james7132 commented 1 year ago

#[no_mangle]
fn iter(query: &mut [&mut [f32]]) {
    for segment in query {
        for value in segment.iter_mut() {
            *value += 1.0;
        }
    }
}

#[no_mangle]
fn flat_map(query: &mut [&mut [f32]]) {
    let iter = query.iter_mut().flat_map(|segment| segment.iter_mut());
    for value in iter {
        *value += 1.0;
    }
}

I expected to see this happen: Both versions of the function to autovectorize.

Instead, this happened: only the normal for-loop version autovectorizes. See https://godbolt.org/z/caTqoq3x3.

Meta

rustc --version --verbose:

rustc 1.65.0 (897e37553 2022-11-02)
binary: rustc
commit-hash: 897e37553bba8b42751c67658967889d11ecd120
commit-date: 2022-11-02
host: x86_64-pc-windows-msvc
release: 1.65.0
LLVM version: 15.0.0

Also happens on nightly and for other "flattened" iterators. For example, Bevy's QueryIter and Query::for_each show this exact API split due to this performance difference.

There seems to be #87411 that was targeting something similar.

the8472 commented 1 year ago

#[no_mangle]
fn flat_map(query: &mut [&mut [f32]]) {
    let iter = query.iter_mut().flat_map(|segment| segment.iter_mut());
    for value in iter {
        *value += 1.0;
    }
}

That's still external iteration. If you want internal iteration use

fn flat_map(query: &mut [&mut [f32]]) {
    query.iter_mut().flat_map(AsMut::as_mut).for_each(|value| *value += 1.0);
}

which does vectorize.

james7132 commented 1 year ago

Is there a way for us to get it to vectorize without internal iteration? The aforementioned case in Bevy cannot be separated like flat_map, and we'd rather encourage use of external iteration as it's generally more idiomatic.

the8472 commented 1 year ago

Well, for_each was more or less introduced for that purpose. The issue is the inner loop in Flatten's next() which makes it hard to optimize external iteration. There are some LLVM passes that are disabled even on O3 and the Polly plugin that can do some additional loop transformations but I assume there's a reason why they're off.

Other than that you could see if you can structure your data to return arrays or an iterator-of-iterators so users could write their own nested loops.

james7132 commented 1 year ago

The issue is the inner loop in Flatten's next() which makes it hard to optimize external iteration.

The implementation we have looks very similar.

Other than that you could see if you can structure your data to return arrays or an iterator-of-iterators so users could write their own nested loops.

This unfortunately is a 100% deal breaker since that fundamentally breaks the UX we're aiming to expose, and potentially forces unsafe into userspace code due to the underlying storage being type erased. We do have an internal iteration version, as mentioned before, and it does properly vectorize. However, its use is heavily discouraged right now as it's not extensible nor idiomatic. We were hoping to close this perf gap and remove the internal iteration option.

Rageking8 commented 1 year ago

@rustbot label +A-codegen +I-slow

the8472 commented 1 year ago

Query::for_each

This isn't an iterator method. You'll want to override the default try_fold and fold impls on your Iterator, similar to what various iterators in core do. That way your users will be able to use iterators and benefit from internal iteration when they call iterator methods that use internal iteration. That'll compose and provide better performance.

james7132 commented 1 year ago

This isn't an iterator method.

The intent was largely for UX (.iter().for_each() takes more to write out and screws with formatting) while also providing improved performance via internal iteration.

You'll want to override the default try_fold and fold impls on your Iterator, similar to what various iterators in core do.

Is it possible to implement those functions in stable Rust right now? I'm getting an error saying that try_trait_v2 is unstable.

Again ideally we wouldn't need this and the external iteration option optimizes to the same result. This isn't what I would call a zero-cost abstraction if users need to make a conscious choice to choose one over the other purely for performance purposes.

the8472 commented 1 year ago

Well, at least fold should be possible to implement on stable rust, which will be used by for_each and some other methods.

Again ideally we wouldn't need this and the external iteration option optimizes to the same result. This isn't what I would call a zero-cost abstraction if users need to make a conscious choice to choose one over the other purely for performance purposes.

Yes, that's a long-known issue (see the blog post linked previously) but there aren't any good plans how to fix it.

rust-lang / rust

flat_map fails to autovectorize internal iterators #104914

Meta