Open james7132 opened 1 year ago
#[no_mangle]
fn flat_map(query: &mut [&mut [f32]]) {
let iter = query.iter_mut().flat_map(|segment| segment.iter_mut());
for value in iter {
*value += 1.0;
}
}
That's still external iteration. If you want internal iteration use
fn flat_map(query: &mut [&mut [f32]]) {
query.iter_mut().flat_map(AsMut::as_mut).for_each(|value| *value += 1.0);
}
which does vectorize.
Is there a way for us to get it to vectorize without internal iteration? The aforementioned case in Bevy cannot be separated like flat_map
, and we'd rather encourage use of external iteration as it's generally more idiomatic.
Well, for_each
was more or less introduced for that purpose. The issue is the inner loop in Flatten's next() which makes it hard to optimize external iteration. There are some LLVM passes that are disabled even on O3 and the Polly plugin that can do some additional loop transformations but I assume there's a reason why they're off.
Other than that you could see if you can structure your data to return arrays or an iterator-of-iterators so users could write their own nested loops.
The issue is the inner loop in Flatten's next() which makes it hard to optimize external iteration.
The implementation we have looks very similar.
Other than that you could see if you can structure your data to return arrays or an iterator-of-iterators so users could write their own nested loops.
This unfortunately is a 100% deal breaker since that fundamentally breaks the UX we're aiming to expose, and potentially forces unsafe
into userspace code due to the underlying storage being type erased. We do have an internal iteration version, as mentioned before, and it does properly vectorize. However, its use is heavily discouraged right now as it's not extensible nor idiomatic. We were hoping to close this perf gap and remove the internal iteration option.
@rustbot label +A-codegen +I-slow
This isn't an iterator method. You'll want to override the default try_fold
and fold
impls on your Iterator
, similar to what various iterators in core do. That way your users will be able to use iterators and benefit from internal iteration when they call iterator methods that use internal iteration. That'll compose and provide better performance.
This isn't an iterator method.
The intent was largely for UX (.iter().for_each()
takes more to write out and screws with formatting) while also providing improved performance via internal iteration.
You'll want to override the default try_fold and fold impls on your Iterator, similar to what various iterators in core do.
Is it possible to implement those functions in stable Rust right now? I'm getting an error saying that try_trait_v2
is unstable.
Again ideally we wouldn't need this and the external iteration option optimizes to the same result. This isn't what I would call a zero-cost abstraction if users need to make a conscious choice to choose one over the other purely for performance purposes.
Well, at least fold
should be possible to implement on stable rust, which will be used by for_each
and some other methods.
Again ideally we wouldn't need this and the external iteration option optimizes to the same result. This isn't what I would call a zero-cost abstraction if users need to make a conscious choice to choose one over the other purely for performance purposes.
Yes, that's a long-known issue (see the blog post linked previously) but there aren't any good plans how to fix it.
I expected to see this happen: Both versions of the function to autovectorize.
Instead, this happened: only the normal for-loop version autovectorizes. See https://godbolt.org/z/caTqoq3x3.
Meta
rustc --version --verbose
:Also happens on nightly and for other "flattened" iterators. For example, Bevy's QueryIter and Query::for_each show this exact API split due to this performance difference.
There seems to be #87411 that was targeting something similar.