bradleyharden opened this issue 3 years ago
Perhaps it's because of those cycle() calls?
> Unfortunately, this is about the limit of my debugging ability currently. With some guidance, I might be able to do more of the legwork on my own.
Try compiling the benchmarks with debuginfo = 2 and then running them under perf record --call-graph dwarf and perf report (or some GUI equivalent). That can get you the assembly + corresponding source lines + the whole call tree including inlining + information on which instructions take most of the time. The usefulness of source annotation varies, though, due to the heavy optimizations.
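For example, something like this (assuming a Linux perf setup; note that the Cargo profile key is debug, which maps to rustc's -C debuginfo, and the bench binary path is a placeholder):

# In Cargo.toml, give benchmark builds full debuginfo:
[profile.bench]
debug = 2

# Build the benchmarks without running them, then profile:
$ cargo bench --no-run
$ perf record --call-graph dwarf target/release/deps/<bench-binary> --bench
$ perf report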
> Perhaps it's because of those cycle() calls?
Possible. zip has a specialized implementation that applies when both iterators implement TrustedRandomAccess, which Cycle currently doesn't but probably could if the underlying one does.

Edit: never mind, it can't implement that because TRA must have a finite size, which Cycle doesn't have.
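You can see why from the size hints (a quick check against the std implementation):

fn main() {
    let lut = [0i16, 1, 2, 3];
    assert_eq!(lut.iter().size_hint(), (4, Some(4)));
    // Cycling reports no finite upper bound, so the length can never be
    // "exact" (upper bound == lower bound) the way TrustedRandomAccess requires.
    assert_eq!(lut.iter().cycle().size_hint(), (usize::MAX, None));
}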
I've found zip to be the main cause of slowness, because it has to check on every item whether the first iterator has terminated, and it has trouble reasoning about the lengths of iterators.
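Without the specialization, Zip::next is essentially the following (a simplified stand-in, not the exact std source):

struct Zip<A, B> {
    a: A,
    b: B,
}

impl<A: Iterator, B: Iterator> Iterator for Zip<A, B> {
    type Item = (A::Item, B::Item);

    fn next(&mut self) -> Option<Self::Item> {
        let x = self.a.next()?; // branch: is the first iterator done?
        let y = self.b.next()?; // branch: is the second iterator done?
        Some((x, y))
    }
}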
@JulianKnodt and @the8472, if zip + cycle is the likely culprit and there's nothing that can be done to improve it, is the best solution a manual implementation, like I did? Is it still worth it for me to dig deeper into debugging?
This is a purely personal POV, so take it as you will, but I think the manual implementation is certainly fine; it's not always worth building very neat abstractions given the increased burden on the compiler. That being said, there are probably specializations that could be added to make zip & cycle better. So if you're really focused on implementing the FFT, you probably want to stick with the manual impl, but if you're doing it as an exercise, it could be good to look at the library implementations of those adapters and see if anything can be added to speed them up.
For the particular use case at hand, I think it probably makes sense to stick with the manual implementation. I need to calculate the FFT coefficients for a single frequency in an embedded context with a real-time deadline.
However, before I discovered the poor performance, I remember being impressed by this specific iteration example, because I was able to express the entire problem in terms of iterator adapters (cycle, skip and zip). I was excited to share it with some of my coworkers as a great example of the zero-cost abstractions that Rust could offer. But then I found out it's not so zero-cost, which is disappointing.
I might look into it more, but I don't know that I really have the expertise to improve the implementation. I would think these issues have already been considered in depth by people with a lot more experience in compilers.
> I would think these issues have already been considered in depth by people with a lot more experience in compilers.
Very smart and dedicated people have worked on the stdlib, but the stdlib is neither perfect nor complete. If you see a possible hole, then see if it's worth filling, if you have time to do it :-) Perhaps cycle+zip is worth improving.
I played around with the example just now and found something interesting. Apparently zip and cycle are not the problem; it's actually skip. Removing skip, or giving it an argument of 0, is enough to erase the performance differences.
Based on that, does anyone have any immediate insight into what the problem might be?
Looks like it's the interaction between cycle and skip. Interesting.
This compiles to a constant:

// Assuming this LUT definition (it matches the one used later in the thread):
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];
const LUT_LEN: usize = LUT.len();

pub fn skip_manual() -> i16 {
    let mut sum = 0;
    for i in 0..LUT_LEN {
        sum += LUT[(i + 2) % LUT_LEN];
    }
    sum
}
But this does not:

pub fn skip() -> i16 {
    LUT.iter().cycle().skip(2).take(8).sum()
}
Yet this does:

pub fn skip() -> i16 {
    LUT.iter().skip(2).sum()
}
Interesting. I think skip + iteration still adds some complexity, but I imagine having a skip of 0 means that a bunch of branches can be compiled out of the impl. It might be because cycle doesn't implement any traits like TrustedLen etc., and skip propagates those impls, & zip then uses those for specialization. @the8472 where does it mention that TRA needs finite size? I looked for it briefly but couldn't find it.
You can also remove the skip just by iterating through eagerly, and it removes some branches:

pub fn fft(samples: &[i16; 16]) -> (i32, i32) {
    let sin = LUT.iter().cycle();
    let mut cos = LUT.iter().cycle();
    cos.nth(LUT.len() / 4);
    // ...
}
Can someone explain what's going on here? On both Godbolt and with cargo-asm, I'm getting strange behavior.

In this example, skip, cycle_take and cycle_nth_take all compile to the same function, which simply returns a constant. However, if you uncomment cycle_skip_take, then cycle_nth_take no longer compiles to a constant. How can that be? Is that some weird artifact of both Godbolt and cargo-asm? Or is it a real bug?
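For reference, the functions in question are along these lines (a reconstruction from the names above; the exact Godbolt source isn't reproduced here):

const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];

pub fn skip() -> i16 {
    LUT.iter().skip(2).sum()
}

pub fn cycle_take() -> i16 {
    LUT.iter().cycle().take(8).sum()
}

pub fn cycle_nth_take() -> i16 {
    let mut it = LUT.iter().cycle();
    it.nth(1); // consume the first two elements
    it.take(8).sum()
}

// pub fn cycle_skip_take() -> i16 {
//     LUT.iter().cycle().skip(2).take(8).sum()
// }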
Another approach. I haven't benchmarked it, but on O3 it gives branch-free assembly consisting only of memory and simple arithmetic operations, similar to @JulianKnodt's approach with nth. O2 doesn't look as rosy, producing a loop and two multiplications.
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];

pub fn fft(samples: &[i16; 16]) -> (i32, i32) {
    let sin = (0..).map(|i| LUT[i % LUT.len()]);
    let cos = ((LUT.len() / 4)..).map(|i| LUT[i % LUT.len()]);
    sin.zip(cos).zip(samples).fold(
        (0, 0),
        |(mut real, mut imag), ((sin, cos), &sample)| {
            real += sample as i32 * (cos as i32);
            imag += sample as i32 * (sin as i32);
            (real, imag)
        },
    )
}
In this case skip can actually be used for cos, at least on O3.
@JulianKnodt

> @the8472 where does it mention that TRA needs finite size? I looked for it briefly but couldn't find it

It can only be exact (upper bound == lower bound) if an upper bound exists, which means the length needs to fit into a usize.
@bradleyharden

> How can that be? Is that some weird artifact of both Godbolt and cargo-asm? Or is it a real bug?

I would guess that LLVM is simply making different inlining decisions based on the number of calls to one of the functions in the iterator chain, and duplicating parts of that code goes over a threshold. Iterator optimizations are heavily dependent on inlining.
> I would guess that LLVM is simply making different inlining decisions based on the number of calls to one of the functions in the iterator chain, and duplicating parts of that code goes over a threshold. Iterator optimizations are heavily dependent on inlining.
@the8472, I just want to clarify, in case there was any miscommunication. In that example the same function gives two different results based on the presence or absence of a different function. It doesn't seem like inlining could explain that. It seems more like it's skipping an optimization pass when another function is present, or something.
> @the8472, I just want to clarify, in case there was any miscommunication. In that example the same function gives two different results based on the presence or absence of a different function. It doesn't seem like inlining could explain that. It seems more like it's skipping an optimization pass when another function is present, or something.
That seems very similar to a situation I've encountered when debugging https://github.com/rust-lang/rust/issues/79246: some code was not inlined in the same way depending on just the number of symbols present, or even on some unrelated variable being declared.
@marmeladema, do you have a minimal example of that? Do you think it's worth making a separate issue about it? I can add my example from above.
@the8472, I think the skip/cycle interaction is worth investigating more. But for now, I wanted to report that moving to nth was able to resolve my issue in the actual embedded context. Previously, the iterator approach was taking 1175 clock cycles, compared to 141 for the manual approach. After switching from skip to nth, both approaches now give 141 clock cycles.
pub fn fft<N>(samples: &GenericArray<i16, Prod<N, LutLen>>) -> (I16F16, I16F16)
where
    N: Mul<LutLen> + Unsigned,
    Prod<N, LutLen>: ArrayLength<i16>,
{
    const ZERO: I16F16 = I16F16::from_bits(0);
    let sin = LUT.iter().cycle();
    let mut cos = LUT.iter().cycle();
    cos.nth(LUT.len() / 4 - 1); // Use nth, not skip, so the iterator is optimized
    sin.zip(cos).zip(samples).fold(
        (ZERO, ZERO),
        |(mut real, mut imag), ((&sin, &cos), &sample)| {
            let sample = I1F15::from_bits(sample);
            real += I16F16::from_num(sample.wide_mul(cos));
            imag += I16F16::from_num(sample.wide_mul(sin));
            (real, imag)
        },
    )
}
@bradleyharden

> I just want to clarify, in case there was any miscommunication. In that example the same function gives two different results based on the presence or absence of a different function. It doesn't seem like inlining could explain that. It seems more like it's skipping an optimization pass when another function is present, or something.

I was talking about inlining of the callees. If you uncomment the function, the iterator methods are used in more places, which can affect inlining decisions.
To help demonstrate what was said above: adding #[inline] above fn cycle_nth_take() will compile it to a constant even with fn cycle_skip_take() uncommented, which can be seen if you add a call site for it elsewhere.
> I just want to clarify, in case there was any miscommunication. In that example the same function gives two different results based on the presence or absence of a different function. It doesn't seem like inlining could explain that. It seems more like it's skipping an optimization pass when another function is present, or something.
@bradleyharden I suspect that is caused by the way that codegen unit (CGU) partitioning ends up happening. My guess is that the presence/absence of the other functions ends up changing which functions get placed in the same CGU, which may in turn affect which optimizations LLVM decides to perform.
Can you try setting codegen-units = 1 under [profile.release] in your Cargo.toml?
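That is:

[profile.release]
codegen-units = 1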
@JulianKnodt, adding #[inline] doesn't make a difference, but I think I can explain why you were seeing a difference. If I add #[inline] to both cycle_nth_take() and cycle_skip_take(), then both functions disappear, as expected. If I then add a call site for cycle_nth_take() but not for cycle_skip_take(), the compiler is able to compile it to a constant. But if I add a call site for cycle_skip_take() as well, I can observe the same behavior; cycle_nth_take() will no longer compile to a constant.
@Aaron1011, adding codegen-units = 1 makes no difference to any of this behavior.
Is this expected behavior? Or should I create a separate issue about it? I'm still a bit skeptical of @the8472's callee-inlining explanation. cycle_nth_take() is able to compile to a constant both on its own and in the presence of skip() and cycle_take(). It's only when cycle_skip_take() is present and compiled that it stops compiling to a constant.
Both .skip() and .cycle() have a branch within their .next() methods. It seems like the compiler is having trouble propagating its compile-time knowledge through these branch points.
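For instance, Cycle's next() is essentially this (a simplified stand-in, not the exact std source):

struct Cycle<I> {
    orig: I,
    iter: I,
}

impl<I: Clone + Iterator> Iterator for Cycle<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        match self.iter.next() {
            None => {
                // branch: the inner iterator is exhausted, restart from a clone
                self.iter = self.orig.clone();
                self.iter.next()
            }
            y => y,
        }
    }
}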
Here's one idea I had to improve it. Right now, .skip() stores its argument n as a usize, and it uses 0 as a sentinel value to indicate when it has already skipped ahead.
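For reference, its next() looks roughly like this (a simplified sketch of the std source at the time, not verbatim):

struct Skip<I> {
    iter: I,
    n: usize,
}

impl<I: Iterator> Iterator for Skip<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        if self.n == 0 {
            self.iter.next()
        } else {
            let old_n = self.n;
            self.n = 0; // 0 doubles as "already skipped"
            self.iter.nth(old_n)
        }
    }
}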
Instead, I thought the compiler might have an easier time if we separated the boolean (need to skip vs. already skipped) from the number of elements to skip. I created a CustomSkip struct that stores the argument as an Option<usize> instead. Here is my implementation:
pub struct CustomSkip<I> {
    iter: I,
    n: Option<usize>,
}

impl<I> CustomSkip<I> {
    pub fn new(iter: I, n: usize) -> CustomSkip<I> {
        CustomSkip { iter, n: Some(n) }
    }
}

impl<I: Iterator> Iterator for CustomSkip<I> {
    type Item = I::Item;

    #[inline]
    fn next(&mut self) -> Option<I::Item> {
        if let Some(n) = self.n.take() {
            // first call: skip ahead and return the element after the skip
            self.iter.nth(n)
        } else {
            self.iter.next()
        }
    }
}

trait SkipExt: Iterator + Sized {
    fn custom_skip(self, n: usize) -> CustomSkip<Self> {
        CustomSkip::new(self, n)
    }
}

impl<I: Iterator> SkipExt for I {}
Indeed, this does seem to improve all of the examples discussed so far. But it only improves the original fft example slightly, by less than a factor of 2. And using .nth() on the fft cos iterator still produces much better results than using .skip() or .custom_skip().
Another approach I have seen in some places in std is unrolling the first loop iteration so that the compiler has an easier time with the loop for the remainder. That could be tried by implementing fold in Zip.
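Something like this, written as a free function for illustration (a sketch of the idea, not actual std code; in std it would be an override of Zip::fold):

// Peel off the first iteration so that skip/cycle's internal branches are
// resolved before the compiler reasons about the main loop.
fn fold_unrolled<I, B, F>(mut it: I, init: B, mut f: F) -> B
where
    I: Iterator,
    F: FnMut(B, I::Item) -> B,
{
    let mut acc = init;
    if let Some(first) = it.next() {
        acc = f(acc, first); // first iteration, outside the loop
        while let Some(x) = it.next() {
            acc = f(acc, x);
        }
    }
    acc
}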
@bradleyharden do you think you could see if

#[inline]
fn next(&mut self) -> Option<I::Item> {
    // `unlikely` here is the nightly intrinsic (core::intrinsics::unlikely)
    if unlikely(self.n > 0) {
        self.iter.nth(self.n - 1);
        self.n = 0;
    }
    self.iter.next()
}

with n: usize works as well for your case as the n: Option<usize>? The asm looks better than the original.
I'm not sure if this qualifies as a bug or not.

In short, the following two functions give the same output, but fft performs about 10x worse than fft_manual.

I originally ran into this in an embedded context, targeting thumbv7em-none-eabihf. The original function took a GenericArray and used fixed-point types from the fixed crate, but I was able to reproduce the issue using only primitive types targeting x86_64-pc-windows-msvc.

Here are the results of benchmarks from selected nightly versions from the past 18 months. The benchmarks were run using the code here.

test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
test tests::bench_fft_manual ... bench: 4 ns/iter (+/- 2)
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
(with #![feature(const_slice_len)])
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)

I took a look at the output of cargo asm for each version. The output of fft has 28 branch instructions, while the output of fft_manual has none. I guess the compiler is able to completely unroll the loop when it's written manually, but not when it's written using Iterators.

Unfortunately, this is about the limit of my debugging ability currently. With some guidance, I might be able to do more of the legwork on my own.

Is this a known/expected performance difference? I hope not. I was really impressed by the elegance of the Iterator-based approach. I would be disappointed if the performance is really that bad.