bradleyharden opened this issue 3 years ago
Perhaps it's because of those cycle() calls?
> Unfortunately, this is about the limit of my debugging ability currently. With some guidance, I might be able to do more of the legwork on my own.
Try compiling the benchmarks with debuginfo = 2 and then running them under perf record --call-graph dwarf and perf report (or some GUI equivalent). That can get you the assembly + corresponding source lines + the whole call tree including inlining + information on which instructions take most of the time. The usefulness of source annotation varies, though, due to the heavy optimizations.
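For example, something like this (assuming a Linux perf setup; note that the Cargo profile key is debug, which maps to rustc's -C debuginfo, and the bench binary path is a placeholder):

# In Cargo.toml, give benchmark builds full debuginfo:
[profile.bench]
debug = 2

# Build the benchmarks without running them, then profile:
$ cargo bench --no-run
$ perf record --call-graph dwarf target/release/deps/<bench-binary> --bench
$ perf report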
> Perhaps it's because of those cycle() calls?
Possible. zip has a specialized implementation that applies when both iterators implement TrustedRandomAccess, which Cycle currently doesn't but probably could if the underlying one does.

Edit: never mind, it can't implement that because TRA must have a finite size, which Cycle doesn't have.
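You can see why from the size hints (a quick check against the std implementation):

fn main() {
    let lut = [0i16, 1, 2, 3];
    assert_eq!(lut.iter().size_hint(), (4, Some(4)));
    // Cycling reports no finite upper bound, so the length can never be
    // "exact" (upper bound == lower bound) the way TrustedRandomAccess requires.
    assert_eq!(lut.iter().cycle().size_hint(), (usize::MAX, None));
}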
I've found zip to be the main cause of slowness, because it has to check on every item whether the first iterator has terminated, and it has trouble reasoning about the lengths of iterators.
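Without the specialization, Zip::next is essentially the following (a simplified stand-in, not the exact std source):

struct Zip<A, B> {
    a: A,
    b: B,
}

impl<A: Iterator, B: Iterator> Iterator for Zip<A, B> {
    type Item = (A::Item, B::Item);

    fn next(&mut self) -> Option<Self::Item> {
        let x = self.a.next()?; // branch: is the first iterator done?
        let y = self.b.next()?; // branch: is the second iterator done?
        Some((x, y))
    }
}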
@JulianKnodt and @the8472, if zip + cycle is the likely culprit and there's nothing that can be done to improve it, is the best solution a manual implementation, like I did? Is it still worth it for me to dig deeper into debugging?
This is a purely personal POV, so take it as you will, but I think the manual implementation is certainly fine; it's not always worth building very neat abstractions given the increased burden on the compiler. That being said, there are probably specializations that could be added to make zip & cycle better. So if you're really focused on implementing the FFT, you probably want to stick with the manual impl, but if you're doing it as an exercise, it could be good to look at the library implementations of those adapters and see if anything can be added to speed them up.
For the particular use case at hand, I think it probably makes sense to stick with the manual implementation. I need to calculate the FFT coefficients for a single frequency in an embedded context with a real-time deadline.
However, before I discovered the poor performance, I remember being impressed by this specific iteration example, because I was able to express the entire problem in terms of iterator adapters (cycle, skip and zip). I was excited to share it with some of my coworkers as a great example of the zero-cost abstractions that Rust could offer. But then I found out it's not so zero-cost, which is disappointing.
I might look into it more, but I don't know that I really have the expertise to improve the implementation. I would think these issues have already been considered in depth by people with a lot more experience in compilers.
> I would think these issues have already been considered in depth by people with a lot more experience in compilers.
Very smart and dedicated people have worked on the stdlib, but the stdlib is neither perfect nor complete. If you see a possible hole, then see if it's worth filling, if you have time to do it :-) Perhaps cycle+zip is worth improving.
I played around with the example just now and found something interesting. Apparently zip and cycle are not the problem; it's actually skip. Removing skip, or giving it an argument of 0, is enough to erase the performance differences.
Based on that, does anyone have any immediate insight into what the problem might be?
Looks like it's the interaction between cycle and skip. Interesting.
This compiles to a constant:

// Assuming this LUT definition (it matches the one used later in the thread):
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];
const LUT_LEN: usize = LUT.len();

pub fn skip_manual() -> i16 {
    let mut sum = 0;
    for i in 0..LUT_LEN {
        sum += LUT[(i + 2) % LUT_LEN];
    }
    sum
}
But this does not:

pub fn skip() -> i16 {
    LUT.iter().cycle().skip(2).take(8).sum()
}
Yet this does:

pub fn skip() -> i16 {
    LUT.iter().skip(2).sum()
}
Interesting. I think skip + iteration still adds some complexity, but I imagine having a skip of 0 means that a bunch of branches can be compiled out of the impl. It might be because cycle doesn't implement any traits like TrustedLen etc., and skip propagates those impls, & zip then uses those for specialization. @the8472 where does it mention that TRA needs finite size? I looked for it briefly but couldn't find it.
You can also remove the skip just by iterating through eagerly, and it removes some branches:

pub fn fft(samples: &[i16; 16]) -> (i32, i32) {
    let sin = LUT.iter().cycle();
    let mut cos = LUT.iter().cycle();
    cos.nth(LUT.len() / 4);
    // ...
}
Can someone explain what's going on here? On both Godbolt and with cargo-asm, I'm getting strange behavior.

In this example, skip, cycle_take and cycle_nth_take all compile to the same function, which simply returns a constant. However, if you uncomment cycle_skip_take, then cycle_nth_take no longer compiles to a constant. How can that be? Is that some weird artifact of both Godbolt and cargo-asm? Or is it a real bug?
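For reference, the functions in question are along these lines (a reconstruction from the names above; the exact Godbolt source isn't reproduced here):

const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];

pub fn skip() -> i16 {
    LUT.iter().skip(2).sum()
}

pub fn cycle_take() -> i16 {
    LUT.iter().cycle().take(8).sum()
}

pub fn cycle_nth_take() -> i16 {
    let mut it = LUT.iter().cycle();
    it.nth(1); // consume the first two elements
    it.take(8).sum()
}

// pub fn cycle_skip_take() -> i16 {
//     LUT.iter().cycle().skip(2).take(8).sum()
// }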
Another approach. I haven't benchmarked it, but on O3 it gives branch-free assembly consisting only of memory and simple arithmetic operations, similar to @JulianKnodt's approach with nth. O2 doesn't look as rosy, producing a loop and two multiplications.
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];

pub fn fft(samples: &[i16; 16]) -> (i32, i32) {
    let sin = (0..).map(|i| LUT[i % LUT.len()]);
    let cos = ((LUT.len() / 4)..).map(|i| LUT[i % LUT.len()]);
    sin.zip(cos).zip(samples).fold(
        (0, 0),
        |(mut real, mut imag), ((sin, cos), &sample)| {
            real += sample as i32 * (cos as i32);
            imag += sample as i32 * (sin as i32);
            (real, imag)
        },
    )
}
In this case skip can actually be used for cos, at least on O3.
@JulianKnodt

> @the8472 where does it mention that TRA needs finite size? I looked for it briefly but couldn't find it

It can only be exact (upper bound == lower bound) if an upper bound exists, which means the length needs to fit into a usize.
@bradleyharden

> How can that be? Is that some weird artifact of both Godbolt and cargo-asm? Or is it a real bug?

I would guess that LLVM is simply making different inlining decisions based on the number of calls to one of the functions in the iterator chain, and duplicating parts of that code goes over a threshold. Iterator optimizations are heavily dependent on inlining.
> I would guess that LLVM is simply making different inlining decisions based on the number of calls to one of the functions in the iterator chain, and duplicating parts of that code goes over a threshold. Iterator optimizations are heavily dependent on inlining.
@the8472, I just want to clarify, in case there was any miscommunication. In that example the same function gives two different results based on the presence or absence of a different function. It doesn't seem like inlining could explain that. It seems more like it's skipping an optimization pass when another function is present, or something.
> @the8472, I just want to clarify, in case there was any miscommunication. In that example the same function gives two different results based on the presence or absence of a different function. It doesn't seem like inlining could explain that. It seems more like it's skipping an optimization pass when another function is present, or something.
That seems very similar to a situation I've encountered when debugging https://github.com/rust-lang/rust/issues/79246: some code was not inlined in the same way depending on just the number of symbols present, or even on some unrelated variable being declared.
@marmeladema, do you have a minimal example of that? Do you think it's worth making a separate issue about it? I can add my example from above.
@the8472, I think the skip/cycle interaction is worth investigating more. But for now, I wanted to report that moving to nth was able to resolve my issue in the actual embedded context. Previously, the iterator approach was taking 1175 clock cycles, compared to 141 for the manual approach. After switching from skip to nth, both approaches now give 141 clock cycles.
pub fn fft<N>(samples: &GenericArray<i16, Prod<N, LutLen>>) -> (I16F16, I16F16)
where
    N: Mul<LutLen> + Unsigned,
    Prod<N, LutLen>: ArrayLength<i16>,
{
    const ZERO: I16F16 = I16F16::from_bits(0);
    let sin = LUT.iter().cycle();
    let mut cos = LUT.iter().cycle();
    cos.nth(LUT.len() / 4 - 1); // Use nth, not skip, so the iterator is optimized
    sin.zip(cos).zip(samples).fold(
        (ZERO, ZERO),
        |(mut real, mut imag), ((&sin, &cos), &sample)| {
            let sample = I1F15::from_bits(sample);
            real += I16F16::from_num(sample.wide_mul(cos));
            imag += I16F16::from_num(sample.wide_mul(sin));
            (real, imag)
        },
    )
}
@bradleyharden

> I just want to clarify, in case there was any miscommunication. In that example the same function gives two different results based on the presence or absence of a different function. It doesn't seem like inlining could explain that. It seems more like it's skipping an optimization pass when another function is present, or something.

I was talking about inlining of the callees. If you uncomment the function, the iterator methods are used in more places, which can affect inlining decisions.
To help demonstrate what was said above: adding #[inline] above fn cycle_nth_take() will compile it to a constant even with fn cycle_skip_take() uncommented, which can be seen if you add a call site for it elsewhere.
> I just want to clarify, in case there was any miscommunication. In that example the same function gives two different results based on the presence or absence of a different function. It doesn't seem like inlining could explain that. It seems more like it's skipping an optimization pass when another function is present, or something.
@bradleyharden I suspect that is caused by the way that codegen unit (CGU) partitioning ends up happening. My guess is that the presence/absence of the other functions ends up changing which functions get placed in the same CGU, which may in turn affect which optimizations LLVM decides to perform.
Can you try setting codegen-units = 1 under [profile.release] in your Cargo.toml?
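That is:

[profile.release]
codegen-units = 1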
@JulianKnodt, adding #[inline] doesn't make a difference, but I think I can explain why you were seeing a difference. If I add #[inline] to both cycle_nth_take() and cycle_skip_take(), then both functions disappear, as expected. If I then add a call site for cycle_nth_take() but not for cycle_skip_take(), the compiler is able to compile it to a constant. But if I add a call site for cycle_skip_take() as well, I can observe the same behavior; cycle_nth_take() will no longer compile to a constant.
@Aaron1011, adding codegen-units = 1 makes no difference to any of this behavior.
Is this expected behavior? Or should I create a separate issue about it? I'm still a bit skeptical of @the8472's callee-inlining explanation. cycle_nth_take() is able to compile to a constant both on its own and in the presence of skip() and cycle_take(). It's only when cycle_skip_take() is present and compiled that it stops compiling to a constant.
Both .skip() and .cycle() have a branch within their .next() methods. It seems like the compiler is having trouble propagating its compile-time knowledge through these branch points.
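For instance, Cycle's next() is essentially this (a simplified stand-in, not the exact std source):

struct Cycle<I> {
    orig: I,
    iter: I,
}

impl<I: Clone + Iterator> Iterator for Cycle<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        match self.iter.next() {
            None => {
                // branch: the inner iterator is exhausted, restart from a clone
                self.iter = self.orig.clone();
                self.iter.next()
            }
            y => y,
        }
    }
}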
Here's one idea I had to improve it. Right now, .skip() stores its argument n as a usize, and it uses 0 as a sentinel value to indicate when it has already skipped ahead.
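For reference, its next() looks roughly like this (a simplified sketch of the std source at the time, not verbatim):

struct Skip<I> {
    iter: I,
    n: usize,
}

impl<I: Iterator> Iterator for Skip<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        if self.n == 0 {
            self.iter.next()
        } else {
            let old_n = self.n;
            self.n = 0; // 0 doubles as "already skipped"
            self.iter.nth(old_n)
        }
    }
}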
Instead, I thought the compiler might have an easier time if we separated the boolean (need to skip vs. already skipped) from the number of elements to skip. I created a CustomSkip struct that stores the argument as an Option<usize> instead. Here is my implementation:
pub struct CustomSkip<I> {
    iter: I,
    n: Option<usize>,
}

impl<I> CustomSkip<I> {
    pub fn new(iter: I, n: usize) -> CustomSkip<I> {
        CustomSkip { iter, n: Some(n) }
    }
}

impl<I: Iterator> Iterator for CustomSkip<I> {
    type Item = I::Item;

    #[inline]
    fn next(&mut self) -> Option<I::Item> {
        if let Some(n) = self.n.take() {
            // first call: skip ahead and return the element after the skip
            self.iter.nth(n)
        } else {
            self.iter.next()
        }
    }
}

trait SkipExt: Iterator + Sized {
    fn custom_skip(self, n: usize) -> CustomSkip<Self> {
        CustomSkip::new(self, n)
    }
}

impl<I: Iterator> SkipExt for I {}
Indeed, this does seem to improve all of the examples discussed so far. But it only improves the original fft example slightly, by less than a factor of 2. And using .nth() on the fft cos iterator still produces much better results than using .skip() or .custom_skip().
Another approach I have seen in some places in std is unrolling the first loop iteration so that the compiler has an easier time with the loop for the remainder. That could be tried by implementing fold in Zip.
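Something like this, written as a free function for illustration (a sketch of the idea, not actual std code; in std it would be an override of Zip::fold):

// Peel off the first iteration so that skip/cycle's internal branches are
// resolved before the compiler reasons about the main loop.
fn fold_unrolled<I, B, F>(mut it: I, init: B, mut f: F) -> B
where
    I: Iterator,
    F: FnMut(B, I::Item) -> B,
{
    let mut acc = init;
    if let Some(first) = it.next() {
        acc = f(acc, first); // first iteration, outside the loop
        while let Some(x) = it.next() {
            acc = f(acc, x);
        }
    }
    acc
}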
@bradleyharden do you think you could see if

#[inline]
fn next(&mut self) -> Option<I::Item> {
    // `unlikely` here is the nightly intrinsic (core::intrinsics::unlikely)
    if unlikely(self.n > 0) {
        self.iter.nth(self.n - 1);
        self.n = 0;
    }
    self.iter.next()
}

with n: usize works as well for your case as the n: Option<usize>? The asm looks better than the original.
I'm not sure if this qualifies as a bug or not.

In short, the following two functions give the same output, but fft performs about 10x worse than fft_manual.

I originally ran into this in an embedded context, targeting thumbv7em-none-eabihf. The original function took a GenericArray and used fixed-point types from the fixed crate, but I was able to reproduce the issue using only primitive types targeting x86_64-pc-windows-msvc.

Here are the results of benchmarks from selected nightly versions from the past 18 months. The benchmarks were run using the code here.

test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
test tests::bench_fft_manual ... bench: 4 ns/iter (+/- 2)
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
(with #![feature(const_slice_len)])
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)

I took a look at the output of cargo asm for each version. The output of fft has 28 branch instructions, while the output of fft_manual has none. I guess the compiler is able to completely unroll the loop when it's written manually, but not when it's written using Iterators.

Unfortunately, this is about the limit of my debugging ability currently. With some guidance, I might be able to do more of the legwork on my own.

Is this a known/expected performance difference? I hope not. I was really impressed by the elegance of the Iterator-based approach. I would be disappointed if the performance is really that bad.