rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Binary size inflation after rustc 1.72 #127471

Closed oleid closed 1 week ago

oleid commented 1 week ago

Code

I tried this code (based on some coding competition). I used extern "C" to hide the actual implementations of the functions, but that does not seem to affect the codegen. Transition2 and Transition3 are internal functions and an implementation detail of Transition1.

#![allow(non_snake_case)]

#[derive(Clone, Copy, PartialEq)]
#[repr(C)]
pub enum State {
    DummyState1,
    DummyState2,
    DummyState3
}

extern "C" {
    fn f1(s: State) -> State;
    fn f2(s: State) -> State;
    fn f3(s: State) -> State;
    fn c1(s: State) -> bool;
    fn c2(s: State) -> bool;
    fn c22(s: State) -> bool;
    fn c3(s: State) -> bool;
}

pub unsafe fn Transition1(mut s: State) -> State {
    s = f1(s);
    if c1(s) {
        Transition3(s)
    } else {
        Transition2(s)
    }
}

unsafe fn Transition2(mut s: State) -> State {
    s = f2(s);
    if c2(s) {
        Transition1(s)
    } else if c22(s) {
        s
    } else {
        Transition3(s)
    }
}

unsafe fn Transition3(mut s: State) -> State {
    s = f3(s);
    if c3(s) {
        Transition2(s)
    } else {
        s
    }
}

Cf. https://godbolt.org/z/hhsx3hd3v

I expected to see this happen:

I expected the size of the generated assembly to be about the same as that of the C version of the same code (https://godbolt.org/z/nM94GKz6j), and about the same as with the last well-behaved rustc version (v1.72). The C version is around 60 lines of assembly; the Rust version has about 50.

Instead, this happened:

The generated assembly grows to nearly three times the size: v1.72 produces ~52 lines, while v1.73 inflates that to ~150 lines.

Version it worked on

The assembly size is good for rustc v1.72 and for the nightly of rustc_codegen_gcc.

Thus, I suspect it has something to do with the rustc<->LLVM interaction; as I mentioned, clang is fine.

Version with regression

I stepped through the rustc versions on godbolt, and no version after v1.72 (up to v1.79, beta and nightly) matches the previous assembly size. All yield output ~3 times bigger than the original.

saethlin commented 1 week ago

I used extern "C" to hide the actual implementation of the functions. But that seems not to affect the codegen.

It definitely affects the codegen, but this point is entirely irrelevant.

The call graph here is really complicated; all these functions are recursive at some call depth. Between 1.72 and 1.73, LLVM started doing more inlining. Sometimes this optimization is highly beneficial, but you seem to only care about code size.

So you should use the existing optimization flags to tell the compiler that. If you set -Copt-level=s or -Copt-level=z you will get the optimization you want. https://doc.rust-lang.org/cargo/reference/profiles.html#opt-level
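For a Cargo project, that preference can be expressed in the build profile. A minimal sketch, following the Cargo profile documentation linked above:

```toml
# Optimize release builds for size rather than speed.
# "s" optimizes for size; "z" optimizes for size even more aggressively
# (it additionally turns off loop vectorization).
[profile.release]
opt-level = "z"
```

For a bare `rustc` invocation (as on godbolt), the equivalent is passing `-Copt-level=z` directly.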

I don't think this qualifies as a bug or a regression but I'll let others weigh in. If you use one of the optimization levels to tell the compiler about your code size preference, there is no regression.

oleid commented 1 week ago

@saethlin I guess what motivated me to report this as an issue is that the corresponding clang output is a lot smaller than the output of rustc > 1.72. Also, some embedded code needs to be compiled with -O3 to work properly, at least in the espressif ecosystem. But mostly I wanted to make you folks aware, since I cannot decide whether this is a rare edge case or happens more often. I figured that in async contexts, state transitions like these could occur more often.

saethlin commented 1 week ago

some embedded code needs to get compiled with -O3 to work properly, at least in the espressif ecosystem

Is there an issue or issues tracking that? Specific examples would be very informative. I've seen a lot of vague comments like this in the past about embedded development, and we'd much prefer to help rather than have people hacking around the compiler; but without you explaining what's going on, we cannot help.

We're not going to make changes to -Copt-level=3 that favor code size unless someone can demonstrate that such changes don't harm runtime performance. The entire point of that flag is to trade off other properties including code size for runtime performance.

workingjubilee commented 1 week ago

Unless "works properly" is about performance of the code, code that "needs to get compiled with -O3 to work properly" is a bug in the compiler or the code in question.

oleid commented 1 week ago

Sadly, I don't have any specific issues; I couldn't find any. But I'll ask the respective developers. The trouble with the espressif ecosystem, however, is that the LLVM changes required to build everything are not fully mainlined yet. But maybe they'll get there this year.

We're not going to make changes to -Copt-level=3 that favor code size unless someone can demonstrate that such changes don't harm runtime performance.

That's fair, and it wasn't what I meant. The issue mentioned here was also already observed with -Copt-level=2, for what it is worth.

workingjubilee commented 1 week ago

Sadly, I don't have any specific issues - I couldn't find any. But I'll ask the respective developers. The trouble with the espressif ecosystem, however, is that the llvm changes required to build everything are not fully mainline, yet. But maybe they'll get there this year.

That is only true for the Xtensa targets.

There are non-Xtensa Espressif chips. Further, reasoning about the performance of generated code based on its line count translates extremely poorly across architectures. LLVM has extensive x86-specific optimizations, and many architectures support neither those optimizations in their machine backend lowering nor the raw architectural features that make x86 code sometimes much more compact than that of other architectures (the smaller code size is paid for in higher decoder power consumption, and the instructions are allowed to be inherently more complex).

workingjubilee commented 1 week ago

Basically I am slightly worried that you have filed a bug report that, even if fixed, may not meaningfully address your real concerns. It would be good to be exceedingly specific about what you actually want, even if the report languishes a while waiting on the Xtensa backend being upstreamed.

oleid commented 1 week ago

It would be good to be exceedingly specific what you actually want

I noticed something which might be an issue and wanted you people to be aware of it, so you can decide whether it is one. Like I said initially, this is not production code, just something from a coding challenge.

If you decide it is not a problem, then this is fine with me.

tgross35 commented 1 week ago

It does look kind of weird that it outlines Transition2 or Transition3 now (and ignores #[inline(always)], because the functions are recursive). It doesn't make any sense to me that adding function-call overhead could somehow be faster. But I have a hunch that in situ (where it can inline Transition1) this goes away: https://godbolt.org/z/zxbeGTba4

@oleid are you seeing anything like this propagating to the final binary?

Also, the clang version in your original godbolt link was 8.0, while the latest is 18.1, so it's not exactly apples to apples (they look reasonably similar, though).

oleid commented 1 week ago

For the aforementioned challenge there was no concrete implementation of those functions (so no further optimization was possible). If I provide the implementation that was suggested, I get:

https://godbolt.org/z/8ezYoYorW

There is still a difference in the size of the generated code, but the size increase can IMHO be neglected here.
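To illustrate this point outside godbolt, here is a self-contained sketch. The f*/c* bodies are made up for illustration (they are NOT the implementations suggested in the challenge; they are chosen only so the recursion bottoms out immediately), and the transition functions are renamed to snake_case since they no longer need `unsafe`. With concrete, inlinable bodies, the optimizer can fold the whole transition chain:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(C)]
pub enum State {
    DummyState1,
    DummyState2,
    DummyState3,
}

// Hypothetical stand-ins for the extern "C" functions, chosen so the
// recursion terminates after one hop through transition2.
fn f1(s: State) -> State { s }
fn f2(_s: State) -> State { State::DummyState2 }
fn f3(_s: State) -> State { State::DummyState3 }
fn c1(_s: State) -> bool { false }
fn c2(_s: State) -> bool { false }
fn c22(_s: State) -> bool { true }
fn c3(_s: State) -> bool { false }

// Same control flow as the snippet in the report.
pub fn transition1(mut s: State) -> State {
    s = f1(s);
    if c1(s) { transition3(s) } else { transition2(s) }
}

fn transition2(mut s: State) -> State {
    s = f2(s);
    if c2(s) {
        transition1(s)
    } else if c22(s) {
        s
    } else {
        transition3(s)
    }
}

fn transition3(mut s: State) -> State {
    s = f3(s);
    if c3(s) { transition2(s) } else { s }
}

fn main() {
    // transition1 -> transition2 -> c22 branch, so we end in DummyState2.
    println!("{:?}", transition1(State::DummyState1));
}
```

With these bodies, the entire call graph constant-folds away at -Copt-level=2 or higher, which is consistent with the observation that the size blow-up mainly appears when the implementations are opaque.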

Without the concrete implementation, the size increase is similar to before:

https://godbolt.org/z/5acqqobqo

@tgross35 I hope that answers your question.

After this test, my gut feeling would be to close the regression issue, since typically we have a concrete implementation and the compiler has an easier time generating the code. What do you think?

apiraino commented 1 week ago

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

oleid commented 1 week ago

It would seem Espressif nowadays uses size-optimized code generation by default. So, considering this seems to be an artefact of LLVM optimizing without full information (i.e. with external functions), and you're not sure what to do with the issue, I'm closing it.