cbiffle commented 7 months ago

Summary

Hi! We've been seeing some surprising codegen decisions in Hubris programs for a while, and I wanted to surface it upstream in case there's anything to be done.

We build primarily at opt-level="z", though many of these behaviors seem to happen at opt-level="s" too. (z code is very slightly smaller.)

There are a handful of routines in core that rustc often decides not to inline which have significant knock-on effects on LLVM's ability to reason about panics, integer ranges for bounds checks, and the like. The problem primarily arises with:

<Enumerate<core::slice::iter::Iter<T>> as Iterator>>::next - not inlining this appears to hamper the ability of other parts of the compiler to notice that the enumerated indices it's producing are in a predictable range
<Index<usize> for [T]>::index or <Index<Range<usize>> for [T]>::index - the non-inlined version of this does an unconditional bounds check, when in many cases there is information available at the callsite to eliminate it were it inlined
More rarely, but often enough, <core::slice::Iter<T> as Iterator>::next

All of these routines, when inlined, tend to produce either one or zero machine instructions -- one in the case of index which is typically an indexed addressing mode load, zero in the case of Enumerate::next where the information is often already available in the calling frame for other reasons, and generally one in the case of slice::Iter::next.

A concrete example from 1.75.0-ish

Here is a concrete example from one of our current programs.

The originating Rust code is (simplified):

        let mut argsbuf = [0; 8];  // length actually a constant but I've inlined it
        // ... things happen ...
        let arglen = an_inlined_operation();
        // ... things happen ...
        &argsbuf[..arglen]

Here's the ARMv7E-M code as produced by objdump from the output of nightly-2023-12-21 (chosen to be approximately equivalent to 1.75.0).

 80071ee:       2601            movs    r6, #1  ; included for reasons I'll explain in a bit
    ; unrelated things happen...
core::array::<impl core::ops::index::Index<I> for [T; N]>::index:
/rustc/5ac4c8a63ee305742071ac6dd11817f7c24adce2/library/core/src/array/mod.rs:348
 8007202:       4640            mov     r0, r8
 8007204:       2108            movs    r1, #8
 8007206:       4632            mov     r2, r6
 8007208:       f000 fb5b       bl      80078c2 <core::slice::index::<impl core::ops::index::Index<I> for [T]>::index>

r6 here holds the length being produced by the inlined operation; along all code paths the length it produces is either 0, 1, or 8, all literal constants. I've included the simplest one here, which sets it to literal 1.

Given that we are indexing a fixed-length 8-element array normally rustc would be very good about eliding the bounds check, since 0, 1, and 8 are all valid options here.

Instead we have the argument setup and a call to the non-inlined index function, which contains a bounds check and conditional panic:

080078c2 <core::slice::index::<impl core::ops::index::Index<I> for [T]>::index>:
 80078c2:       428a            cmp     r2, r1
 80078c4:       bf9c            itt     ls
 80078c6:       4611            movls   r1, r2
 80078c8:       4770            bxls    lr
 80078ca:       b580            push    {r7, lr}
 80078cc:       466f            mov     r7, sp
 80078ce:       4610            mov     r0, r2
 80078d0:       f7ff ff50       bl      8007774 <core::slice::index::slice_end_index_len_fail>

Impact

In many of our programs, this has a pretty significant impact. In at least one case, opt-level=3 winds up being smaller because of more aggressive inlining causing checks to be elided. Because panic sites are relatively expensive -- we format and record panic messages, so they pull in formatting machinery that might not otherwise be needed -- in programs that otherwise contain no panics, a shift in the inlining decisions around something like index can cause a program's size to grow by 4 kiB pretty easily.

For context, that's a 50-100% size increase on most of our programs.

A possibly clumsy solution

So, for the specific cases I described above -- indexing arrays with usizes or ranges of usizes, and core::slice::Iter -- does it ever make sense to not inline these routines? I'm wondering if folks would be open to me going through and carefully applying some inline(always) attributes in this neighborhood. (This might not be sufficient for the Enumerate<T> case but could definitely be applied to the array/slice Index impls.)

If so, let me know what tests you'd like me to perform on the result. I would obviously measure the impact on various Hubris-based firmware, but perhaps there are other contexts where this could be a bad thing that you'd want me to check.

My LLVM experience is limited and rusty, so there might be a cleverer thing that could be done in the compiler, but I wouldn't know what to suggest.

rustc version

The examples above were generated by:

rustc 1.76.0-nightly (5ac4c8a63 2023-12-20)
binary: rustc
commit-hash: 5ac4c8a63ee305742071ac6dd11817f7c24adce2
commit-date: 2023-12-20
host: x86_64-unknown-linux-gnu
release: 1.76.0-nightly
LLVM version: 17.0.6

but I've been observing this behavior consistently since at least 2021, and just wasn't sure if an upstream bug report made sense.

workingjubilee commented 7 months ago

So, for the specific cases I described above -- indexing arrays with usizes or ranges of usizes, and core::slice::Iter -- does it ever make sense to not inline these routines? I'm wondering if folks would be open to me going through and carefully applying some inline(always) attributes in this neighborhood. (This might not be sufficient for the Enumerate case but could definitely be applied to the array/slice Index impls.)

If so, let me know what tests you'd like me to perform on the result. I would obviously measure the impact on various Hubris-based firmware, but perhaps there are other contexts where this could be a bad thing that you'd want me to check.

@cbiffle While I would first check to see if anyone else has tried something like this in another PR first... yes, we are generally open to people opening such PRs! The main question is whether they do something like "regress compiler performance" or such, so they generally get thrown against rustc-perf first to examine that. It's also not-atypical for inline(always) to be the "wrong answer" when considered across all possible types if it leads to over-inlining, but for many extremely trivial functions in libcore it's often worth checking that assumption, yeah.

In any case, ideally they come with a codegen test (see tests/codegen and codegen-tests in the rustc-dev-guide) that verifies that the emitted IR has the properties you want for the code you wanted optimized. That way whatever is implemented for the optimization can get removed later hypothetically if LLVM ever learns to Just Do The Right Thing.

workingjubilee commented 7 months ago

@cbiffle Also, a bit obvious but: I strongly recommend investigating the latest behaviors on 1.77 and nightly! There has been a lot of effort, much of it very recent, to make rustc automatically inline more code by doing so on the MIR level, before LLVM has a chance to make decisions (and thus, possibly avoid making good ones, although this sometimes proves too over-eager). And it'd be great to land codegen or assembly tests even if this behavior is fixed for you.

kevinmehall commented 7 months ago

I've also encountered similar, and it still persists on nightly. It seems worse on opt-level=z than opt-level=s, but in my case the overall code size is smaller with z despite the bloat in certain cases like this.

Compiler Explorer reproducer

saethlin commented 7 months ago

I think the only solution to problems like this is #[inline(always)]. That has the deeply unfortunate side effect of producing inlining in unoptimized builds (so perversely fixing this will probably degrade compile times for unoptimized builds, yes this behavior is deliberate, yes I hate it deeply), but I don't think we have a better tool for telling LLVM that no actually its inlining heuristics are wrong for this function and we know better.

In at least one case, opt-level=3 winds up being smaller because of more aggressive inlining causing checks to be elided.

This is decently well-known, and is the reason I and some others have been surprised at the continuing popularity of opt-level=z in embedded.

Anyone working on this should be aware of the unfortunate interaction with ISPCCP and codegen tests, I'm quite sure that our current and only opt-level=z test is broken because of it: https://github.com/rust-lang/rust/pull/119878#issuecomment-1955241667

kevinmehall commented 7 months ago

@saethlin's recent PR https://github.com/rust-lang/rust/pull/120863 added #[inline(always)] to <Range<usize> as SliceIndex<[T]>>::index and <RangeTo<usize> as SliceIndex<[T]>>::index but not index_mut, explaining the difference between the & and &mut slices in my example on nightly.

cbiffle commented 7 months ago

@workingjubilee - I can confirm @kevinmehall's observation that the behavior mostly continues on nightly. The Index impls for slices are more reliably inlined (thanks @saethlin!) but several of my other bugbears, such as Enumerate, are not.

(This has the exciting side effect that appeasing clippy, by using enumerate instead of iterating over a range and indexing, actually makes the code bigger and performance worse!)

re: opt-level=3 being smaller sometimes:

This is decently well-known, and is the reason I and some others have been surprised at the continuing popularity of opt-level=z in embedded.

Well, the "sometimes" here is the problem. We currently have images that will fit in a system with z that will not fit with 3. I haven't done a deep analysis here recently but last I looked, it was mostly a result of aggressive loop unrolling undoing some attempts at size optimization. These are the sorts of things I'd have hit with pragmas to control unrolling in C, but as far as I'm aware we have no direct equivalent in rustc.

Side note: what I really want is the ability to provide an "inlining hints" file for a given application that applies the equivalent of inline(always) / inline(never) from outside the source tree, in response to observed compiler behavior. This way my weird constraints don't need to affect e.g. the compile performance of debug builds. Yeah, this would be fragile across compiler versions, but we update the toolchain approximately annually because a lot of things related to binary size are fragile across compiler versions. Anyway, end of side note.

cbiffle commented 7 months ago

Anyone working on this should be aware of the unfortunate interaction with ISPCCP and codegen tests

Thanks for the reference on ISPCCP, I was unfamiliar with it, and it explains several cases where my 15-year-old mental model of LLVM's behavior wasn't matching what I was seeing.

workingjubilee commented 7 months ago

@cbiffle Well, you can still pass llvm-args to the backend.

One thing that might be interesting to do, that a few people suggested long ago but that hasn't really been brought up since, is to pass special instructions to the pass manager to cope with the fact that Rust generates a lot of IR yet also benefits a lot from inlining, potentially much more on both counts than, say, clang. Given that LLVM recently-ish changed their pass manager interface a lot (and by recently I mean multiple versions ago For Good), it might benefit significantly from someone taking a whack at that, especially for -Copt-level=z.

saethlin commented 7 months ago

@cbiffle One thing that's really missing here is that we have no tests at all for the code size of realistic codebases produced by -Copt-level=z. If we were to add Hubris or lilos to rustc-perf (which reports artifact sizes in addition to build time), would you expect that to be a decent signal of how effective various tuning of the standard library or compiler is at providing the optimizations that help you?

the8472 commented 7 months ago

Could cranelift be more easily tweaked for code size? In my naive understanding the egraph concept seems more amenable to the iterative inline, optimize, check size, repeat ... pick the best result approach that would be difficult for LLVM.

Side note: what I really want is the ability to provide an "inlining hints" file for a given application that applies the equivalent of inline(always) / inline(never) from outside the source tree, in response to observed compiler behavior.

Does PGO help? It may not be designed to help with size, but for iterators it might help coincidentally since they also tend to be hot.

workingjubilee commented 7 months ago

@the8472 Currently, Cranelift is somewhat moot in this case: it does not support 32-bit Arm.

cbiffle commented 7 months ago

@the8472 - unfortunately I have no idea how to apply PGO to a bare-metal embedded platform.

kevinmehall commented 7 months ago

IIRC Cranelift doesn't do inlining (only MIR inlining) and considers most interprocedural optimizations out of scope. But yes, it seems like egraphs could be much better at this.

Is PGO feasible on Cortex-M? Instrumented build might be too much overhead to even fit on the device, and would need a way to get the profile back to the host. Perhaps something like ARM ETM could capture an execution trace and post-process it into a profile, but it doesn't seem like anyone has implemented such a thing.

bjorn3 commented 7 months ago

Could cranelift be more easily tweaked for code size?

Cranelift doesn't support optimizing away loads and stores beyond very limited cases. And doesn't support inlining and other inter-procedural optimizations. As such it is almost certainly going to result in larger binaries than LLVM.

bjorn3 commented 7 months ago

For Hubris the OS image consists of a bunch of independent executables, right? Would it be possible to share some functions using dylib like functionality? You could put the "dylib" at a fixed address to avoid all complications of a dynamic linker if you make sure that the "dylib" doesn't reference the main executable. You only need to ensure that the "dylib" is mapped at the expected address by the kernel. Implementing this in a way that doesn't require a (relatively) huge dylib covering the entirety of libcore and liballoc including all unused functions may be non-trivial though. Maybe you can compile the entire system once, keep a record of all actually used symbols from the dylib and then recompile the dylib with a version script making only those symbols public + --gc-sections to discard the rest and then recompile all executables to update the expected locations of all symbols. With -Zdylib-lto you can separately apply LTO to the dylib and the main executable. This all is probably a lot more fragile than making LLVM produce smaller executables though.

cbiffle commented 7 months ago

For Hubris the OS image consists of a bunch of independent executables, right? Would it be possible to share some functions using dylib like functionality?

Bit off topic, but -- we looked into that early on, and the rustc/LLVM support for RWPI (read-write position independence), so that we could share the read-only text and not any read-write data segments, wasn't there yet. (Dylibs on Big Computers tend to assume things like a fixed offset from text to data, or that all data is next to each other, both of which are problematic for our use case.) I haven't seen signs that RWPI is stable so far. We're comfortable with our current approach for now.

bjorn3 commented 7 months ago

Dylibs on Big Computers tend to assume things like a fixed offset from text to data, or that all data is next to each other, both of which are problematic for our use case.

Libcore and liballoc only have read-only data afaik, so sharing all data too for the dylib shouldn't be a problem, right? You could add more crates to the dylib if you make sure they don't have any mutable data either.

cbiffle commented 7 months ago

Alright, after some further analysis on programs here, I think as of current nightly, my main foe is <Enumerate<T> as Iterator>::next. Failing to inline it causes a checked add (inside the function) and then often a bounds check or other check (in the loop where the information about the range of the index is hidden).

Enumerate is not actually special, so I'm going to evaluate the potential change by copying it into a crate and festooning it with inline attributes and seeing what happens. (Currently all problematic calls to enumerate are in code I control, so I can do this.)

cbiffle commented 7 months ago

In case anyone wonders about that and tries it themselves:

Yes, you can write your own Enumerate outside core and set it to inline aggressively and have it reliably go away at compile time. In our case, we then hit FilterMap, which in turn hits FlatMap, until eventually you've copied the entire iterator machinery. This is problematic because the internal implementation of the iterator machinery uses a bunch of unstable features. So in the end, I decided against this approach.

It turns out that, annoyingly, (0..array.len()).zip(&array) is significantly more efficient with the current compiler than array.iter().enumerate(), so there are now some places I've snuck that in.

It's off the topic of my original post but I'm now looking at f32::clamp. This has stopped inlining reliably recently, and that's a problem because it contains a panic that pulls in all of the floating point formatting code. I think there's a potentially strong argument for inline(always) on f32::clamp so that may be the new direction.

the8472 commented 7 months ago

It's off the topic of my original post but I'm now looking at f32::clamp. This has stopped inlining reliably recently, and that's a problem because it contains a panic that pulls in all of the floating point formatting code.

Maybe those paths chould be outlined and put into #[cold], that'd help with inlining but not with code size. But maybe there's a reason why this is not the case already. cc @m-ou-se

In the meantime using build-std + panic_immediate_abort might help.

workingjubilee commented 7 months ago

It turns out that, annoyingly, (0..array.len()).zip(&array) is significantly more efficient with the current compiler than array.iter().enumerate(), so there are now some places I've snuck that in.

That's hilarious.

Anyways thanks for noting all this, @cbiffle, one of the things we've noticed is that we apparently don't enable a number of MIR opts with -Copt-level=s or -Copt-level=z that are probably critical for reducing code size. We probably mostly need to figure out a good testing setup to ensure that enabling them is mostly improvements and not regressions.

rust-lang / rust

Not inlining array/slice access and iteration at opt-level=z causes knock-on effects on size #123345

Summary

A concrete example from 1.75.0-ish

Impact

A possibly clumsy solution

rustc version