Open cbiffle opened 8 months ago
So, for the specific cases I described above -- indexing arrays with usizes or ranges of usizes, and core::slice::Iter -- does it ever make sense to not inline these routines? I'm wondering if folks would be open to me going through and carefully applying some inline(always) attributes in this neighborhood. (This might not be sufficient for the Enumerate
case but could definitely be applied to the array/slice Index impls.) If so, let me know what tests you'd like me to perform on the result. I would obviously measure the impact on various Hubris-based firmware, but perhaps there are other contexts where this could be a bad thing that you'd want me to check.
@cbiffle While I would first check to see if anyone else has tried something like this in another PR first... yes, we are generally open to people opening such PRs! The main question is whether they do something like "regress compiler performance" or such, so they generally get thrown against rustc-perf first to examine that. It's also not-atypical for inline(always)
to be the "wrong answer" when considered across all possible types if it leads to over-inlining, but for many extremely trivial functions in libcore it's often worth checking that assumption, yeah.
In any case, ideally they come with a codegen test (see tests/codegen and codegen-tests in the rustc-dev-guide) that verifies that the emitted IR has the properties you want for the code you wanted optimized. That way whatever is implemented for the optimization can get removed later hypothetically if LLVM ever learns to Just Do The Right Thing.
@cbiffle Also, a bit obvious but: I strongly recommend investigating the latest behaviors on 1.77 and nightly! There has been a lot of effort, much of it very recent, to make rustc automatically inline more code by doing so on the MIR level, before LLVM has a chance to make decisions (and thus, possibly avoid making good ones, although this sometimes proves too over-eager). And it'd be great to land codegen or assembly tests even if this behavior is fixed for you.
I've also encountered similar, and it still persists on nightly. It seems worse on opt-level=z than opt-level=s, but in my case the overall code size is smaller with z
despite the bloat in certain cases like this.
I think the only solution to problems like this is #[inline(always)]
. That has the deeply unfortunate side effect of producing inlining in unoptimized builds (so perversely fixing this will probably degrade compile times for unoptimized builds, yes this behavior is deliberate, yes I hate it deeply), but I don't think we have a better tool for telling LLVM that no actually its inlining heuristics are wrong for this function and we know better.
In at least one case, opt-level=3 winds up being smaller because of more aggressive inlining causing checks to be elided.
This is decently well-known, and is the reason I and some others have been surprised at the continuing popularity of opt-level=z
in embedded.
Anyone working on this should be aware of the unfortunate interaction with ISPCCP and codegen tests, I'm quite sure that our current and only opt-level=z
test is broken because of it: https://github.com/rust-lang/rust/pull/119878#issuecomment-1955241667
@saethlin's recent PR https://github.com/rust-lang/rust/pull/120863 added #[inline(always)]
to <Range<usize> as SliceIndex<[T]>>::index
and <RangeTo<usize> as SliceIndex<[T]>>::index
but not index_mut
, explaining the difference between the &
and &mut
slices in my example on nightly.
@workingjubilee - I can confirm @kevinmehall's observation that the behavior mostly continues on nightly. The Index
impls for slices are more reliably inlined (thanks @saethlin!) but several of my other bugbears, such as Enumerate
, are not.
(This has the exciting side effect that appeasing clippy, by using enumerate instead of iterating over a range and indexing, actually makes the code bigger and performance worse!)
re: opt-level=3 being smaller sometimes:
This is decently well-known, and is the reason I and some others have been surprised at the continuing popularity of opt-level=z in embedded.
Well, the "sometimes" here is the problem. We currently have images that will fit in a system with z
that will not fit with 3
. I haven't done a deep analysis here recently but last I looked, it was mostly a result of aggressive loop unrolling undoing some attempts at size optimization. These are the sorts of things I'd have hit with pragmas to control unrolling in C, but as far as I'm aware we have no direct equivalent in rustc.
Side note: what I really want is the ability to provide an "inlining hints" file for a given application that applies the equivalent of inline(always)
/ inline(never)
from outside the source tree, in response to observed compiler behavior. This way my weird constraints don't need to affect e.g. the compile performance of debug builds. Yeah, this would be fragile across compiler versions, but we update the toolchain approximately annually because a lot of things related to binary size are fragile across compiler versions. Anyway, end of side note.
Anyone working on this should be aware of the unfortunate interaction with ISPCCP and codegen tests
Thanks for the reference on ISPCCP, I was unfamiliar with it, and it explains several cases where my 15-year-old mental model of LLVM's behavior wasn't matching what I was seeing.
@cbiffle Well, you can still pass llvm-args to the backend.
One thing that might be interesting to do, that a few people suggested long ago but that hasn't really been brought up since, is to pass special instructions to the pass manager to cope with the fact that Rust generates a lot of IR yet also benefits a lot from inlining, potentially much more on both counts than, say, clang. Given that LLVM recently-ish changed their pass manager interface a lot (and by recently I mean multiple versions ago For Good), it might benefit significantly from someone taking a whack at that, especially for -Copt-level=z
.
@cbiffle One thing that's really missing here is that we have no tests at all for the code size of realistic codebases produced by -Copt-level=z
. If we were to add Hubris or lilos to rustc-perf (which reports artifact sizes in addition to build time), would you expect that to be a decent signal of how effective various tuning of the standard library or compiler is at providing the optimizations that help you?
Could cranelift be more easily tweaked for code size? In my naive understanding the egraph concept seems more amenable to the iterative inline, optimize, check size, repeat ... pick the best result approach that would be difficult for LLVM.
Side note: what I really want is the ability to provide an "inlining hints" file for a given application that applies the equivalent of inline(always) / inline(never) from outside the source tree, in response to observed compiler behavior.
Does PGO help? It may not be designed to help with size, but for iterators it might help coincidentally since they also tend to be hot.
@the8472 Currently, Cranelift is somewhat moot in this case: it does not support 32-bit Arm.
@the8472 - unfortunately I have no idea how to apply PGO to a bare-metal embedded platform.
IIRC Cranelift doesn't do inlining (only MIR inlining) and considers most interprocedural optimizations out of scope. But yes, it seems like egraphs could be much better at this.
Is PGO feasible on Cortex-M? Instrumented build might be too much overhead to even fit on the device, and would need a way to get the profile back to the host. Perhaps something like ARM ETM could capture an execution trace and post-process it into a profile, but it doesn't seem like anyone has implemented such a thing.
Could cranelift be more easily tweaked for code size?
Cranelift doesn't support optimizing away loads and stores beyond very limited cases. And doesn't support inlining and other inter-procedural optimizations. As such it is almost certainly going to result in larger binaries than LLVM.
For Hubris the OS image consists of a bunch of independent executables, right? Would it be possible to share some functions using dylib like functionality? You could put the "dylib" at a fixed address to avoid all complications of a dynamic linker if you make sure that the "dylib" doesn't reference the main executable. You only need to ensure that the "dylib" is mapped at the expected address by the kernel. Implementing this in a way that doesn't require a (relatively) huge dylib covering the entirety of libcore and liballoc including all unused functions may be non-trivial though. Maybe you can compile the entire system once, keep a record of all actually used symbols from the dylib and then recompile the dylib with a version script making only those symbols public + --gc-sections
to discard the rest and then recompile all executables to update the expected locations of all symbols. With -Zdylib-lto
you can separately apply LTO to the dylib and the main executable. This all is probably a lot more fragile than making LLVM produce smaller executables though.
For Hubris the OS image consists of a bunch of independent executables, right? Would it be possible to share some functions using dylib like functionality?
Bit off topic, but -- we looked into that early on, and the rustc/LLVM support for RWPI (read-write position independence), so that we could share the read-only text and not any read-write data segments, wasn't there yet. (Dylibs on Big Computers tend to assume things like a fixed offset from text to data, or that all data is next to each other, both of which are problematic for our use case.) I haven't seen signs that RWPI is stable so far. We're comfortable with our current approach for now.
Dylibs on Big Computers tend to assume things like a fixed offset from text to data, or that all data is next to each other, both of which are problematic for our use case.
Libcore and liballoc only have read-only data afaik, so sharing all data too for the dylib shouldn't be a problem, right? You could add more crates to the dylib if you make sure they don't have any mutable data either.
Alright, after some further analysis on programs here, I think as of current nightly, my main foe is <Enumerate<T> as Iterator>::next
. Failing to inline it causes a checked add (inside the function) and then often a bounds check or other check (in the loop where the information about the range of the index is hidden).
Enumerate
is not actually special, so I'm going to evaluate the potential change by copying it into a crate and festooning it with inline attributes and seeing what happens. (Currently all problematic calls to enumerate
are in code I control, so I can do this.)
In case anyone wonders about that and tries it themselves:
Yes, you can write your own Enumerate
outside core
and set it to inline aggressively and have it reliably go away at compile time. In our case, we then hit FilterMap
, which in turn hits FlatMap
, until eventually you've copied the entire iterator machinery. This is problematic because the internal implementation of the iterator machinery uses a bunch of unstable features. So in the end, I decided against this approach.
It turns out that, annoyingly, (0..array.len()).zip(&array)
is significantly more efficient with the current compiler than array.iter().enumerate()
, so there are now some places I've snuck that in.
It's off the topic of my original post but I'm now looking at f32::clamp
. This has stopped inlining reliably recently, and that's a problem because it contains a panic that pulls in all of the floating point formatting code. I think there's a potentially strong argument for inline(always)
on f32::clamp
so that may be the new direction.
It's off the topic of my original post but I'm now looking at f32::clamp. This has stopped inlining reliably recently, and that's a problem because it contains a panic that pulls in all of the floating point formatting code.
Maybe those paths chould be outlined and put into #[cold]
, that'd help with inlining but not with code size. But maybe there's a reason why this is not the case already. cc @m-ou-se
In the meantime using build-std + panic_immediate_abort might help.
It turns out that, annoyingly, (0..array.len()).zip(&array) is significantly more efficient with the current compiler than array.iter().enumerate(), so there are now some places I've snuck that in.
That's hilarious.
Anyways thanks for noting all this, @cbiffle, one of the things we've noticed is that we apparently don't enable a number of MIR opts with -Copt-level=s
or -Copt-level=z
that are probably critical for reducing code size. We probably mostly need to figure out a good testing setup to ensure that enabling them is mostly improvements and not regressions.
Summary
Hi! We've been seeing some surprising codegen decisions in Hubris programs for a while, and I wanted to surface it upstream in case there's anything to be done.
We build primarily at
opt-level="z"
, though many of these behaviors seem to happen atopt-level="s"
too. (z
code is very slightly smaller.)There are a handful of routines in
core
that rustc often decides not to inline which have significant knock-on effects on LLVM's ability to reason about panics, integer ranges for bounds checks, and the like. The problem primarily arises with:<Enumerate<core::slice::iter::Iter<T>> as Iterator>>::next
- not inlining this appears to hamper the ability of other parts of the compiler to notice that the enumerated indices it's producing are in a predictable range<Index<usize> for [T]>::index
or<Index<Range<usize>> for [T]>::index
- the non-inlined version of this does an unconditional bounds check, when in many cases there is information available at the callsite to eliminate it were it inlined<core::slice::Iter<T> as Iterator>::next
All of these routines, when inlined, tend to produce either one or zero machine instructions -- one in the case of
index
which is typically an indexed addressing mode load, zero in the case ofEnumerate::next
where the information is often already available in the calling frame for other reasons, and generally one in the case ofslice::Iter::next
.A concrete example from 1.75.0-ish
Here is a concrete example from one of our current programs.
The originating Rust code is (simplified):
Here's the ARMv7E-M code as produced by objdump from the output of
nightly-2023-12-21
(chosen to be approximately equivalent to 1.75.0).r6
here holds the length being produced by the inlined operation; along all code paths the length it produces is either 0, 1, or 8, all literal constants. I've included the simplest one here, which sets it to literal 1.Given that we are indexing a fixed-length 8-element array normally rustc would be very good about eliding the bounds check, since 0, 1, and 8 are all valid options here.
Instead we have the argument setup and a call to the non-inlined index function, which contains a bounds check and conditional panic:
Impact
In many of our programs, this has a pretty significant impact. In at least one case,
opt-level=3
winds up being smaller because of more aggressive inlining causing checks to be elided. Because panic sites are relatively expensive -- we format and record panic messages, so they pull in formatting machinery that might not otherwise be needed -- in programs that otherwise contain no panics, a shift in the inlining decisions around something likeindex
can cause a program's size to grow by 4 kiB pretty easily.For context, that's a 50-100% size increase on most of our programs.
A possibly clumsy solution
So, for the specific cases I described above -- indexing arrays with usizes or ranges of usizes, and
core::slice::Iter
-- does it ever make sense to not inline these routines? I'm wondering if folks would be open to me going through and carefully applying someinline(always)
attributes in this neighborhood. (This might not be sufficient for theEnumerate<T>
case but could definitely be applied to the array/slice Index impls.)If so, let me know what tests you'd like me to perform on the result. I would obviously measure the impact on various Hubris-based firmware, but perhaps there are other contexts where this could be a bad thing that you'd want me to check.
My LLVM experience is limited and rusty, so there might be a cleverer thing that could be done in the compiler, but I wouldn't know what to suggest.
rustc version
The examples above were generated by:
but I've been observing this behavior consistently since at least 2021, and just wasn't sure if an upstream bug report made sense.