Closed alexbool closed 7 years ago
The blowup in binary size is the minor difference. The main difference is in the compilation time and the run time of the resulting binary, which are noticeably worse.
cc @alexcrichton @pnkfelix
@alexbool can you enable us to reproduce this? I can investigate in parallel but bug reports tend to be much more readily fixable if there's a way to reproduce what caused the bug report!
Looks like this isn't a regression for "hello world", or is likely just "in the noise":
$ rustc +stable -V
rustc 1.18.0 (03fc9d622 2017-06-06)
$ rustc +stable foo.rs && strip foo && ls -alh foo
-rwxrwxr-x 1 alex alex 316K Jun 21 18:47 foo
$ rustc +beta -V
rustc 1.19.0-beta.2 (a175ee509 2017-06-15)
$ rustc +beta foo.rs && strip foo && ls -alh foo
-rwxrwxr-x 1 alex alex 368K Jun 21 18:47 foo
$ rustc +nightly -V
rustc 1.20.0-nightly (622e7e648 2017-06-21)
$ rustc +nightly foo.rs && strip foo && ls -alh foo
-rwxrwxr-x 1 alex alex 324K Jun 21 18:47 foo
I'll try to make a minimal reproducible example today or later this week
@alexbool any updates on this?
@leonardo-m or @hinaria do y'all have examples of this causing a size increase?
Unfortunately I haven't had much time to dig into this. Will try as soon as I have the chance.
Here is a test case that may serve to exhibit the space increase that @alexbool observed around this time.
(Or at least, on my Mac, building without optimizations, I am observing a size increase of about 1.08x on the stripped binaries. With optimizations, the size increase is about 1.05x)
// vecel.rs
use std::env;

fn main() {
    let mut args = env::args();
    args.next();
    let arg1 = args.next().unwrap();
    let num: usize = arg1.parse().unwrap();
    let mut v = Vec::new();
    for i in 0..num {
        v.push(i);
    }
    assert_eq!(v.len(), num);
}
Build script:
#!/bin/sh
# vecel.sh
f() {
    OUT=vecel.$(rustc --version | sed -e 's/(//' -e 's/)//' | cut -d ' ' -f 4).bin
    rustc --version && rustc ~/Dev/Rust/vecel.rs -o $OUT && strip $OUT
}

for day in 19 20 21 22 23 24 25 26 27; do
    rustup default nightly-2017-06-$day
    f
done

CHAN=stable;  rustc +$CHAN ~/Dev/Rust/vecel.rs -o vecel.$CHAN.bin && strip vecel.$CHAN.bin
CHAN=beta;    rustc +$CHAN ~/Dev/Rust/vecel.rs -o vecel.$CHAN.bin && strip vecel.$CHAN.bin
CHAN=nightly; rustc +$CHAN ~/Dev/Rust/vecel.rs -o vecel.$CHAN.bin && strip vecel.$CHAN.bin
ls -alh /tmp/vecel.*.bin
Results:
-rwxr-xr-x 1 fklock wheel 253K Jun 27 13:35 /tmp/vecel.2017-06-18.bin
-rwxr-xr-x 1 fklock wheel 253K Jun 27 13:35 /tmp/vecel.2017-06-19.bin
-rwxr-xr-x 1 fklock wheel 266K Jun 27 13:35 /tmp/vecel.2017-06-20.bin
-rwxr-xr-x 1 fklock wheel 266K Jun 27 13:35 /tmp/vecel.2017-06-21.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-22.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-23.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-24.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-25.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-26.bin
-rwxr-xr-x 1 fklock wheel 269K Jun 27 13:35 /tmp/vecel.beta.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.nightly.bin
-rwxr-xr-x 1 fklock wheel 253K Jun 27 13:35 /tmp/vecel.stable.bin
My previous comment notwithstanding, it would be really great if @alexbool could provide either the original test case or a somewhat reduced version of it. It is entirely possible that my (strawman) object-size microbenchmark does not actually reflect the real issues underpinning the size increase that @alexbool is observing.
@alexcrichton, a small increase in binary size is not important to me, but I am seeing significant increases in the run time of the binary. I've filed a different issue, but the cause could be the same:
@pnkfelix oh note that I added a bunch of #[inline] attributes in https://github.com/rust-lang/rust/pull/42727, which may help with the perf related to that.
This pathological case produces an ~8.5% size increase in release mode between 2017-06-19 and 2017-06-23:
main.rs:
#[macro_use]
extern crate serde;
extern crate serde_yaml;
use std::fs::OpenOptions;
#[derive(Debug, Deserialize)]
pub struct Something {
    field1: Vec<S1>,
    field2: Vec<S2>,
    field3: Vec<S3>,
    field4: Vec<S4>,
    field5: Vec<S5>,
    field6: Vec<S6>,
    field7: Vec<S7>,
    field8: Vec<S8>,
    field9: Vec<S9>,
}
#[derive(Debug, Deserialize)]
pub struct S1(String);
#[derive(Debug, Deserialize)]
pub struct S2(String);
#[derive(Debug, Deserialize)]
pub struct S3(String);
#[derive(Debug, Deserialize)]
pub struct S4(String);
#[derive(Debug, Deserialize)]
pub struct S5(String);
#[derive(Debug, Deserialize)]
pub struct S6(String);
#[derive(Debug, Deserialize)]
pub struct S7(String);
#[derive(Debug, Deserialize)]
pub struct S8(String);
#[derive(Debug, Deserialize)]
pub struct S9(String);
fn main() {
    println!(
        "{:?}",
        serde_yaml::from_reader::<_, Something>(OpenOptions::new().open("whatever").unwrap())
    );
}
Cargo.toml:
[package]
name = "issue-42808"
version = "0.1.0"
authors = ["Alexander Bulaev <alexbool@yandex-team.ru>"]
[dependencies]
serde = { version = "=1.0.8", features = ["derive"] }
serde_yaml = "=0.7.1"
I have a datapoint to provide.
For ease of experimentation, I made a small variation on alexbool's benchmark that avoids pulling in serde. (Basically, my goal was to have a single .rs file that could be independently compiled without any crate dependencies.) I think my benchmark is likely to get at the heart of what is problematic here (namely that the code-duplication from monomorphization of generics can amplify the effect of otherwise minor code size regressions).
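To make that amplification point concrete, here is a small illustrative sketch (invented for this write-up, not code from the issue; fill() and the S-types are made-up names): each distinct element type drags its own monomorphized copies of Vec::push, the RawVec growth path, and core::ptr::drop_in_place into the binary, so a small per-instantiation regression gets multiplied by the number of types.

// Illustrative sketch only.
struct S1(String);
struct S2(String);

fn fill<T>(n: usize, mut make: impl FnMut(usize) -> T) -> Vec<T> {
    let mut v = Vec::new();
    for i in 0..n {
        v.push(make(i)); // one copy of push (and its allocator plumbing) per concrete T
    }
    v
}

fn main() {
    let a = fill(10, |i| S1(i.to_string()));
    let b = fill(10, |i| S2(i.to_string()));
    assert_eq!(a.len() + b.len(), 20);
}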
Manually inspecting the output assembly and comparing what we generated before and after the Allocator API landed, I saw cases where we could/should be inlining more. So I tried comparing against the branch that is the basis for PR #42727, which adds some #[inline] directives we will definitely need in order to address the execution-time regressions. Unfortunately, that inlining happens to make the overall code size here worse.
Another thing I noticed was that we seem to be spending a number of instructions just shuffling words around. It wasn't immediately obvious what all the causes were, but one clear candidate was the AllocErr struct that now needs to be passed around when we return the Err case. So as an experiment, I tried making the AllocErr struct zero-sized (i.e. removing all of its content, which is probably not what we want to do long term). I also added inlines of Layout::repeat and Layout::array.
rust variant | resulting code size (bytes)
---|---
just before Allocator API | 298,840
just after Allocator API | 315,792
post #42727 | 323,896
post #42727 + making AllocErr zero-sized + more Layout inlines | 311,296
So this shows that some of the code size regression can perhaps be blamed on the change to alloc's return type; it used to be *mut u8, and now it is Result<*mut u8, AllocErr> (and I'm curious whether it goes down further if you make that Result<NonZero<*mut u8>, AllocErr>, though we have already been through the arguments against using NonZero here...)
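To make the shape of that change concrete, here is an illustrative sketch with made-up names (not the real libstd signatures): the old entry point signalled failure through a null pointer, the new one returns a Result whose error payload has to be constructed and moved at every call site, and a zero-sized error type brings the shape back close to the old one.

// Hedged sketch; AllocErrWithPayload / AllocErrZeroSized / alloc_* are made-up names.
#[derive(Debug)]
struct AllocErrWithPayload([usize; 2]); // stand-in for an error carrying a Layout-sized payload
#[derive(Debug)]
struct AllocErrZeroSized;               // stand-in for the zero-sized experiment

// Old shape: failure is signalled by a null pointer.
fn alloc_old(fail: bool) -> *mut u8 {
    if fail { std::ptr::null_mut() } else { Box::into_raw(Box::new(0u8)) }
}

// New shape: failure carries a payload that every call site must materialize and move.
fn alloc_new(fail: bool) -> Result<*mut u8, AllocErrWithPayload> {
    if fail { Err(AllocErrWithPayload([8, 1])) } else { Ok(Box::into_raw(Box::new(0u8))) }
}

// Experiment: a zero-sized error makes the Result cheap to pass around again.
fn alloc_zst(fail: bool) -> Result<*mut u8, AllocErrZeroSized> {
    if fail { Err(AllocErrZeroSized) } else { Ok(Box::into_raw(Box::new(0u8))) }
}

fn main() {
    assert!(alloc_old(true).is_null());
    assert!(alloc_new(true).is_err());
    assert!(alloc_zst(true).is_err());
    // free the successful allocations so the example does not leak
    unsafe {
        drop(Box::from_raw(alloc_old(false)));
        drop(Box::from_raw(alloc_new(false).unwrap()));
        drop(Box::from_raw(alloc_zst(false).unwrap()));
    }
}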
Update: I attempted to recreate my above results and discovered that I had posted my original comment without finishing a crucial sentence: the 311,296 figure was gathered from a build that both made AllocErr zero-sized and marked more operations of Layout as #[inline].
Thanks for the investigation and minimization @pnkfelix! I'm still pretty surprised about the increase in code there! I think it may be worth digging in a bit more to see where this "code bloat" is coming from. I would suspect that these are legitimately new locations for more Layout validation, but it would be good to validate that claim itself. If no one else gets around to this I'll try to start investigating after #42727 lands.
One thing that I keep seeing in the diffs, regardless of how I attempt to "correct" for whatever changes the trait Allocator injected: there continues to be a big increase in the number of core::ptr::drop_in_place definitions.
Building my benchmark from just before Allocator landed yields 20 instantiated definitions of core::ptr::drop_in_place in its assembly output (many of them quite trivial and identical bits of code, sigh). Right after Allocator landed, there were 31 instantiated definitions of core::ptr::drop_in_place. But sometime after that, it blew up to 49 instantiated definitions. Not really sure when that happened.
Hm ok so now I'm getting interesting results again:
Using foo.rs as what @pnkfelix listed above, I get:
$ rustc +nightly -V
rustc 1.20.0-nightly (696412de7 2017-07-06)
$ rustc +nightly foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 483K Jul 7 08:29 foo
$ rustc +nightly-2017-06-01 foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 476K Jul 7 08:29 foo
However, removing derive(Debug), I get:
$ rustc +nightly foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 459K Jul 7 08:30 foo
$ rustc +nightly-2017-06-01 foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 456K Jul 7 08:30 foo
Can this still be reproduced on the latest nightly?
Just a quick note for anyone trying to figure out how the numbers have been jumping up and down (at least, I was initially flummoxed when trying to compare @alexcrichton's results with the numbers I provided 12 days ago).
The original bug report was comparing rustc 1.19.0-nightly (10d7cb44c 2017-06-18) against 1.20.0-nightly (445077963 2017-06-20). So it was a bit confusing that @alexcrichton used nightly-2017-06-01 as a reference point, because the object size changed quite a bit between May 31st and June 18th, as illustrated here (where foo.rs is the file @alexcrichton came up with that lacks derive(Debug)):
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-06-01" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.19.0-nightly (e0cc22b4b 2017-05-31)
-rwxr-xr-x 1 pnkfelix pnkfelix 450K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-06-19" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.19.0-nightly (10d7cb44c 2017-06-18)
-rwxr-xr-x 1 pnkfelix pnkfelix 392K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-06-21" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.20.0-nightly (445077963 2017-06-20)
-rwxr-xr-x 1 pnkfelix pnkfelix 416K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-07-07" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.20.0-nightly (696412de7 2017-07-06)
-rwxr-xr-x 1 pnkfelix pnkfelix 453K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-07-11" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.20.0-nightly (bf0a9e0b4 2017-07-10)
-rwxr-xr-x 1 pnkfelix pnkfelix 453K Jul 11 18:02 /tmp/foo
%
In case it's hard to see, here's the important sequence of numbers:
5/31 | 6/18 | 6/20 | 7/07 | 7/11
---|---|---|---|---
450K | 392K | 416K | 453K | 453K
So, there was an observable regression (and it still seems to be present), but it is potentially masked by other improvements to code size that happened between May 31st and June 18th.
Oh I was actually just picking random dates, looks like I "got lucky"!
Also, another factor that is easy to overlook: in the earlier comments from both @alexcrichton and myself, our first invocations of strip had no additional options. The later invocations were strip -g, which only strips debugging symbols. The difference appears to be quite significant, at least on the binaries we are producing today (on a current nightly compiling hello world, the former yields a 348K binary, the latter a 432K binary).
In an effort to try to understand the size trajectories over time, I generalized the benchmark so that one can choose whether one wants to compile it with 1, 2, 3, 11, or 21 distinct variations on the S_ type (and a corresponding number of vectors in the struct Something), and then I compiled each variant with every nightly version of the compiler from May 1st through July 13th.
My intention was to differentiate between a couple of different things:
- How does the cost of a trivial fn main() { } change over time?
- What is the cost of a struct Something that carries a single vector (with that same loop for filling its contents)?
- What is the cost of two vectors in struct Something? How about three?
- As the number of distinct struct S_ (and Something fields and Vec<S_> instantiations) increases, what is the (amortized) cost per S_?

Here is a gist with the program, a shell script to drive it, and two sets of output transcripts (one that uses strip -g, another using strip):
https://gist.github.com/pnkfelix/fe632a4bad0759c52e0001efde83e038
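For readers who do not want to open the gist, here is a rough sketch of the general shape such a benchmark could take (a reconstruction, not the gist's actual code; the macro, the type names, and the exact cfg flags are assumptions, although a --cfg twenty_one flag is mentioned later in this thread):

// Illustrative sketch only; the real benchmark lives in the gist above.
macro_rules! def_types {
    ($($name:ident),*) => { $(pub struct $name(pub String);)* };
}

// Select how many distinct S_ types (and thus Vec<S_> instantiations) get built.
#[cfg(three)]
def_types!(S1, S2, S3);
#[cfg(not(three))]
def_types!(S1);

fn main() {
    let mut v: Vec<S1> = Vec::new();
    for i in 0..10 {
        v.push(S1(i.to_string()));
    }
    assert_eq!(v.len(), 10);
}

One would then compile the same file repeatedly with different --cfg flags and toolchains and compare the stripped sizes, which is what the transcripts in the gist record.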
Here are some observations based on looking at both the development of the numbers over time (i.e. scanning down the columns) and the differences between the variants:
- Sometimes the size does not change at all when going from two S_ variations to three S_ variations. (I assume this can be explained by the presence of padding between function definition(s) that allows code to increase in size "for free" sometimes.)
- It is probably better to use objdump rather than ls for the fine-grained analysis.
- Between (cfb5debbc 2017-06-12) and (03abb1bd7 2017-06-13), the baseline cost of a trivial fn main dropped quite a bit. This potentially masked later regressions, and it's been creeping back up since then.
- Between (10d7cb44c 2017-06-18) and (3bfc18a96 2017-06-29) (i.e., when the trait Alloc API landed, plus some other things I have not yet identified), the amortized cost per S_ type variation jumped from 409 bytes to 1638 bytes when measured with strip; with strip -g, the jump was from 510 bytes to (about) 2000 bytes.

Thanks for the scripts and investigation @pnkfelix! Lots of good data.
Between (cfb5debbc 2017-06-12) and (03abb1bd7 2017-06-13), the baseline cost of a trivial fn main dropped quite a bit. This potentially masked later regressions, and it's been creeping back up since then.
This range of commits notably includes https://github.com/rust-lang/rust/pull/42566, which probably just means that jemalloc 4.0.0 is a smaller binary than jemalloc 4.5.0.
I tweaked the script/source file to account for this and always link the system allocator, and got these results, which notably show the same point of regression you saw, and that the addition of #[inline] didn't actually help and may have worsened it!
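For anyone reproducing this, one plausible way to force the system allocator on a 2017-era nightly was the alloc_system crate behind its feature gate; the following is a hedged sketch of that approach, not necessarily the exact tweak made to the script above.

// Sketch of opting out of jemalloc on a 2017 nightly so allocator-library size
// changes stop dominating the comparison (this was a nightly-only feature at the time).
#![feature(alloc_system)]
extern crate alloc_system;

fn main() {
    // same kind of benchmark body as before; only the global allocator selection differs
    let v: Vec<String> = (0..10).map(|i| i.to_string()).collect();
    assert_eq!(v.len(), 10);
}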
cc @rust-lang/compiler, with @pnkfelix away we're going to need some help tackling this one.
Needs an assignee. @pnkfelix is away.
Discussed in compiler meeting. @alexcrichton is going to attempt to get more profiling information.
So in terms of regressions I think one of the most actionable right now is the regression in compile time. If you take a look at the associated date range for when https://github.com/rust-lang/rust/pull/42313 landed, there are two regressions of note, one of them in the syntex_syntax crate specifically.
So taking a look specifically at the second regression, with some help from dmajor on IRC we've reached the conclusion that rustc_trans-bb5560cf0f088d86.dll!rustc_trans::mir::trans_mir::{{closure}} grew about 30% in the profiles.
The compiler team looked at this during triage. These are very interesting findings already! Maybe we can find out what exactly is slower in trans_mir()? More detailed profiling of that would be interesting.
cc @arielb1, is this something you could help with?
I can try
Not sure if this adds to what's already known, but I've been testing a tool to compare DWARF debug info. I'm looking at the differences between nightly-2017-06-20 and nightly-2017-06-21 for the test case that @pnkfelix gave, using --cfg twenty_one.
The main size difference is for RawVec::double, which seems to be because we're inlining some function specializations instead of being able to call general allocator functions:
- fn alloc::raw_vec::RawVec<foo::S1>::double<foo::S1>
- size: 109
- inlined subroutines:
- [3] alloc::raw_vec::RawVec<foo::S1>::ptr<foo::S1>
- [10] alloc::heap::reallocate
- [21] alloc::heap::allocate
- calls:
- 0x18b10 alloc_jemalloc::imp::__rust_allocate
- 0x189f0 alloc::oom::oom
- 0x18b40 alloc_jemalloc::imp::__rust_reallocate
+ fn alloc::raw_vec::RawVec<foo::S1, alloc::heap::HeapAlloc>::double<foo::S1,alloc::heap::HeapAlloc>
+ size: 644
+ inlined subroutines:
+ [325] alloc::allocator::Alloc::realloc_array<alloc::heap::HeapAlloc,foo::S1>
+ [2] alloc::allocator::Alloc::oom<alloc::heap::HeapAlloc>
+ [184] alloc::allocator::Alloc::alloc_array<alloc::heap::HeapAlloc,foo::S1>
+ calls:
+ 0x1e190 alloc::allocator::Layout::from_size_align
+ 0x3bb20 core::panicking::panic
+ 0x1e1e0 alloc::allocator::Layout::repeat
+ 0x1e250 alloc::allocator::AllocErr::invalid_input
+ 0x1e1d0 alloc::allocator::Layout::size
+ 0x1e270 alloc::heap::{{impl}}::alloc
+ 0x1e190 alloc::allocator::Layout::from_size_align
+ 0x1e1e0 alloc::allocator::Layout::repeat
+ 0x1e190 alloc::allocator::Layout::from_size_align
+ 0x1e1e0 alloc::allocator::Layout::repeat
+ 0x1e250 alloc::allocator::AllocErr::invalid_input
+ 0x1e1d0 alloc::allocator::Layout::size
+ 0x1e1d0 alloc::allocator::Layout::size
+ 0x1e2f0 alloc::heap::{{impl}}::realloc
There's also a large size difference in main:
fn foo::main
- size: 2378
+ size: 3306
It's mostly due to changes in the size of the inlined push:
- [59] alloc::vec::Vec<foo::S2>::push<foo::S2>
+ [107] alloc::vec::Vec<foo::S2>::push<foo::S2>
[7] alloc::vec::{{impl}}::deref_mut<foo::S2>
- [7] alloc::raw_vec::RawVec<foo::S2>::ptr<foo::S2>
+ [7] alloc::raw_vec::RawVec<foo::S2, alloc::heap::HeapAlloc>::ptr<foo::S2,alloc::heap::HeapAlloc>
+ [4] core::ptr::{{impl}}::offset<foo::S2>
- [17] core::ptr::write<foo::S2>
+ [23] core::ptr::write<foo::S2>
+ [17] core::ptr::drop_in_place<foo::S2>
+ [17] core::ptr::drop_in_place<alloc::string::String>
I can dig further into other compiler versions if this sort of info is useful.
Thanks for the info @philipc! I think I'm noticing a few things with a fresh set of eyes, I'll send a PR soon.
I've opened https://github.com/rust-lang/rust/pull/43815 with the results of micro-analyzing RawVec::double.
@alexbool's example had increased from 675k to 720k by the allocator PR, had a bit of upward creep, and was decreased from 732k to 723k by #43815, so there's still much to improve.
**** YAML
UNSTRIPPED: 4070392
STRIPPED: 675304
**** VECEL
UNSTRIPPED NODI: 3645592
text data bss dec hex filename
302104 14329 3704 320137 4e289 vecel
STRIPPED NODI: 322952
text data bss dec hex filename
302104 14329 3704 320137 4e289 vecel
UNSTRIPPED DI: 3667864
text data bss dec hex filename
302138 14329 3704 320171 4e2ab vecel
STRIPPED DI: 323040
text data bss dec hex filename
302138 14329 3704 320171 4e2ab vecel
rustc 1.19.0-nightly (1143eb26a 2017-06-20) - allocators PR
**** YAML
UNSTRIPPED: 4150288
STRIPPED: 720360
**** VECEL
UNSTRIPPED NODI: 3687424
text data bss dec hex filename
312056 14329 3704 330089 50969 vecel
STRIPPED NODI: 331144
text data bss dec hex filename
312056 14329 3704 330089 50969 vecel
UNSTRIPPED DI: 3714328
text data bss dec hex filename
312130 14329 3704 330163 509b3 vecel
STRIPPED DI: 331232
text data bss dec hex filename
312130 14329 3704 330163 509b3 vecel
rustc 1.20.0-nightly (1685c9298 2017-07-06) - second allocators PR
**** YAML
UNSTRIPPED: 4338000
STRIPPED: 732648
**** VECEL
UNSTRIPPED NODI: 3882392
text data bss dec hex filename
338048 14441 3704 356193 56f61 vecel
STRIPPED NODI: 355720
text data bss dec hex filename
338048 14441 3704 356193 56f61 vecel
UNSTRIPPED DI: 3914568
text data bss dec hex filename
338074 14441 3704 356219 56f7b vecel
STRIPPED DI: 355808
text data bss dec hex filename
338074 14441 3704 356219 56f7b vecel
rustc 1.21.0-nightly (6f4ab9458 2017-08-16) - after @alexcrichton's improvement
**** YAML
UNSTRIPPED: 4316480
STRIPPED: 723616
**** VECEL
UNSTRIPPED NODI: 3869144
text data bss dec hex filename
335444 14433 3704 353581 5652d vecel
STRIPPED NODI: 354912
text data bss dec hex filename
335444 14433 3704 353581 5652d vecel
UNSTRIPPED DI: 3897952
text data bss dec hex filename
335590 14433 3704 353727 565bf vecel
STRIPPED DI: 354992
text data bss dec hex filename
335590 14433 3704 353727 565bf vecel
I think someone needs to do a comparative analysis between the 1143eb26a and 1685c9298 versions - I bet much of that bloat is still left in today's rustc.
More accurate size statistics
Sizes:
section size addr
.interp 28 624
.note.ABI-tag 32 652
.note.gnu.build-id 36 684
.gnu.hash 176 720
.dynsym 2472 896
.dynstr 1627 3368
.gnu.version 206 4996
.gnu.version_r 336 5208
.rela.dyn 24288 5544
.rela.plt 1968 29832
.init 23 31800
.plt 1328 31824
.plt.got 16 33152
.text 473545 33168
.fini 9 506716
.rodata 41952 506752
.debug_gdb_scripts 34 548704
.eh_frame_hdr 8724 548740
.eh_frame 41596 557464
.gcc_except_table 39432 599060
.tdata 304 2735936
.init_array 16 2736240
.fini_array 8 2736256
.jcr 8 2736264
.data.rel.ro 19672 2736288
.dynamic 560 2755960
.got 72 2756520
.got.plt 680 2756608
.data 425 2757312
.bss 3704 2757760
.comment 92 0
Total 663369
issue-42808-with-di-after-ai.strip :
section size addr
.interp 28 624
.note.ABI-tag 32 652
.note.gnu.build-id 36 684
.gnu.hash 176 720
.dynsym 2472 896
.dynstr 1627 3368
.gnu.version 206 4996
.gnu.version_r 336 5208
.rela.dyn 24336 5544
.rela.plt 1968 29880
.init 23 31848
.plt 1328 31872
.plt.got 16 33200
.text 499305 33216
.fini 9 532524
.rodata 42336 532544
.debug_gdb_scripts 34 574880
.eh_frame_hdr 9020 574916
.eh_frame 42892 583936
.gcc_except_table 53896 626828
.tdata 304 2780960
.init_array 16 2781264
.fini_array 8 2781280
.jcr 8 2781288
.data.rel.ro 19720 2781312
.dynamic 560 2801032
.got 72 2801592
.got.plt 680 2801664
.data 425 2802368
.bss 3704 2802816
.comment 92 0
Total 705665
issue-42808-with-di-modern.strip :
section size addr
.interp 28 624
.note.ABI-tag 32 652
.note.gnu.build-id 36 684
.gnu.hash 176 720
.dynsym 2472 896
.dynstr 1627 3368
.gnu.version 206 4996
.gnu.version_r 336 5208
.rela.dyn 24528 5544
.rela.plt 1968 30072
.init 23 32040
.plt 1328 32064
.plt.got 16 33392
.text 501945 33408
.fini 9 535356
.rodata 42656 535424
.debug_gdb_scripts 34 578080
.eh_frame_hdr 9988 578116
.eh_frame 46988 588104
.gcc_except_table 56644 635092
.tdata 304 2792448
.init_array 16 2792752
.fini_array 8 2792768
.jcr 8 2792776
.data.rel.ro 19824 2792800
.dynamic 576 2812624
.got 752 2813200
.data 425 2813952
.bss 3704 2814400
.comment 92 0
Total 716749
So the sections that grew were (sizes in the order of the three listings above):
.eh_frame_hdr        8724    9020    9988
.eh_frame           41596   42892   46988
.text              473545  499305  501945
.gcc_except_table   39432   53896   56644
OK. So, reasons for the bloat I've found:
Filling out an AllocErr for oom takes code size. The previous instruction sequence was just:
f1eb: e8 10 40 04 00 callq 53200 <__rust_allocate>
f1f0: 49 89 c6 mov %rax,%r14
f1f3: 4d 85 f6 test %r14,%r14
f1f6: 0f 85 d3 02 00 00 jne f4cf
f1fc: e8 8f 3b 04 00 callq 52d90 <_ZN5alloc3oom3oom17h5b02814f1abf9784E>
(22 bytes)
But the new instruction sequence is more of the form:
113ab: e8 e0 ab 02 00 callq 3bf90 <__rust_alloc>
113b0: 48 85 c0 test %rax,%rax
113b3: 0f 85 d6 02 00 00 jne 1168f
113b9: 0f 10 85 68 ff ff ff movups -0x98(%rbp),%xmm0
113c0: 0f 29 45 b0 movaps %xmm0,-0x50(%rbp)
113c4: 0f 28 45 b0 movaps -0x50(%rbp),%xmm0
113c8: 0f 11 85 68 ff ff ff movups %xmm0,-0x98(%rbp)
113cf: 48 8d bd c0 fe ff ff lea -0x140(%rbp),%rdi
113d6: 48 8d b5 60 ff ff ff lea -0xa0(%rbp),%rsi
113dd: e8 5e 86 02 00 callq 39a40 <_ZN61_$LT$alloc..heap..Heap$u20$as$u20$alloc..allocator..Alloc$GT$3oom17h2dd421ed98fddc30E>
(55 bytes)
There are ~100 call-sites to __rust_alloc, so this causes 3 kB of bloat.
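(As a rough sanity check on that figure using the two sequences above: 55 - 22 = 33 extra bytes per call site, and 33 bytes x ~100 call sites is roughly 3.3 kB.)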
__rust_allocate and __rust_deallocate (in the old code) were marked as nounwind, but in the new code the corresponding entry points are allowed to unwind. I think this leads to most of the exception-table bloat, and some code bloat.
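As a rough illustration of why unwind-ability costs space (an invented example, not one of the measured call sites): if the allocation call inside Vec::push may unwind, the compiler must emit landing-pad cleanup that drops the vector's existing elements, and that cleanup metadata is exactly what lives in .eh_frame and .gcc_except_table.

// Illustrative only: append() is a made-up helper.
fn append(v: &mut Vec<String>, s: String) {
    // push may reallocate; if the allocator entry points can unwind, drop glue
    // for the existing Strings (and for s) must be reachable from a landing pad.
    v.push(s);
}

fn main() {
    let mut v = Vec::new();
    append(&mut v, "hello".to_string());
    assert_eq!(v.len(), 1);
}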
I remember seeing more inefficiencies, but I didn't write them down.
I'll try to build a rustc with __rust_alloc marked as nounwind and see how it fares.
I'm having some problems reproducing the exact sizes locally - I'll try more precisely tomorrow - but marking __rust_dealloc and __rust_oom as nounwind saved 24k of space (that's 50% of the overall increase) in one version.
I think I have a summary of the situation:
~3000b of "OOM-related" bloat
2680b of non-dealloc unwind bloat
22788b of dealloc-related unwind bloat
14544b jemalloc-related bloat
---------------------------------
43012b explained bloat
7836b unexplained bloat
=================================
50848b total bloat
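(The components do add up: 3000 + 2680 + 22788 + 14544 = 43012, and 43012 + 7836 = 50848.)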
The "jemalloc-related bloat" is an increase in the size of liballoc_jemalloc
, and it seems to be caused by things like quarantine being enabled. It did not come with the first allocator PR.
I'm not entirely sure what the unexplained bloat is or where it hides, but it's only 8k so I think I'll let it slide.
Thanks for the analysis @arielb1! It sounds to me though like there's not necessarily a huge amount to do here. The jemalloc bloat isn't something that should be a problem (small projects turn it off anyway), and the dealloc-related "unwind bloat" is more of a correctness fix right now than anything else. Also, if projects care about binary size they're probably compiling with -C panic=abort.
That just leaves 8k of unexplained size along with 3k of OOM-related stuff, both of which seem quite reasonable and ripe for future optimization.
In that sense, given the lack of a clear actionable bug, is it time to close this?
In which cases can __rust_dealloc unwind? Doesn't it just call the jemalloc function, which can never unwind?
So I think the 8k bloat was caused by other changes from 1143eb26a2e405cdccbf6789c89d4581ad969868 to 6f4ab9458a7ad06c8ce630604f533c8c0c0acef4, and I don't think there's anything interesting left for me to investigate.
I'm leaving this open for T-libs to decide what to do about it.
Seeing how no global allocators right now can unwind (system/jemalloc) I've proposed https://github.com/rust-lang/rust/pull/44049 to fix the regression. I'll open a further issue for more discussion here related to unwinding allocators.
This is fixed on beta, removing from milestone.
Between rustc 1.19.0-nightly (10d7cb44c 2017-06-18) and 1.20.0-nightly (445077963 2017-06-20), one of my binaries grew from 3.4M to 3.7M, which is approximately 8.8%. At first I was scared that this was fallout from my own PR (#42716), but that particular binary hardly uses C strings at all. I examined a diff of the symbols made with nm and saw that the binary compiled with 1.20.0-nightly (445077963 2017-06-20) has a lot more symbols like these:
- _<alloc::raw_vec::RawVec<T, A>>::dealloc_buffer (14 occurrences)
- _alloc::allocator::Alloc::alloc_array (57 occurrences)
- _alloc::allocator::Alloc::realloc_array (28 occurrences)
- _core::ptr::drop_in_place (a mighty 138 new occurrences)
This leads us to the possibility that #42313 is the culprit.
Current Situation
There are 2 reproducers, and I need to make some measurements: