Closed alexbool closed 7 years ago
The blowup in binary size is the minor difference. The main difference is in the compilation time and the run time of the resulting binary, which are noticeably worse.
cc @alexcrichton @pnkfelix
@alexbool can you enable us to reproduce this? I can investigate in parallel but bug reports tend to be much more readily fixable if there's a way to reproduce what caused the bug report!
Looks like this isn't a regression for "hello world", or is likely just "in the noise":
$ rustc +stable -V
rustc 1.18.0 (03fc9d622 2017-06-06)
$ rustc +stable foo.rs && strip foo && ls -alh foo
-rwxrwxr-x 1 alex alex 316K Jun 21 18:47 foo
$ rustc +beta -V
rustc 1.19.0-beta.2 (a175ee509 2017-06-15)
$ rustc +beta foo.rs && strip foo && ls -alh foo
-rwxrwxr-x 1 alex alex 368K Jun 21 18:47 foo
$ rustc +nightly -V
rustc 1.20.0-nightly (622e7e648 2017-06-21)
$ rustc +nightly foo.rs && strip foo && ls -alh foo
-rwxrwxr-x 1 alex alex 324K Jun 21 18:47 foo
I'll try to make a minimal reproducible example today or later this week
@alexbool any updates on this?
@leonardo-m or @hinaria do y'all have examples of this causing a size increase?
Unfortunately I haven't had much time to dig into this. Will try as soon as I have the chance.
Here is a test case that may serve to exhibit the space increase that @alexbool observed around this time.
(Or at least, on my Mac, building without optimizations, I am observing a size increase of about 1.08x on the stripped binaries. With optimizations, the size increase is about 1.05x)
// vecel.rs
use std::env;

fn main() {
    let mut args = env::args();
    args.next();
    let arg1 = args.next().unwrap();
    let num: usize = arg1.parse().unwrap();
    let mut v = Vec::new();
    for i in 0..num {
        v.push(i);
    }
    assert_eq!(v.len(), num);
}
Build script:
#!/bin/sh
# vecel.sh
f() {
    OUT=vecel.$(rustc --version | sed -e 's/(//' -e 's/)//' | cut -d ' ' -f 4).bin
    rustc --version && rustc ~/Dev/Rust/vecel.rs -o $OUT && strip $OUT
}

for day in 19 20 21 22 23 24 25 26 27; do
    rustup default nightly-2017-06-$day
    f
done

CHAN=stable;  rustc +$CHAN ~/Dev/Rust/vecel.rs -o vecel.$CHAN.bin && strip vecel.$CHAN.bin
CHAN=beta;    rustc +$CHAN ~/Dev/Rust/vecel.rs -o vecel.$CHAN.bin && strip vecel.$CHAN.bin
CHAN=nightly; rustc +$CHAN ~/Dev/Rust/vecel.rs -o vecel.$CHAN.bin && strip vecel.$CHAN.bin
ls -alh /tmp/vecel.*.bin
Results:
-rwxr-xr-x 1 fklock wheel 253K Jun 27 13:35 /tmp/vecel.2017-06-18.bin
-rwxr-xr-x 1 fklock wheel 253K Jun 27 13:35 /tmp/vecel.2017-06-19.bin
-rwxr-xr-x 1 fklock wheel 266K Jun 27 13:35 /tmp/vecel.2017-06-20.bin
-rwxr-xr-x 1 fklock wheel 266K Jun 27 13:35 /tmp/vecel.2017-06-21.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-22.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-23.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-24.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-25.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.2017-06-26.bin
-rwxr-xr-x 1 fklock wheel 269K Jun 27 13:35 /tmp/vecel.beta.bin
-rwxr-xr-x 1 fklock wheel 274K Jun 27 13:35 /tmp/vecel.nightly.bin
-rwxr-xr-x 1 fklock wheel 253K Jun 27 13:35 /tmp/vecel.stable.bin
My previous comment notwithstanding, it would be really great if @alexbool could provide either the original test case or a somewhat reduced version of it. It is entirely possible that my (strawman) object-size microbenchmark does not actually reflect the real issues underpinning the size increase that @alexbool is observing.
@alexcrichton, a small increase in binary size is not important to me, but I am seeing significant increases in the run time of the binary. I've filed a different issue, but the cause could be the same:
@pnkfelix oh note that I added a bunch of #[inline] attributes in https://github.com/rust-lang/rust/pull/42727, which may help with the perf related to that.
This pathological case produces an ~8.5% size increase in release mode between 2017-06-19 and 2017-06-23:
main.rs:
#[macro_use]
extern crate serde;
extern crate serde_yaml;
use std::fs::OpenOptions;
#[derive(Debug, Deserialize)]
pub struct Something {
    field1: Vec<S1>,
    field2: Vec<S2>,
    field3: Vec<S3>,
    field4: Vec<S4>,
    field5: Vec<S5>,
    field6: Vec<S6>,
    field7: Vec<S7>,
    field8: Vec<S8>,
    field9: Vec<S9>,
}
#[derive(Debug, Deserialize)]
pub struct S1(String);
#[derive(Debug, Deserialize)]
pub struct S2(String);
#[derive(Debug, Deserialize)]
pub struct S3(String);
#[derive(Debug, Deserialize)]
pub struct S4(String);
#[derive(Debug, Deserialize)]
pub struct S5(String);
#[derive(Debug, Deserialize)]
pub struct S6(String);
#[derive(Debug, Deserialize)]
pub struct S7(String);
#[derive(Debug, Deserialize)]
pub struct S8(String);
#[derive(Debug, Deserialize)]
pub struct S9(String);
fn main() {
    println!(
        "{:?}",
        serde_yaml::from_reader::<_, Something>(OpenOptions::new().open("whatever").unwrap())
    );
}
Cargo.toml:
[package]
name = "issue-42808"
version = "0.1.0"
authors = ["Alexander Bulaev <alexbool@yandex-team.ru>"]
[dependencies]
serde = { version = "=1.0.8", features = ["derive"] }
serde_yaml = "=0.7.1"
I have a datapoint to provide.
For ease of experimentation, I made a small variation on alexbool's benchmark that avoids pulling in serde. (Basically, my goal was to have a single .rs file that could be independently compiled without any crate dependencies.) I think my benchmark is likely to get at the heart of what is problematic here (namely that the code-duplication from monomorphization of generics can amplify the effect of otherwise minor code size regressions).
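To make that amplification point concrete, here is a small illustrative sketch (invented for this write-up, not code from the issue; fill() and the S-types are made-up names): each distinct element type drags its own monomorphized copies of Vec::push, the RawVec growth path, and core::ptr::drop_in_place into the binary, so a small per-instantiation regression gets multiplied by the number of types.

// Illustrative sketch only.
struct S1(String);
struct S2(String);

fn fill<T>(n: usize, mut make: impl FnMut(usize) -> T) -> Vec<T> {
    let mut v = Vec::new();
    for i in 0..n {
        v.push(make(i)); // one copy of push (and its allocator plumbing) per concrete T
    }
    v
}

fn main() {
    let a = fill(10, |i| S1(i.to_string()));
    let b = fill(10, |i| S2(i.to_string()));
    assert_eq!(a.len() + b.len(), 20);
}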
Manually inspecting the output assembly and comparing what we generated before and after the Allocator API landed, I saw cases where we could/should be inlining more. So I tried comparing against the branch that is the basis for PR #42727, which adds some #[inline] directives we will definitely need in order to address the execution-time regressions. Unfortunately, that inlining happens to make the overall code size here worse.
Another thing I noticed was that we seem to be spending a number of instructions just shuffling words around. It wasn't immediately obvious what all the causes were, but one clear candidate was the AllocErr struct that now needs to be passed around when we return the Err case. So as an experiment, I tried making the AllocErr struct zero-sized (i.e. removing all of its content, which is probably not what we want to do long term). I also added inlines of Layout::repeat and Layout::array.
rust variant | resulting code size (bytes)
---|---
just before Allocator API | 298,840
just after Allocator API | 315,792
post #42727 | 323,896
post #42727 + making AllocErr zero-sized + more Layout inlines | 311,296
So this shows that some of the code size regression can perhaps be blamed on the change to alloc's return type; it used to be *mut u8, and now it is Result<*mut u8, AllocErr> (and I'm curious whether it goes down further if you make that Result<NonZero<*mut u8>, AllocErr>, though we have already been through the arguments against using NonZero here...)
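To make the shape of that change concrete, here is an illustrative sketch with made-up names (not the real libstd signatures): the old entry point signalled failure through a null pointer, the new one returns a Result whose error payload has to be constructed and moved at every call site, and a zero-sized error type brings the shape back close to the old one.

// Hedged sketch; AllocErrWithPayload / AllocErrZeroSized / alloc_* are made-up names.
#[derive(Debug)]
struct AllocErrWithPayload([usize; 2]); // stand-in for an error carrying a Layout-sized payload
#[derive(Debug)]
struct AllocErrZeroSized;               // stand-in for the zero-sized experiment

// Old shape: failure is signalled by a null pointer.
fn alloc_old(fail: bool) -> *mut u8 {
    if fail { std::ptr::null_mut() } else { Box::into_raw(Box::new(0u8)) }
}

// New shape: failure carries a payload that every call site must materialize and move.
fn alloc_new(fail: bool) -> Result<*mut u8, AllocErrWithPayload> {
    if fail { Err(AllocErrWithPayload([8, 1])) } else { Ok(Box::into_raw(Box::new(0u8))) }
}

// Experiment: a zero-sized error makes the Result cheap to pass around again.
fn alloc_zst(fail: bool) -> Result<*mut u8, AllocErrZeroSized> {
    if fail { Err(AllocErrZeroSized) } else { Ok(Box::into_raw(Box::new(0u8))) }
}

fn main() {
    assert!(alloc_old(true).is_null());
    assert!(alloc_new(true).is_err());
    assert!(alloc_zst(true).is_err());
    // free the successful allocations so the example does not leak
    unsafe {
        drop(Box::from_raw(alloc_old(false)));
        drop(Box::from_raw(alloc_new(false).unwrap()));
        drop(Box::from_raw(alloc_zst(false).unwrap()));
    }
}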
Update: I attempted to recreate my above results and discovered that I had posted my original comment without finishing a crucial sentence: the 311,296 figure was gathered from a build that both made AllocErr zero-sized and marked more operations of Layout as #[inline].
Thanks for the investigation and minimization @pnkfelix! I'm still pretty surprised about the increase in code there! I think it may be worth digging in a bit more to see where this "code bloat" is coming from. I would suspect that these are legitimately new locations for more Layout validation, but it would be good to validate that claim itself. If no one else gets around to this I'll try to start investigating after #42727 lands.
One thing that I keep seeing in the diffs, regardless of how I attempt to "correct" for whatever changes the trait Allocator injected: there continues to be a big increase in the number of core::ptr::drop_in_place definitions.
Building my benchmark from just before Allocator landed yields 20 instantiated definitions of core::ptr::drop_in_place in its assembly output (many of them quite trivial and identical bits of code, sigh). Right after Allocator landed, there were 31 instantiated definitions of core::ptr::drop_in_place. But sometime after that, it blew up to 49 instantiated definitions. Not really sure when that happened.
Hm ok so now I'm getting interesting results again:
Using foo.rs as what @pnkfelix listed above, I get:
$ rustc +nightly -V
rustc 1.20.0-nightly (696412de7 2017-07-06)
$ rustc +nightly foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 483K Jul 7 08:29 foo
$ rustc +nightly-2017-06-01 foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 476K Jul 7 08:29 foo
However, removing derive(Debug), I get:
$ rustc +nightly foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 459K Jul 7 08:30 foo
$ rustc +nightly-2017-06-01 foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 456K Jul 7 08:30 foo
Can this still be reproduced on the latest nightly?
Just a quick note for anyone trying to figure out how the numbers have been jumping up and down (at least, I was initially flummoxed when trying to compare @alexcrichton's results with the numbers I provided 12 days ago).
The original bug report was comparing rustc 1.19.0-nightly (10d7cb44c 2017-06-18) against 1.20.0-nightly (445077963 2017-06-20). So it was a bit confusing that @alexcrichton used nightly-2017-06-01 as a reference point, because the object size changed quite a bit between May 31st and June 18th, as illustrated here (where foo.rs is the file @alexcrichton came up with that lacks derive(Debug)):
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-06-01" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.19.0-nightly (e0cc22b4b 2017-05-31)
-rwxr-xr-x 1 pnkfelix pnkfelix 450K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-06-19" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.19.0-nightly (10d7cb44c 2017-06-18)
-rwxr-xr-x 1 pnkfelix pnkfelix 392K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-06-21" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.20.0-nightly (445077963 2017-06-20)
-rwxr-xr-x 1 pnkfelix pnkfelix 416K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-07-07" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.20.0-nightly (696412de7 2017-07-06)
-rwxr-xr-x 1 pnkfelix pnkfelix 453K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-07-11" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.20.0-nightly (bf0a9e0b4 2017-07-10)
-rwxr-xr-x 1 pnkfelix pnkfelix 453K Jul 11 18:02 /tmp/foo
%
In case it's hard to see, here's the important sequence of numbers:
5/31 | 6/18 | 6/20 | 7/07 | 7/11
---|---|---|---|---
450K | 392K | 416K | 453K | 453K
So, there was an observable regression (and it still seems to be present), but it is potentially masked by other improvements to code size that happened between May 31st and June 18th.
Oh I was actually just picking random dates, looks like I "got lucky"!
Also, another factor that is easy to overlook: in the earlier comments from both @alexcrichton and myself, our first invocations of strip had no additional options. The later invocations were strip -g, which only strips debugging symbols. The difference appears to be quite significant, at least on the binaries we are producing today (on a current nightly compiling hello world, the former yields a 348K binary, the latter a 432K binary).
In an effort to try to understand the size trajectories over time, I generalized the benchmark so that one can choose whether one wants to compile it with 1, 2, 3, 11, or 21 distinct variations on the S_ type (and a corresponding number of vectors in the struct Something), and then I compiled each variant with every nightly version of the compiler from May 1st through July 13th.
My intention was to differentiate between a couple of different things:
- How does the cost of a trivial fn main() { } change over time?
- What is the cost of a struct Something that carries a single vector (with that same loop for filling its contents)?
- What is the cost of two vectors in struct Something? How about three?
- As the number of distinct struct S_ (and Something fields and Vec<S_> instantiations) increases, what is the (amortized) cost per S_?

Here is a gist with the program, a shell script to drive it, and two sets of output transcripts (one that uses strip -g, another using strip):
https://gist.github.com/pnkfelix/fe632a4bad0759c52e0001efde83e038
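For readers who do not want to open the gist, here is a rough sketch of the general shape such a benchmark could take (a reconstruction, not the gist's actual code; the macro, the type names, and the exact cfg flags are assumptions, although a --cfg twenty_one flag is mentioned later in this thread):

// Illustrative sketch only; the real benchmark lives in the gist above.
macro_rules! def_types {
    ($($name:ident),*) => { $(pub struct $name(pub String);)* };
}

// Select how many distinct S_ types (and thus Vec<S_> instantiations) get built.
#[cfg(three)]
def_types!(S1, S2, S3);
#[cfg(not(three))]
def_types!(S1);

fn main() {
    let mut v: Vec<S1> = Vec::new();
    for i in 0..10 {
        v.push(S1(i.to_string()));
    }
    assert_eq!(v.len(), 10);
}

One would then compile the same file repeatedly with different --cfg flags and toolchains and compare the stripped sizes, which is what the transcripts in the gist record.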
Here are some observations based on looking at both the development of the numbers over time (i.e. scanning down the columns) and the differences between the variants:
- Sometimes the size does not change at all when going from two S_ variations to three S_ variations. (I assume this can be explained by the presence of padding between function definition(s) that allows code to increase in size "for free" sometimes.)
- It is probably better to use objdump rather than ls for the fine-grained analysis.
- Between (cfb5debbc 2017-06-12) and (03abb1bd7 2017-06-13), the baseline cost of a trivial fn main dropped quite a bit. This potentially masked later regressions, and it's been creeping back up since then.
- Between (10d7cb44c 2017-06-18) and (3bfc18a96 2017-06-29) (i.e., when the trait Alloc API landed, plus some other things I have not yet identified), the amortized cost per S_ type variation jumped from 409 bytes to 1638 bytes when measured with strip; with strip -g, the jump was from 510 bytes to (about) 2000 bytes.

Thanks for the scripts and investigation @pnkfelix! Lots of good data.
Between (cfb5debbc 2017-06-12) and (03abb1bd7 2017-06-13), the baseline cost of a trivial fn main dropped quite a bit. This potentially masked later regressions, and it's been creeping back up since then.
This range of commits notably includes https://github.com/rust-lang/rust/pull/42566, which probably just means that jemalloc 4.0.0 is a smaller binary than jemalloc 4.5.0.
I tweaked the script/source file to account for this and always link the system allocator, and got these results, which notably show the same point of regression you saw, and that the addition of #[inline] didn't actually help and may have worsened it!
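For anyone reproducing this, one plausible way to force the system allocator on a 2017-era nightly was the alloc_system crate behind its feature gate; the following is a hedged sketch of that approach, not necessarily the exact tweak made to the script above.

// Sketch of opting out of jemalloc on a 2017 nightly so allocator-library size
// changes stop dominating the comparison (this was a nightly-only feature at the time).
#![feature(alloc_system)]
extern crate alloc_system;

fn main() {
    // same kind of benchmark body as before; only the global allocator selection differs
    let v: Vec<String> = (0..10).map(|i| i.to_string()).collect();
    assert_eq!(v.len(), 10);
}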
cc @rust-lang/compiler, with @pnkfelix away we're going to need some help tackling this one.
Needs an assignee. @pnkfelix is away.
Discussed in compiler meeting. @alexcrichton is going to attempt to get more profiling information.
So in terms of regressions I think one of the most actionable right now is the regression in compile time. If you take a look at the associated date range for when https://github.com/rust-lang/rust/pull/42313 landed, there are two regressions of note, one of them in the syntex_syntax crate specifically.
So taking a look specifically at the second regression, with some help from dmajor on IRC we've reached the conclusion that rustc_trans-bb5560cf0f088d86.dll!rustc_trans::mir::trans_mir::{{closure}} grew about 30% in the profiles.
The compiler team looked at this during triage. These are very interesting findings already! Maybe we can find out what exactly is slower in trans_mir()? More detailed profiling of that would be interesting.
cc @arielb1, is this something you could help with?
I can try
Not sure if this adds to what's already known, but I've been testing a tool to compare DWARF debug info. I'm looking at the differences between nightly-2017-06-20 and nightly-2017-06-21 for the test case that @pnkfelix gave, using --cfg twenty_one.
The main size difference is for RawVec::double, which seems to be because we're inlining some function specializations instead of being able to call general allocator functions:
- fn alloc::raw_vec::RawVec<foo::S1>::double<foo::S1>
- size: 109
- inlined subroutines:
- [3] alloc::raw_vec::RawVec<foo::S1>::ptr<foo::S1>
- [10] alloc::heap::reallocate
- [21] alloc::heap::allocate
- calls:
- 0x18b10 alloc_jemalloc::imp::__rust_allocate
- 0x189f0 alloc::oom::oom
- 0x18b40 alloc_jemalloc::imp::__rust_reallocate
+ fn alloc::raw_vec::RawVec<foo::S1, alloc::heap::HeapAlloc>::double<foo::S1,alloc::heap::HeapAlloc>
+ size: 644
+ inlined subroutines:
+ [325] alloc::allocator::Alloc::realloc_array<alloc::heap::HeapAlloc,foo::S1>
+ [2] alloc::allocator::Alloc::oom<alloc::heap::HeapAlloc>
+ [184] alloc::allocator::Alloc::alloc_array<alloc::heap::HeapAlloc,foo::S1>
+ calls:
+ 0x1e190 alloc::allocator::Layout::from_size_align
+ 0x3bb20 core::panicking::panic
+ 0x1e1e0 alloc::allocator::Layout::repeat
+ 0x1e250 alloc::allocator::AllocErr::invalid_input
+ 0x1e1d0 alloc::allocator::Layout::size
+ 0x1e270 alloc::heap::{{impl}}::alloc
+ 0x1e190 alloc::allocator::Layout::from_size_align
+ 0x1e1e0 alloc::allocator::Layout::repeat
+ 0x1e190 alloc::allocator::Layout::from_size_align
+ 0x1e1e0 alloc::allocator::Layout::repeat
+ 0x1e250 alloc::allocator::AllocErr::invalid_input
+ 0x1e1d0 alloc::allocator::Layout::size
+ 0x1e1d0 alloc::allocator::Layout::size
+ 0x1e2f0 alloc::heap::{{impl}}::realloc
There's also a large size difference in main:
fn foo::main
- size: 2378
+ size: 3306
It's mostly due to changes in the size of the inlined push:
- [59] alloc::vec::Vec<foo::S2>::push<foo::S2>
+ [107] alloc::vec::Vec<foo::S2>::push<foo::S2>
[7] alloc::vec::{{impl}}::deref_mut<foo::S2>
- [7] alloc::raw_vec::RawVec<foo::S2>::ptr<foo::S2>
+ [7] alloc::raw_vec::RawVec<foo::S2, alloc::heap::HeapAlloc>::ptr<foo::S2,alloc::heap::HeapAlloc>
+ [4] core::ptr::{{impl}}::offset<foo::S2>
- [17] core::ptr::write<foo::S2>
+ [23] core::ptr::write<foo::S2>
+ [17] core::ptr::drop_in_place<foo::S2>
+ [17] core::ptr::drop_in_place<alloc::string::String>
I can dig further into other compiler versions if this sort of info is useful.
Thanks for the info @philipc! I think I'm noticing a few things with a fresh set of eyes, I'll send a PR soon.
I've opened https://github.com/rust-lang/rust/pull/43815 with the results of micro-analyzing RawVec::double.
@alexbool's example had increased from 675k to 720k by the allocator PR, had a bit of upward creep, and was decreased from 732k to 723k by #43815, so there's still much to improve.
**** YAML
UNSTRIPPED: 4070392
STRIPPED: 675304
**** VECEL
UNSTRIPPED NODI: 3645592
text data bss dec hex filename
302104 14329 3704 320137 4e289 vecel
STRIPPED NODI: 322952
text data bss dec hex filename
302104 14329 3704 320137 4e289 vecel
UNSTRIPPED DI: 3667864
text data bss dec hex filename
302138 14329 3704 320171 4e2ab vecel
STRIPPED DI: 323040
text data bss dec hex filename
302138 14329 3704 320171 4e2ab vecel
rustc 1.19.0-nightly (1143eb26a 2017-06-20) - allocators PR
**** YAML
UNSTRIPPED: 4150288
STRIPPED: 720360
**** VECEL
UNSTRIPPED NODI: 3687424
text data bss dec hex filename
312056 14329 3704 330089 50969 vecel
STRIPPED NODI: 331144
text data bss dec hex filename
312056 14329 3704 330089 50969 vecel
UNSTRIPPED DI: 3714328
text data bss dec hex filename
312130 14329 3704 330163 509b3 vecel
STRIPPED DI: 331232
text data bss dec hex filename
312130 14329 3704 330163 509b3 vecel
rustc 1.20.0-nightly (1685c9298 2017-07-06) - second allocators PR
**** YAML
UNSTRIPPED: 4338000
STRIPPED: 732648
**** VECEL
UNSTRIPPED NODI: 3882392
text data bss dec hex filename
338048 14441 3704 356193 56f61 vecel
STRIPPED NODI: 355720
text data bss dec hex filename
338048 14441 3704 356193 56f61 vecel
UNSTRIPPED DI: 3914568
text data bss dec hex filename
338074 14441 3704 356219 56f7b vecel
STRIPPED DI: 355808
text data bss dec hex filename
338074 14441 3704 356219 56f7b vecel
rustc 1.21.0-nightly (6f4ab9458 2017-08-16) - after @alexcrichton's improvement
**** YAML
UNSTRIPPED: 4316480
STRIPPED: 723616
**** VECEL
UNSTRIPPED NODI: 3869144
text data bss dec hex filename
335444 14433 3704 353581 5652d vecel
STRIPPED NODI: 354912
text data bss dec hex filename
335444 14433 3704 353581 5652d vecel
UNSTRIPPED DI: 3897952
text data bss dec hex filename
335590 14433 3704 353727 565bf vecel
STRIPPED DI: 354992
text data bss dec hex filename
335590 14433 3704 353727 565bf vecel
I think someone needs to do a comparative analysis between the 1143eb26a and 1685c9298 versions - I bet much of that bloat is still left in today's rustc.
More accurate size statistics
Sizes:
section size addr
.interp 28 624
.note.ABI-tag 32 652
.note.gnu.build-id 36 684
.gnu.hash 176 720
.dynsym 2472 896
.dynstr 1627 3368
.gnu.version 206 4996
.gnu.version_r 336 5208
.rela.dyn 24288 5544
.rela.plt 1968 29832
.init 23 31800
.plt 1328 31824
.plt.got 16 33152
.text 473545 33168
.fini 9 506716
.rodata 41952 506752
.debug_gdb_scripts 34 548704
.eh_frame_hdr 8724 548740
.eh_frame 41596 557464
.gcc_except_table 39432 599060
.tdata 304 2735936
.init_array 16 2736240
.fini_array 8 2736256
.jcr 8 2736264
.data.rel.ro 19672 2736288
.dynamic 560 2755960
.got 72 2756520
.got.plt 680 2756608
.data 425 2757312
.bss 3704 2757760
.comment 92 0
Total 663369
issue-42808-with-di-after-ai.strip :
section size addr
.interp 28 624
.note.ABI-tag 32 652
.note.gnu.build-id 36 684
.gnu.hash 176 720
.dynsym 2472 896
.dynstr 1627 3368
.gnu.version 206 4996
.gnu.version_r 336 5208
.rela.dyn 24336 5544
.rela.plt 1968 29880
.init 23 31848
.plt 1328 31872
.plt.got 16 33200
.text 499305 33216
.fini 9 532524
.rodata 42336 532544
.debug_gdb_scripts 34 574880
.eh_frame_hdr 9020 574916
.eh_frame 42892 583936
.gcc_except_table 53896 626828
.tdata 304 2780960
.init_array 16 2781264
.fini_array 8 2781280
.jcr 8 2781288
.data.rel.ro 19720 2781312
.dynamic 560 2801032
.got 72 2801592
.got.plt 680 2801664
.data 425 2802368
.bss 3704 2802816
.comment 92 0
Total 705665
issue-42808-with-di-modern.strip :
section size addr
.interp 28 624
.note.ABI-tag 32 652
.note.gnu.build-id 36 684
.gnu.hash 176 720
.dynsym 2472 896
.dynstr 1627 3368
.gnu.version 206 4996
.gnu.version_r 336 5208
.rela.dyn 24528 5544
.rela.plt 1968 30072
.init 23 32040
.plt 1328 32064
.plt.got 16 33392
.text 501945 33408
.fini 9 535356
.rodata 42656 535424
.debug_gdb_scripts 34 578080
.eh_frame_hdr 9988 578116
.eh_frame 46988 588104
.gcc_except_table 56644 635092
.tdata 304 2792448
.init_array 16 2792752
.fini_array 8 2792768
.jcr 8 2792776
.data.rel.ro 19824 2792800
.dynamic 576 2812624
.got 752 2813200
.data 425 2813952
.bss 3704 2814400
.comment 92 0
Total 716749
So the sections that grew were (sizes in the order of the three listings above):
.eh_frame_hdr        8724    9020    9988
.eh_frame           41596   42892   46988
.text              473545  499305  501945
.gcc_except_table   39432   53896   56644
OK. So, reasons for the bloat I've found:
Filling out an AllocErr for oom takes code size. The previous instruction sequence was just:
f1eb: e8 10 40 04 00 callq 53200 <__rust_allocate>
f1f0: 49 89 c6 mov %rax,%r14
f1f3: 4d 85 f6 test %r14,%r14
f1f6: 0f 85 d3 02 00 00 jne f4cf
f1fc: e8 8f 3b 04 00 callq 52d90 <_ZN5alloc3oom3oom17h5b02814f1abf9784E>
(22 bytes)
But the new instruction sequence is more of the form:
113ab: e8 e0 ab 02 00 callq 3bf90 <__rust_alloc>
113b0: 48 85 c0 test %rax,%rax
113b3: 0f 85 d6 02 00 00 jne 1168f
113b9: 0f 10 85 68 ff ff ff movups -0x98(%rbp),%xmm0
113c0: 0f 29 45 b0 movaps %xmm0,-0x50(%rbp)
113c4: 0f 28 45 b0 movaps -0x50(%rbp),%xmm0
113c8: 0f 11 85 68 ff ff ff movups %xmm0,-0x98(%rbp)
113cf: 48 8d bd c0 fe ff ff lea -0x140(%rbp),%rdi
113d6: 48 8d b5 60 ff ff ff lea -0xa0(%rbp),%rsi
113dd: e8 5e 86 02 00 callq 39a40 <_ZN61_$LT$alloc..heap..Heap$u20$as$u20$alloc..allocator..Alloc$GT$3oom17h2dd421ed98fddc30E>
(55 bytes)
There are ~100 call-sites to __rust_alloc, so this causes 3 kB of bloat.
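(As a rough sanity check on that figure using the two sequences above: 55 - 22 = 33 extra bytes per call site, and 33 bytes x ~100 call sites is roughly 3.3 kB.)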
__rust_allocate and __rust_deallocate (in the old code) were marked as nounwind, but in the new code the corresponding entry points are allowed to unwind. I think this leads to most of the exception-table bloat, and some code bloat.
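As a rough illustration of why unwind-ability costs space (an invented example, not one of the measured call sites): if the allocation call inside Vec::push may unwind, the compiler must emit landing-pad cleanup that drops the vector's existing elements, and that cleanup metadata is exactly what lives in .eh_frame and .gcc_except_table.

// Illustrative only: append() is a made-up helper.
fn append(v: &mut Vec<String>, s: String) {
    // push may reallocate; if the allocator entry points can unwind, drop glue
    // for the existing Strings (and for s) must be reachable from a landing pad.
    v.push(s);
}

fn main() {
    let mut v = Vec::new();
    append(&mut v, "hello".to_string());
    assert_eq!(v.len(), 1);
}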
I remember seeing more inefficiencies, but I didn't write them down.
I'll try to build a rustc with __rust_alloc marked as nounwind and see how it fares.
I'm having some problems reproducing the exact sizes locally - I'll try more precisely tomorrow - but marking __rust_dealloc and __rust_oom as nounwind saved 24k of space (that's 50% of the overall increase) in one version.
I think I have a summary of the situation:
~3000b of "OOM-related" bloat
2680b of non-dealloc unwind bloat
22788b of dealloc-related unwind bloat
14544b jemalloc-related bloat
---------------------------------
43012b explained bloat
7836b unexplained bloat
=================================
50848b total bloat
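(The components do add up: 3000 + 2680 + 22788 + 14544 = 43012, and 43012 + 7836 = 50848.)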
The "jemalloc-related bloat" is an increase in the size of liballoc_jemalloc
, and it seems to be caused by things like quarantine being enabled. It did not come with the first allocator PR.
I'm not entirely sure what the unexplained bloat is or where it hides, but it's only 8k so I think I'll let it slide.
Thanks for the analysis @arielb1! It sounds to me though like there's not necessarily a huge amount to do here. The jemalloc bloat isn't something that should be a problem (small projects turn it off anyway), and the dealloc-related "unwind bloat" is more of a correctness fix right now than anything else. Also, if projects care about binary size they're probably compiling with -C panic=abort.
That just leaves 8k of unexplained size along with 3k of OOM-related stuff, both of which seem quite reasonable and ripe for future optimization.
In that sense, given the lack of a clear actionable bug, is it time to close this?
In which cases can __rust_dealloc unwind? Doesn't it just call the jemalloc function, which can never unwind?
So I think the 8k bloat was caused by other changes from 1143eb26a2e405cdccbf6789c89d4581ad969868 to 6f4ab9458a7ad06c8ce630604f533c8c0c0acef4, and I don't think there's anything interesting left for me to investigate.
I'm leaving this open for T-libs to decide what to do about it.
Seeing how no global allocators right now can unwind (system/jemalloc) I've proposed https://github.com/rust-lang/rust/pull/44049 to fix the regression. I'll open a further issue for more discussion here related to unwinding allocators.
This is fixed on beta, removing from milestone.
Between rustc 1.19.0-nightly (10d7cb44c 2017-06-18) and 1.20.0-nightly (445077963 2017-06-20), one of my binaries grew from 3.4M to 3.7M, which is approximately 8.8%. At first I was scared that this was fallout from my own PR (#42716), but that particular binary hardly uses C strings at all. I examined a diff of the symbols made with nm and saw that the binary compiled with 1.20.0-nightly (445077963 2017-06-20) has a lot more symbols like these:
- _<alloc::raw_vec::RawVec<T, A>>::dealloc_buffer (14 occurrences)
- _alloc::allocator::Alloc::alloc_array (57 occurrences)
- _alloc::allocator::Alloc::realloc_array (28 occurrences)
- _core::ptr::drop_in_place (a mighty 138 new occurrences)
This leads us to the possibility that #42313 is the culprit.
Current Situation
There are 2 reproducers, and I need to make some measurements: