uutils / coreutils

Cross-platform Rust rewrite of the GNU coreutils
https://uutils.github.io/
MIT License

Oversized executables #747

Open alexchandel opened 8 years ago

alexchandel commented 8 years ago

The uutils executables are a bit larger than their native counterparts. These are the stats on OS X with O3, LTO, and alloc_system:

Name Native uutils
base64 8.0K 200K
basename 8.0K 160K
cat 8.0K 204K
chmod 12K 524K
chroot 8.0K 220K
cksum 8.0K 180K
comm 8.0K 168K
cp 12K 204K
cut 8.0K 296K
dirname 8.0K 152K
du 12K 328K
echo 8.0K 132K
env 8.0K 148K
expand 8.0K 208K
expr 8.0K 168K
factor - 208K
false 4.0K 80K
fold 8.0K 196K
groups 8.0K 168K
hashsum - 596K
head 8.0K 196K
hostid - 148K
hostname 8.0K 192K
id 8.0K 192K
kill 8.0K 180K
link 8.0K 156K
ln 8.0K 208K
logname 8.0K 156K
mkdir 8.0K 192K
mkfifo 8.0K 164K
mv 8.0K 220K
nice 8.0K 176K
nl 8.0K 512K
nohup 8.0K 184K
nproc - 160K
od 16K 140K
paste 8.0K 200K
printenv 8.0K 160K
ptx - 668K
pwd 8.0K 164K
readlink 12K 192K
realpath - 192K
relpath - 200K
rm 8.0K 208K
rmdir 8.0K 176K
seq 8.0K 228K
shuf - 224K
sleep 8.0K 200K
sort 28K 220K
split 8.0K 228K
stdbuf - 244K
sum 8.0K 180K
sync 8.0K 148K
tac - 192K
tail 12K 200K
tee 8.0K 204K
test 8.0K 112K
timeout - 264K
touch 8.0K 200K
tr 12K 196K
true 4.0K 80K
truncate - 196K
tsort 8.0K 216K
tty 8.0K 160K
uname 8.0K 164K
unexpand 8.0K 212K
uniq 8.0K 220K
unlink 8.0K 164K
uptime 12K 204K
users 8.0K 160K
wc 8.0K 192K
whoami 8.0K 160K
yes 4.0K 160K

I think the funniest one is nl, which is 6300% larger than the native nl. jemalloc would've added another 230K to each of these.

I realize some of this is Rust's fault: when an optimized, LTO'd, alloc_system'd fn main(){println!("Hi!\n");} is still 84K, there's not much room to cut. For example, from the object dump/disassembly, about 9% of that dead weight was panicking code & string literals for the standard library :\ If we're really condemned to that, and to an 80K hello world, with all the implied overhead (and it clearly carries over proportionally, as seen above), then this raises serious doubts about Rust as a systems language.

But surely we can shed some of the remaining 196K/216K/etc off of tr/tsort/friends? The median size of the native executables is 8.0K.

ebfe commented 8 years ago

That's one of the reasons the multicall binary exists. As for individual binaries, I'm not sure what we can do except try to reduce the number of dependencies.

hexsel commented 8 years ago

The project could look into https://github.com/lrs-lang/lib, but it looks like a pretty big change and it may remove large amounts of cross-platform support.

ebfe commented 8 years ago

@hexsel I did :) Unfortunately it's Linux-only.

nathanross commented 8 years ago

The elephants in the room are static linking, ABI compatibility, and, as you mentioned, libstd.

Some demonstrations. All "results" blocks are generated using the following command:

strip main && stat --printf="%s bytes\n" main && ldd main | cut -d= -f1

All of the results are going to be pretty specific to x86_64. I'm running Debian stable.

All of the results below link against libc, the C standard library. You discuss suitability as a systems language: C is a systems language, but most of its userland applications, like Rust's, use a standard library that takes up hundreds of kilobytes.

The difference, as demonstrated below, is that cargo does not use dynamic linking by default, because it's expected that most users will not (at this point in time) have an ABI-compatible Rust standard library installed.

approaches

Rust, using libstd, static linking to libstd

command

echo 'fn main(){println!("Hello!\n");}' > main.rs
rustc -C opt-level=3 -C lto main.rs

results

290864 bytes
        linux-vdso.so.1 (0x00007fff6fd88000)
        libpthread.so.0 
        libgcc_s.so.1 
        libc.so.6 
        /lib64/ld-linux-x86-64.so.2 (0x00007f278dfd8000)

Rust, using libstd, dynamic linking to libstd

libstd is about 4.4 MB and has to be present on the operating system.

command

echo 'fn main(){println!("Hello!\n");}' > main.rs
rustc -C opt-level=3 -C prefer-dynamic main.rs

results

5528 bytes
        linux-vdso.so.1 (0x00007ffc9c70b000)
        libstd-17a8ccbd.so 
        libc.so.6 
        libdl.so.2 
        libpthread.so.0 
        libgcc_s.so.1 
        /lib64/ld-linux-x86-64.so.2 (0x00007f7e2c10c000)
        libm.so.6 
        librt.so.1 

Rust, no libstd framework

commands

wget https://raw.githubusercontent.com/rust-lang/rust/master/src/test/run-pass/smallest-hello-world.rs
rustc -C opt-level=3 -C lto -o main smallest-hello-world.rs

results

4992 bytes
        linux-vdso.so.1 (0x00007ffeae1bf000)
        libc.so.6 
        /lib64/ld-linux-x86-64.so.2 (0x00007f63f0832000)

C++, dynamic linking to libstdc++

libstdc++ is provided on my system by GCC and links directly against libm, so together they take up about 1.1 MB total.

commands

echo -e '#include <iostream>\nint main() { std::cout << "Hello!\\n"; return 0; }' > main.cpp
clang++ -Oz main.cpp -o main

results

5400 bytes
        linux-vdso.so.1 (0x00007ffd4d58e000)
        libstdc++.so.6 
        libm.so.6 
        libgcc_s.so.1 
        libc.so.6 
        /lib64/ld-linux-x86-64.so.2 (0x00007f943e40b000)

C

commands

echo -e '#include <stdio.h>\nint main() { printf("Hello!\\n"); }' > main.c
clang -Oz main.c -o main

results

4616 bytes
        linux-vdso.so.1 (0x00007fff4363f000)
        libc.so.6 
        /lib64/ld-linux-x86-64.so.2 (0x00007fc53febd000)

Can we just do everything dynamically?

Conceivably, if a distribution like Debian or Fedora has a libstdc++ that C++ programs can be compiled against, why not do the same for Rust?

Rust currently lacks ABI stability.

Well, it's the same and it isn't. C++ these days uses a (relatively much more) stable ABI - it usually only changes with a major standards change. This means that when libstdc++ is compiled with a slightly newer version of clang, you can install it on your system without also upgrading binaries that use libstdc++ and were compiled with an older version of clang.

When it has ABI stability, we'll be able to compete.

Rust doesn't have that yet; it's in the works, as the Rust developers know it's needed to be able to ship binaries that are not so tightly coupled. But in the meantime, if you have a library that was compiled with rustc v1.2, and you upgrade it to a new version compiled with rustc v1.5, all binaries and libraries that linked against that library now also need to be replaced with versions compiled with rustc v1.5.

At some point in the future, there will be a stable ABI, and some systems will begin installing libstd as a dependency for some other tool. And on systems that have libstd, not only will the footprint of uutils coreutils itself be a few hundred KB smaller, but we'll be able to painlessly split it into one binary per tool.

In the meantime

In the meantime, the best option is what we're doing right now - statically include the parts of libstd that we need, then use a "multicall install" that provides a binary for each tool via symbolic links.
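
As a rough, hypothetical sketch (not the project's actual code), the dispatch idea behind such a multicall binary is just to branch on the name it was invoked as, i.e. the symlink name in argv[0]; the tool names below are illustrative:

```rust
use std::env;
use std::path::Path;
use std::process::exit;

fn main() {
    let args: Vec<String> = env::args().collect();
    // The name the binary was invoked as (argv[0], i.e. the symlink name)
    // selects which tool to run.
    let tool = args
        .first()
        .map(|s| Path::new(s))
        .and_then(|p| p.file_stem())
        .and_then(|s| s.to_str())
        .unwrap_or("")
        .to_string();

    let code = match tool.as_str() {
        "true" => 0,
        "false" => 1,
        "echo" => {
            println!("{}", args[1..].join(" "));
            0
        }
        other => {
            eprintln!("coreutils: unrecognized utility '{}'", other);
            1
        }
    };
    exit(code);
}
```

Installed once and symlinked per tool, the same binary then behaves like each utility, which is what keeps the total on-disk footprint down.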

In terms of speed, the difference between dynamic and static linking is negligible.

nathanross commented 8 years ago

creating linkback for #140

alexchandel commented 8 years ago

@nathanross Dynamic linking is not the answer, nor is multicall.

Rust programs already dynamically link to libSystem on OS X, which provides the entire C standard library plus a multitude of other features. The solution is not to dynamically link libstd.

The "features" libstd provides on top of libSystem are minimal—primarily structural—and in trivial programs ought to be removable. And indeed they can be, as LRS-lang demonstrates, but this requires undoing design flaws from Rust.

And multicall works rather poorly on Windows. The solution instead is to judiciously code the binaries to stay as close to the 80K minimum as possible.

vadimcn commented 8 years ago

@alexchandel: how would a C library help with providing e.g. Rust-style string formatting?

A viable solution on Windows might be multicall built as a dylib, plus a small stub binary that just calls the main entry point in that library. The latter could use #![no_std] to ensure the smallest possible size.
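
As one hypothetical way such a stub could look (the dylib name uutils_multicall and the exported symbol uutils_main are illustrative, not the project's API, and the profile would need panic = "abort"; the exact entry-point plumbing also varies by platform):

```rust
#![no_std]
#![no_main]

use core::ffi::{c_char, c_int};

// Hypothetical dylib and symbol names; a real crate would define its own.
#[link(name = "uutils_multicall")]
extern "C" {
    fn uutils_main(argc: c_int, argv: *const *const c_char) -> c_int;
}

// With #![no_main], the C runtime's startup code still calls `main`,
// so we provide it ourselves and forward straight into the shared library.
#[no_mangle]
pub extern "C" fn main(argc: c_int, argv: *const *const c_char) -> c_int {
    unsafe { uutils_main(argc, argv) }
}

// A no_std binary must supply a panic handler, even though nothing here panics.
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```

The stub itself then contains almost nothing, and every per-tool binary (or symlink) shares the one library.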

alexchandel commented 8 years ago

@vadimcn It doesn't need to, because string formatting makes up relatively little of these binaries, and it can be inlined with relative ease (once you stop panicking).

If you actually read the disassembly for PROFILE=release cp, you'll find that the largest symbols by far (at 34% of the text section) are __ZN4copy20h24e32c79ba610ccdJmaE and __ZN6uumain20ha15e8d6f5b9ddecfceaE. And you'll notice that there are calls to a huge number of symbols that panic in unoptimizable ways. Many of these are from show_error, which makes two unoptimizable writes that inexplicably panic instead of silently failing or aborting, but there are just as many explicit panics.

For comparison, I'm working on an ls for coreutils that never panics and obviously uses its own print/show_error macros, and it's barely 130K yet does far more than cp. I haven't even gotten around to optimizing it yet.
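
For what it's worth, a minimal sketch of the non-panicking error reporting being described (not the project's actual macros; the name show_error is reused purely for illustration):

```rust
use std::io::Write;

// Report an error on stderr without ever panicking: if the write itself
// fails, there is nothing useful left to do, so the Result is dropped.
fn show_error(msg: &str) {
    let mut stderr = std::io::stderr();
    let _ = writeln!(stderr, "error: {}", msg);
}

fn main() {
    show_error("demonstration only");
}
```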

nathanross commented 8 years ago

Fascinating, @alexchandel - your continued investigation into, and passion about, this topic is greatly appreciated.

bachp commented 6 years ago

@alexchandel Have you done any additional tests since this issue was last discussed? It has been more than 2 years now and Rust has changed quite a lot. It would be interesting to see how the binary sizes have been affected.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

06kellyjac commented 3 years ago

Some notes from playing with the size:

I added:

cargo-features = ["strip"]

# ...

[profile.release]
strip = "symbols"
opt-level = 'z'
lto = true
codegen-units = 1
panic = 'abort'

# commands I ended up using
# make build-coreutils build-pkgs PROFILE=release
# fd . --type x --max-depth 1 ./target/release/ -x du {} | sort -k 2,2
# did some manual formatting to help with readability

3844  arch      -> 496  arch
4020  base32    -> 580  base32
4020  base64    -> 580  base64
3888  basename  -> 512  basename
3924  cat       -> 532  cat
3376  chgrp     -> 328  chgrp
3964  chmod     -> 544  chmod
3980  chown     -> 548  chown
3948  chroot    -> 548  chroot
3900  cksum     -> 516  cksum
3892  comm      -> 512  comm
14684 coreutils -> 4428 coreutils
4080  cp        -> 620  cp
5572  csplit    -> 1392 csplit
3964  cut       -> 552  cut
3976  date      -> 568  date
3972  df        -> 560  df
3388  dircolors -> 340  dircolors
3876  dirname   -> 508  dirname
3428  du        -> 376  du
3880  echo      -> 500  echo
3996  env       -> 556  env
3912  expand    -> 520  expand
3768  expr      -> 684  expr
3372  factor    -> 332  factor
3176  false     -> 232  false
3944  fmt       -> 540  fmt
3908  fold      -> 520  fold
3880  groups    -> 504  groups
5580  hashsum   -> 1432 hashsum
3980  head      -> 560  head
3272  hostid    -> 280  hostid
3916  hostname  -> 528  hostname
3916  id        -> 520  id
4028  install   -> 580  install
3944  join      -> 556  join
3892  kill      -> 516  kill
3864  link      -> 500  link
3916  ln        -> 532  ln
3860  logname   -> 504  logname
5880  ls        -> 1528 ls
3892  mkdir     -> 516  mkdir
3876  mkfifo    -> 504  mkfifo
3892  mknod     -> 512  mknod
4000  mktemp    -> 568  mktemp
3912  more      -> 528  more
3960  mv        -> 548  mv
3880  nice      -> 512  nice
5424  nl        -> 1316 nl
3888  nohup     -> 512  nohup
3884  nproc     -> 516  nproc
3936  numfmt    -> 548  numfmt
4048  od        -> 596  od
3892  paste     -> 516  paste
3884  pathchk   -> 512  pathchk
3948  pinky     -> 540  pinky
3872  printenv  -> 504  printenv
3320  printf    -> 312  printf
5560  ptx       -> 1412 ptx
3864  pwd       -> 504  pwd
3884  readlink  -> 516  readlink
3888  realpath  -> 516  realpath
3892  relpath   -> 516  relpath
3960  rm        -> 544  rm
3872  rmdir     -> 512  rmdir
3916  seq       -> 536  seq
3972  shred     -> 560  shred
3984  shuf      -> 560  shuf
3872  sleep     -> 512  sleep
4464  sort      -> 740  sort
3988  split     -> 568  split
3988  stat      -> 564  stat
7212  stdbuf    -> 868  stdbuf
3896  sum       -> 516  sum
3868  sync      -> 500  sync
3904  tac       -> 520  tac
3928  tail      -> 532  tail
3928  tee       -> 520  tee
3228  test      -> 260  test
3956  timeout   -> 548  timeout
3920  touch     -> 528  touch
3908  tr        -> 516  tr
3176  true      -> 232  true
3888  truncate  -> 516  truncate
3924  tsort     -> 524  tsort
3868  tty       -> 500  tty
3872  uname     -> 500  uname
3920  unexpand  -> 520  unexpand
3972  uniq      -> 560  uniq
3876  unlink    -> 504  unlink
3960  uptime    -> 564  uptime
3872  users     -> 504  users
3908  wc        -> 520  wc
3952  who       -> 540  who
3840  whoami    -> 496  whoami
3864  yes       -> 500  yes

Finished release [optimized] target(s) in 2m 04s -> Finished release [optimized] target(s) in 3m 41s

For multicall, I tried out the performance of the size-optimized version and it wasn't too bad, but there's probably some tweaking to do to find a balance of size and performance.

# make build-coreutils MULTICALL=y PROFILE=release

$ hyperfine --runs 8 --warmup 2 "/coreutils-8.32/bin/ls -al -R ./linux > /dev/null" "./coreutils_0 ls -al -R ./linux > /dev/null" "./coreutils_1 ls -al -R ./linux > /dev/null"
Benchmark #1: /coreutils-8.32/bin/ls -al -R ./linux > /dev/null
  Time (mean ± σ):     393.4 ms ±  15.9 ms    [User: 173.5 ms, System: 219.5 ms]
  Range (min … max):   377.5 ms … 421.1 ms    8 runs

Benchmark #2: ./coreutils_0 ls -al -R ./linux > /dev/null
  Time (mean ± σ):     410.2 ms ±  29.0 ms    [User: 265.8 ms, System: 144.1 ms]
  Range (min … max):   377.2 ms … 464.8 ms    8 runs

Benchmark #3: ./coreutils_1 ls -al -R ./linux > /dev/null
  Time (mean ± σ):     504.5 ms ±  39.4 ms    [User: 358.7 ms, System: 145.5 ms]
  Range (min … max):   465.0 ms … 586.2 ms    8 runs

Summary
  '/coreutils-8.32/bin/ls -al -R ./linux > /dev/null' ran
    1.04 ± 0.08 times faster than './coreutils_0 ls -al -R ./linux > /dev/null'
    1.28 ± 0.11 times faster than './coreutils_1 ls -al -R ./linux > /dev/null'

$ du coreutils_*
14684   coreutils_0
4428    coreutils_1

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

06kellyjac commented 2 years ago

Still important

sylvestre commented 2 years ago

yeah, this is why I removed the wontfix ;)

06kellyjac commented 2 years ago

Ah sorry, I didn't see that update

sylvestre commented 2 years ago

no worries

sylvestre commented 1 year ago

So, I have been thinking about that. If you care about size, the symlink way is the right way.

For example, ln -s /usr/bin/coreutils /usr/bin/ls will work.

Example in Debian: https://salsa.debian.org/rust-team/debcargo-conf/-/blob/master/src/coreutils/debian/rust-coreutils.links

AFAIK, it works without issue.

I don't think there is a significantly better way to improve this.

06kellyjac commented 1 year ago

Agreed, multicall helps a lot with the size, but there are also improvements from the cargo settings I listed above:

$ du coreutils_*
14684   coreutils_0
4428    coreutils_1

Should any of the settings be adopted?

sylvestre commented 1 year ago

Not yet. Last time I checked, our versions of rust/cargo were too old.

Would you like to try to submit a PR?

okias commented 10 months ago

I got to this place because I wanted to add date to an Alpine container, so I recalled that the great uutils exists and I could install it instead of the regular coreutils, but what hit me is:

4.67 MB of uutils vs 1.02 MB of coreutils in Alpine.

Is this something that should be addressed at the distribution level, or here?

tertsdiepraam commented 10 months ago

@okias that depends on what Alpine is already doing. Are they already using all the settings from --profile=release-small? The rest would need to be solved here. Note, however, that there's probably a limit to what we can do.

omnivagant commented 10 months ago

@okias we didn't. With --profile=release-small we got (for x86_64)

>>> Size difference for uutils-coreutils: 4784 KiB -> 4192 KiB

but while researching this I discovered --features=feat_os_unix_musl, and with both together

>>> Size difference for uutils-coreutils: 4784 KiB -> 4428 KiB

it's still a bit smaller, but perhaps not enough for the needs of @tertsdiepraam.

https://git.alpinelinux.org/aports/tree/testing/uutils-coreutils/APKBUILD

tertsdiepraam commented 10 months ago

Interesting, thanks! Indeed, that's not quite enough. This deserves some more investigation. However, I do want to set expectations: there's probably nothing we can do that will immediately cut the size to 1/4 of its current value.

Alright, so some questions first (these are both questions you might be able to answer and just open questions I want to investigate):

As a first data point, here's some output of cargo bloat of big crates:

 File  .text     Size Crate
 6.3%  16.6%   1.1MiB std
 2.2%   5.8% 387.4KiB regex_automata
 2.0%   5.2% 346.4KiB clap_builder
 1.7%   4.5% 304.2KiB uu_sort
 1.0%   2.7% 181.4KiB uucore
 1.0%   2.7% 179.3KiB uu_ls
 0.9%   2.4% 162.9KiB regex_syntax
 0.8%   2.1% 141.8KiB [Unknown]
 0.7%   1.9% 129.6KiB aho_corasick
 0.7%   1.8% 119.4KiB uu_tail
 0.6%   1.5% 100.6KiB uu_cp
 0.5%   1.4%  95.5KiB notify
 0.5%   1.3%  85.1KiB uu_pr
 0.4%   1.2%  78.6KiB clap_complete
 0.4%   1.1%  76.0KiB data_encoding
 0.4%   1.1%  75.8KiB uu_split
 0.4%   1.1%  72.4KiB hashbrown
 0.4%   1.0%  69.1KiB chrono
 0.4%   1.0%  68.6KiB uu_dd
 0.4%   1.0%  68.3KiB uu_ptx