sharkdp / fd

A simple, fast and user-friendly alternative to 'find'
Apache License 2.0
33.43k stars, 799 forks

3x~10x Performance regression between 7.2.0 and >7.3.0 on large folder #980

Open peter50216 opened 2 years ago

peter50216 commented 2 years ago

Noticed that some fd commands run much slower (10x slower) after I upgraded my local fd from 6.2.0 to the newest 8.3.2, so I did a quick version bisect.

Looks like the regression is between 7.2.0 and 7.3.0: all versions I've tested after 7.3.0 (7.4.0, 7.5.0, 8.0.0, 8.1.1, 8.3.2) run at about the same speed as 7.3.0.

Reproduce script:

set -e

wget -q https://github.com/sharkdp/fd/releases/download/v7.2.0/fd-v7.2.0-x86_64-unknown-linux-musl.tar.gz
tar -xf fd-v7.2.0-x86_64-unknown-linux-musl.tar.gz

wget -q https://github.com/sharkdp/fd/releases/download/v7.3.0/fd-v7.3.0-x86_64-unknown-linux-musl.tar.gz
tar -xf fd-v7.3.0-x86_64-unknown-linux-musl.tar.gz

hyperfine --version
hyperfine \
  --warmup 5 \
  './fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src' \
  './fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src'

(I'm using the Chrome OS source tree as an example here, but I can reproduce a similar regression on other large source trees, for example the Linux source tree.)
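To extend the bisect to the other versions mentioned above, the download step can be wrapped in a small helper. This is only a sketch: the function names `fd_release_url` and `fetch_fd` are made up for illustration, and it assumes every release tarball follows the same naming pattern as the two URLs in the reproduce script, which may not hold for all versions or targets.

```shell
# Print the download URL for an fd release tarball.
# Assumption: all releases are named like the v7.2.0 and v7.3.0 tarballs above.
fd_release_url() {
  version=$1
  target=${2:-x86_64-unknown-linux-musl}
  printf 'https://github.com/sharkdp/fd/releases/download/v%s/fd-v%s-%s.tar.gz\n' \
    "$version" "$version" "$target"
}

# Download and unpack one release into the current directory (hypothetical helper).
fetch_fd() {
  wget -q "$(fd_release_url "$@")" &&
    tar -xf "fd-v$1-${2:-x86_64-unknown-linux-musl}.tar.gz"
}
```

For example, `fetch_fd 7.2.0` would fetch the same musl tarball as the first `wget` line of the script, and `fetch_fd 7.2.0 x86_64-unknown-linux-gnu` would fetch the gnu build if it exists under that name.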

Result:

Benchmark #2: ./fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):     25.529 s ±  0.328 s    [User: 222.856 s, System: 1924.844 s]
  Range (min … max):   24.980 s … 26.091 s    10 runs

Summary
  './fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src' ran
    10.34 ± 0.28 times faster than './fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src'


* On my local laptop with SSD, with 4 cores/8 hyperthreads:

hyperfine 1.13.0
Benchmark 1: ./fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      2.348 s ±  0.101 s    [User: 10.347 s, System: 6.298 s]
  Range (min … max):    2.237 s …  2.527 s    10 runs

Benchmark 2: ./fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      6.882 s ±  0.090 s    [User: 44.010 s, System: 6.813 s]
  Range (min … max):    6.783 s …  7.065 s    10 runs

Summary
  './fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src' ran
    2.93 ± 0.13 times faster than './fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src'
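As a sanity check on how to read the summary line, the "2.93 ± 0.13" figure for the laptop run can be recomputed from the two reported means and standard deviations. The sketch below assumes the ratio's uncertainty comes from first-order error propagation (relative errors added in quadrature); that assumption is consistent with the numbers in this thread, but it is not taken from hyperfine's documentation. The function name `speedup` is made up for illustration.

```shell
# Relative speed of two benchmarks, with first-order error propagation.
# Arguments: mean and standard deviation of the fast run, then of the slow run.
speedup() {
  awk -v mf="$1" -v sf="$2" -v ms="$3" -v ss="$4" 'BEGIN {
    ratio = ms / mf
    # Relative errors of the two means add in quadrature.
    err = ratio * sqrt((sf / mf) ^ 2 + (ss / ms) ^ 2)
    printf "%.2f ± %.2f\n", ratio, err
  }'
}

# Laptop numbers: 7.2.0 (fast) vs 7.3.0 (slow).
speedup 2.348 0.101 6.882 0.090   # prints: 2.93 ± 0.13
```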



Also tried adding `--color=never`, and the results are similar. From the changelog, the only other suspect is the `--exec-batch` option?

Happy to provide additional testing / debug info if needed.

tavianator commented 2 years ago

I can reproduce that here, but with -j1 the performance is the same. I think this is https://github.com/sharkdp/fd/issues/710, and the cause is just the musl version being upgraded as a result of Rust being updated. Or maybe this is around when Rust stopped using jemalloc by default.
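The single-threaded comparison mentioned here could be run along the following lines. This is a sketch rather than the exact command used: the function name `bench_single_thread` is made up, the binary paths, pattern and search root are borrowed from the reproduce script, and it relies on fd's `-j` (`--threads`) option being available in both versions.

```shell
# Benchmark two fd binaries with a single worker thread each.
# Arguments: old binary, new binary, search pattern, search root.
bench_single_thread() {
  old_fd=$1
  new_fd=$2
  pattern=$3
  root=$4
  # -j1 limits fd to one thread, taking multi-threaded allocator
  # contention out of the comparison.
  hyperfine --warmup 5 \
    "$old_fd -j1 '$pattern' $root" \
    "$new_fd -j1 '$pattern' $root"
}
```

A call such as `bench_single_thread ./fd-v7.2.0-x86_64-unknown-linux-musl/fd ./fd-v7.3.0-x86_64-unknown-linux-musl/fd '.*camera_hal.*' ~/chromiumos/src` mirrors the original benchmark with threading disabled. If the two timings match, the regression most likely comes from something that only matters under concurrency.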

See also

peter50216 commented 2 years ago

Tested with the gnu version instead of musl, and verified that this is specific to musl.

Benchmark #2: ./fd-v7.3.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      2.947 s ±  0.065 s    [User: 138.492 s, System: 49.916 s]
  Range (min … max):    2.851 s …  3.046 s    10 runs

Summary
  './fd-v7.2.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src' ran
    1.21 ± 0.05 times faster than './fd-v7.3.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src'



There's still a slowdown of ~1.2x, which is probably caused by Rust no longer using jemalloc by default, as you said, with jemalloc being faster than glibc malloc in this use case?

I think this is covered by #710 anyway, so feel free to close this as a duplicate.

sharkdp commented 2 years ago

Thank you for reporting this anyway!

See also: https://dev.to/sharkdp/an-unexpected-performance-regression-11ai

Back then, the performance regression was between 7.0 and 7.1, so that doesn't quite fit with your results. You can easily check if a particular fd executable uses jemalloc by doing something like

strings <fd-executable> | grep jemalloc

peter50216 commented 2 years ago

Did a quick grep on the binaries downloaded from https://github.com/sharkdp/fd/releases:

Using jemalloc:

Not using jemalloc:

Looks like the patch to use jemalloc in 7.4.0 is not applied to the musl build (which is also stated in the 7.4.0 release notes).
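A check like this can be scripted over several unpacked release binaries at once. The sketch below is one possible way to do it, not the exact command used: the function name `check_jemalloc` is made up, and it uses `grep -a` (treat binary data as text) in place of `strings | grep` so it does not depend on binutils being installed.

```shell
# Report whether each given binary appears to contain jemalloc symbols.
# Arguments: one or more paths to fd executables.
check_jemalloc() {
  for bin in "$@"; do
    # -a scans the binary as text, -q suppresses output and only sets the status.
    if grep -qa jemalloc "$bin"; then
      printf '%s: jemalloc\n' "$bin"
    else
      printf '%s: no jemalloc\n' "$bin"
    fi
  done
}
```

Running `check_jemalloc fd-v*/fd` in the directory holding the unpacked tarballs would print one line per binary. A match only shows that the string is present in the file, so treat it as a strong hint rather than proof of which allocator is active.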

peter50216 commented 2 years ago

Also tried building musl + jemalloc on the master branch (c577b0838b2e) with `cross build --target=x86_64-unknown-linux-musl` (https://github.com/gnzlbg/jemallocator/issues/124#issuecomment-486561511), and the performance is much better than the non-jemalloc version:

Benchmark #1: ~/temp/fd-musl-no-jemalloc ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):     18.901 s ±  0.281 s    [User: 166.882 s, System: 1532.500 s]
  Range (min … max):   18.467 s … 19.252 s    10 runs

Benchmark #2: ~/temp/fd-musl-jemalloc ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      4.614 s ±  0.570 s    [User: 26.295 s, System: 361.069 s]
  Range (min … max):    3.435 s …  5.445 s    10 runs

Summary
  '~/temp/fd-musl-jemalloc ".*camera_hal.*" ~/chromiumos/src' ran
    4.10 ± 0.51 times faster than '~/temp/fd-musl-no-jemalloc ".*camera_hal.*" ~/chromiumos/src'

So it might be worthwhile to enable jemalloc for the musl build too. (From a quick glance at the GitHub Actions workflow, the musl version is already built with cross, so there shouldn't be any build issues.)

It's still slower than 7.2.0 but that's likely #599.