sharkdp / diskus

A minimal, fast alternative to 'du -sh'
Apache License 2.0

diskus slower than du #38

Open tesuji opened 5 years ago

tesuji commented 5 years ago

Maybe I was doing it wrong. The directory being measured is my clippy build directory.

% /usr/bin/du -sch
4.5G    .
4.5G    total
% diskus
4.73 GB (4,727,521,280 bytes)
% hyperfine diskus '/usr/bin/du -sch'
Benchmark #1: diskus
  Time (mean ± σ):     115.8 ms ±  28.6 ms    [User: 2.601 s, System: 0.592 s]
  Range (min … max):    69.1 ms … 156.9 ms    19 runs

Benchmark #2: /usr/bin/du -sch
  Time (mean ± σ):      22.8 ms ±   2.8 ms    [User: 5.5 ms, System: 17.4 ms]
  Range (min … max):    14.2 ms …  26.9 ms    163 runs

Summary
  '/usr/bin/du -sch' ran
    5.07 ± 1.40 times faster than 'diskus'


sharkdp commented 5 years ago

Interesting, thank you for reporting this.

First off, please always use the --warmup option of hyperfine (or perform a cold-cache benchmark); see https://github.com/sharkdp/diskus#warm-disk-cache
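For reference, a cold-cache benchmark can be done along these lines (as described in the linked README; the --prepare command drops the page cache before every run, which requires sudo):

hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' 'diskus' '/usr/bin/du -sch'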

But even with that out of the way, diskus seems to be much slower here. I am assuming you did a normal release build (with optimizations) via cargo install --path .?

What kind of disk is your folder on? Or is it mounted via network/etc.?

It could be related to the optimal number of threads. Could you please run this parametrized benchmark?

hyperfine -w5 -P threads 1 16 "diskus -j {threads}" --export-markdown /tmp/results.md

and post the content of /tmp/results.md here?

By the way: if you want to see both tools report the exact same size, use du -sc -B1 and diskus or, alternatively, use du -scb and diskus -b.
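That is, for example:

% du -sc -B1 ; diskus      # both report disk usage, in bytes
% du -scb ; diskus -b      # both report apparent size, in bytes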

tesuji commented 5 years ago

The following results were obtained on a different computer:

% /usr/bin/du -sh
4.3G    .
% diskus
4.52 GB (4,518,088,704 bytes)

> use the --warmup option of hyperfine

I got a roughly similar result:

% hyperfine --warmup 5 'diskus' 'du -sh'
Benchmark #1: diskus
  Time (mean ± σ):     102.5 ms ±  27.5 ms    [User: 1.899 s, System: 0.551 s]
  Range (min … max):    57.2 ms … 156.5 ms    21 runs

Benchmark #2: du -sh
  Time (mean ± σ):      33.0 ms ±   2.9 ms    [User: 12.1 ms, System: 20.9 ms]
  Range (min … max):    25.3 ms …  36.9 ms    97 runs

Summary
  'du -sh' ran
    3.11 ± 0.88 times faster than 'diskus'

> you did a normal release build

Yes, I did cargo build --release.

> What kind of disk is your folder on? Or is it mounted via network/etc.?

Honestly, I don't know. It is a shared server. I can provide more info if you give me instructions.

% df /home
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/lvm-home  2.7T  430G  2.2T  17% /home
% lsblk
NAME         MOUNTPOINT LABEL  SIZE UUID
sda                            2.7T
├─sda1                           1M
├─sda2       /boot             488M d3ad2fe7-0903-4329-a944-bd694a619fea
└─sda3                         2.7T 5SIphy-NV0U-3NIU-VG4i-dVPD-AzaF-q61CjS
  ├─lvm-swap [SWAP]            1.9G b9df322f-b43c-4bd1-b214-d70f52abbd66
  ├─lvm-root /                19.5G 7a8bb72e-96e8-4280-945e-25d9a0931443
  └─lvm-home /home             2.7T 314bd59b-e6e9-4ad5-a2db-ab581190c891
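One way to check whether sda is a rotational disk (HDD), without root:

% cat /sys/block/sda/queue/rotational    # 1 = rotational (HDD), 0 = non-rotational (SSD)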

> run this parametrized benchmark

Yes, here is the result:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `diskus -j 1` | 36.9 ± 0.8 | 35.6 | 39.5 | 4.71 ± 0.47 |
| `diskus -j 2` | 19.3 ± 1.2 | 17.9 | 23.2 | 2.46 ± 0.28 |
| `diskus -j 3` | 13.9 ± 1.4 | 12.5 | 20.4 | 1.78 ± 0.25 |
| `diskus -j 4` | 11.3 ± 1.1 | 9.8 | 15.9 | 1.45 ± 0.20 |
| `diskus -j 5` | 9.9 ± 1.0 | 8.3 | 13.9 | 1.26 ± 0.18 |
| `diskus -j 6` | 8.9 ± 0.9 | 7.3 | 12.6 | 1.14 ± 0.16 |
| `diskus -j 7` | 8.4 ± 0.9 | 6.8 | 11.2 | 1.08 ± 0.16 |
| `diskus -j 8` | 8.1 ± 0.9 | 6.3 | 10.8 | 1.04 ± 0.15 |
| `diskus -j 9` | 7.9 ± 0.8 | 6.4 | 10.6 | 1.01 ± 0.14 |
| `diskus -j 10` | 7.9 ± 0.7 | 6.4 | 10.3 | 1.01 ± 0.14 |
| `diskus -j 11` | 7.9 ± 0.7 | 6.3 | 10.2 | 1.00 ± 0.13 |
| `diskus -j 12` | 7.8 ± 0.8 | 6.4 | 10.9 | 1.00 |
| `diskus -j 13` | 7.9 ± 0.6 | 6.5 | 9.9 | 1.01 ± 0.13 |
| `diskus -j 14` | 8.2 ± 1.3 | 6.5 | 14.5 | 1.05 ± 0.19 |
| `diskus -j 15` | 8.1 ± 0.8 | 6.5 | 14.4 | 1.03 ± 0.14 |
| `diskus -j 16` | 8.7 ± 0.9 | 7.1 | 14.8 | 1.11 ± 0.16 |
sharkdp commented 5 years ago

Wait, so diskus is much faster with a correctly set number of threads?

Do you happen to have a massive number of CPU cores? What does

nproc

say?

tesuji commented 5 years ago
% nproc
32
sharkdp commented 5 years ago

Oh :smile: That seems to be the cause of this. By default, diskus uses 3 * nproc threads to walk the filesystem (96 in your case). It seems like this heuristic doesn't hold for a large number of cores.
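For illustration, the default is computed along these lines (a simplified sketch, not the actual diskus source; assumes the num_cpus crate):

fn default_thread_count() -> usize {
    // Current heuristic: three walker threads per logical core
    // (3 * 32 = 96 on your machine). A cap, e.g. `.min(32)`, would
    // avoid oversubscription on many-core systems.
    3 * num_cpus::get()
}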

We should probably cap it at some value (32?). If you have the time, it would be great if you could run the full benchmark up to 96 threads and post the JSON results here:

hyperfine -w5 -P threads 1 96 "diskus -j {threads}" --export-json /tmp/results.json

I would assume that the time slowly increases to 100 ms when the number of threads gets higher.

(a cold cache benchmark for comparison would also be great, but I don't want to bother you).

tesuji commented 5 years ago

> a cold cache benchmark

Sorry, I can't do that because it is a shared server and I don't have sudo privileges.

> run the full benchmark up to 96 threads

Here is the result: https://gist.github.com/lzutao/7b86122495608f9096ac692553e2a038

sharkdp commented 5 years ago

Ok, so the warm-cache runtime looks like this:

[plot: warm-cache runtime vs. number of threads]

Choosing 96 threads by default is obviously not optimal here.

matu3ba commented 4 years ago

> Ok, so the warm-cache runtime looks like this:
>
> [plot: warm-cache runtime vs. number of threads]
>
> Choosing 96 threads by default is obviously not optimal here.

This looks to me like the cores are stalling because they can't get memory access. Is this OS-restricted?
Memory restrictions usually look like an exponential approach to a limit (like a charging capacitor), but this looks linear from the minimum to the limit.

sharkdp commented 4 years ago

> Is this OS-restricted?

That might be one reason. But I would rather guess that we are simply limited by the sequential nature of the disk (and cache) itself. There is a certain benefit in bombarding the IO scheduler with lots of requests (that is why we use multiple threads in the first place), but at some point the synchronization/context-switching overhead is probably just too high.

There is no really solid basis for the 3 * nproc heuristic that diskus uses. It's just something that seemed to work fine for all the machines I tested on. Things are complicated by the fact that the optimal number of threads is different for warm-cache and cold-cache runs. The 3 * nproc value was a tradeoff between the two:

[plot: results from a 10 GB folder on my 8-core laptop; warm-cache and cold-cache results normalized independently]

matu3ba commented 4 years ago

> Is this OS-restricted?
>
> That might be one reason. But I would rather guess that we are simply limited by the sequential nature of the disk (and cache) itself. There is a certain benefit in bombarding the IO scheduler with lots of requests (that is why we use multiple threads in the first place), but at some point the synchronization/context-switching overhead is probably just too high.

The sysinfo crate exposes the disk type. You could at least adapt to HDD vs. SSD speeds, though I am not sure it is worth it. Sadly, it provides no method to obtain the read/write speeds of the disk and its caches; a speedup of around 5% from using the exact block size should be expected.
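A rough sketch of how the disk type could be queried (assuming a recent sysinfo with the Disks/DiskKind API plus the num_cpus crate; the thread counts below are illustrative guesses, not measured values):

use sysinfo::{DiskKind, Disks};

fn main() {
    let disks = Disks::new_with_refreshed_list();
    for disk in disks.list() {
        // Pick a walker thread count per disk kind; these numbers are
        // placeholders, not tuned defaults.
        let threads = match disk.kind() {
            DiskKind::HDD => 1,                   // rotational: parallel readers mostly seek
            DiskKind::SSD => 3 * num_cpus::get(), // flash: benefits from many queued requests
            DiskKind::Unknown(_) => num_cpus::get(),
        };
        println!("{:?} ({:?}): {} threads", disk.mount_point(), disk.kind(), threads);
    }
}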

To my knowledge there is no simple CPU model for estimating context switches, synchronization (on cache invalidations), and the like, which is a shame, but expected given Spectre and similar attacks. If you know otherwise, please tell me.

> There is no really solid basis for the 3 * nproc heuristic that diskus uses. It's just something that seemed to work fine for all the machines I tested on. Things are complicated by the fact that the optimal number of threads is different for warm-cache and cold-cache runs. The 3 * nproc value was a tradeoff between the two:

Thanks.

tesuji commented 4 years ago

> Is this OS-restricted?

~~Yeah, likely: max open files is soft-limited to 1024.~~ Forget about what I said above; I ran the command on the wrong machine. Here is the new result:

% cat /proc/$$/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             579972               579972               processes 
Max open files            1024                 1048576              files     
Max locked memory         67108864             67108864             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       579972               579972               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us
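
For what it's worth, the soft limit can be raised up to the hard limit without root, e.g.:

% ulimit -n                # current soft limit: 1024
% ulimit -n 1048576        # raise it to the hard limit for this shell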