tesuji opened this issue 5 years ago
Interesting, thank you for reporting this.
First off, please always use the --warmup option of hyperfine (or perform a cold-cache benchmark), see https://github.com/sharkdp/diskus#warm-disk-cache
But even with that out of the way, diskus seems to be much slower here. I am assuming you did a normal release build (with optimizations) via cargo install --path .?
What kind of disk is your folder on? Or is it mounted via network/etc.?
It could be related to the optimal number of threads. Could you please run this parametrized benchmark:
hyperfine -w5 -P threads 1 16 "diskus -j {threads}" --export-markdown /tmp/results.md
and post the content of /tmp/results.md here?
By the way: if you want to see both tools report the exact same size, use du -sc -B1 and diskus, or, alternatively, use du -scb and diskus -b.
The following results were obtained on a different computer:
% /usr/bin/du -sh
4.3G .
% diskus
4.52 GB (4,518,088,704 bytes)
> use the --warmup option of hyperfine

I got a broadly similar result:
% hyperfine --warmup 5 'diskus' 'du -sh'
Benchmark #1: diskus
  Time (mean ± σ):  102.5 ms ± 27.5 ms  [User: 1.899 s, System: 0.551 s]
  Range (min … max):  57.2 ms … 156.5 ms  21 runs
Benchmark #2: du -sh
  Time (mean ± σ):  33.0 ms ± 2.9 ms  [User: 12.1 ms, System: 20.9 ms]
  Range (min … max):  25.3 ms … 36.9 ms  97 runs
Summary
  'du -sh' ran 3.11 ± 0.88 times faster than 'diskus'
> you did a normal release build

Yes, I did cargo build --release.
> What kind of disk is your folder on? Or is it mounted via network/etc.?

Honestly, I don't know either. It is a shared server. I can provide more info if you give me instructions.
% df /home
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/lvm-home 2.7T 430G 2.2T 17% /home
% lsblk
NAME MOUNTPOINT LABEL SIZE UUID
sda 2.7T
├─sda1 1M
├─sda2 /boot 488M d3ad2fe7-0903-4329-a944-bd694a619fea
└─sda3 2.7T 5SIphy-NV0U-3NIU-VG4i-dVPD-AzaF-q61CjS
├─lvm-swap [SWAP] 1.9G b9df322f-b43c-4bd1-b214-d70f52abbd66
├─lvm-root / 19.5G 7a8bb72e-96e8-4280-945e-25d9a0931443
└─lvm-home /home 2.7T 314bd59b-e6e9-4ad5-a2db-ab581190c891
> run this parametrized benchmark

Yes, here is the result:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `diskus -j 1` | 36.9 ± 0.8 | 35.6 | 39.5 | 4.71 ± 0.47 |
| `diskus -j 2` | 19.3 ± 1.2 | 17.9 | 23.2 | 2.46 ± 0.28 |
| `diskus -j 3` | 13.9 ± 1.4 | 12.5 | 20.4 | 1.78 ± 0.25 |
| `diskus -j 4` | 11.3 ± 1.1 | 9.8 | 15.9 | 1.45 ± 0.20 |
| `diskus -j 5` | 9.9 ± 1.0 | 8.3 | 13.9 | 1.26 ± 0.18 |
| `diskus -j 6` | 8.9 ± 0.9 | 7.3 | 12.6 | 1.14 ± 0.16 |
| `diskus -j 7` | 8.4 ± 0.9 | 6.8 | 11.2 | 1.08 ± 0.16 |
| `diskus -j 8` | 8.1 ± 0.9 | 6.3 | 10.8 | 1.04 ± 0.15 |
| `diskus -j 9` | 7.9 ± 0.8 | 6.4 | 10.6 | 1.01 ± 0.14 |
| `diskus -j 10` | 7.9 ± 0.7 | 6.4 | 10.3 | 1.01 ± 0.14 |
| `diskus -j 11` | 7.9 ± 0.7 | 6.3 | 10.2 | 1.00 ± 0.13 |
| `diskus -j 12` | 7.8 ± 0.8 | 6.4 | 10.9 | 1.00 |
| `diskus -j 13` | 7.9 ± 0.6 | 6.5 | 9.9 | 1.01 ± 0.13 |
| `diskus -j 14` | 8.2 ± 1.3 | 6.5 | 14.5 | 1.05 ± 0.19 |
| `diskus -j 15` | 8.1 ± 0.8 | 6.5 | 14.4 | 1.03 ± 0.14 |
| `diskus -j 16` | 8.7 ± 0.9 | 7.1 | 14.8 | 1.11 ± 0.16 |
Wait, so diskus is much faster with a correctly set number of threads?
Do you happen to have a massive number of CPU cores? What does nproc say?
% nproc
32
Oh :smile: That seems to be the cause of this. By default, diskus uses 3 * nproc threads to walk the filesystem (96 in your case). It seems like this heuristic doesn't hold for a large number of cores. We should probably cap it at some value (32?). If you have the time, it would be great if you could run the full benchmark up to 96 threads and post the JSON results here:
hyperfine -w5 -P threads 1 96 "diskus -j {threads}" --export-json /tmp/results.json
I would assume that the time slowly increases towards 100 ms as the number of threads grows.
(A cold-cache benchmark for comparison would also be great, but I don't want to bother you.)
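For illustration, here is a minimal sketch of what such a capped default could look like. The cap of 32 and the use of the num_cpus crate are assumptions for this sketch, not the actual diskus implementation:

```rust
// Hypothetical capped thread-count default (not diskus's actual code).
// Assumes the num_cpus crate for the logical core count.
fn default_thread_count() -> usize {
    const MAX_THREADS: usize = 32; // assumed cap, as discussed above

    // 3 * nproc, but never more than MAX_THREADS
    std::cmp::min(3 * num_cpus::get(), MAX_THREADS)
}

fn main() {
    println!("would use {} threads", default_thread_count());
}
```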
> a cold cache benchmark

Sorry, I couldn't do that because it is a shared server and I don't have sudo privileges.
> run the full benchmark up to 96 threads

Here is the result: https://gist.github.com/lzutao/7b86122495608f9096ac692553e2a038
Ok, so the warm-cache runtime over the number of threads (plotted from the gist data) looks like this:
Choosing 96 threads by default is obviously not optimal here.
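In case anyone wants to reproduce the summary, here is a minimal sketch of pulling the fastest run out of hyperfine's JSON export. It assumes the usual schema with a "results" array whose entries carry "command" and "mean" (in seconds), and uses serde_json for parsing:

```rust
// Sketch: find the fastest command in a hyperfine --export-json file.
// Assumes the schema {"results": [{"command": ..., "mean": ...}, ...]}
// with `mean` given in seconds.
use std::{error::Error, fs};

fn main() -> Result<(), Box<dyn Error>> {
    let data = fs::read_to_string("/tmp/results.json")?;
    let json: serde_json::Value = serde_json::from_str(&data)?;

    let fastest = json["results"]
        .as_array()
        .ok_or("missing 'results' array")?
        .iter()
        .min_by(|a, b| {
            let ma = a["mean"].as_f64().unwrap_or(f64::INFINITY);
            let mb = b["mean"].as_f64().unwrap_or(f64::INFINITY);
            ma.partial_cmp(&mb).unwrap()
        })
        .ok_or("no benchmark results")?;

    println!(
        "fastest: {} ({:.1} ms)",
        fastest["command"],
        fastest["mean"].as_f64().unwrap_or(0.0) * 1000.0
    );
    Ok(())
}
```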
> Choosing 96 threads by default is obviously not optimal here.
This looks to me like the cores stalling because they can't get memory access. Is this OS-restricted?
Memory restrictions usually look like an exponential approach to a limit (like a charging capacitor), but this curve looks linear from the minimum up to the limit.
> Is this OS-restricted?
That might be one reason. But I would rather guess that we are simply limited by the sequential nature of the disk (and cache) itself. There is a certain benefit in bombarding the IO scheduler with lots of requests (that is why we use multiple threads in the first place), but at some point the synchronization/context-switching overhead is probably just too high.
There is no really solid basis for the 3 * nproc heuristic that diskus uses. It's just something that seemed to work fine for all the machines I tested on. Things are complicated by the fact that the optimal number of threads is different for warm-cache and cold-cache runs. The 3 * nproc value was a tradeoff between the two:
(results from a 10 GB folder on my 8-core laptop, warm-cache and cold-cache results normalized independently)
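To make the "many threads walking the filesystem" idea concrete, here is a minimal sketch of a parallel directory-size walk with an explicitly capped thread pool. It uses rayon and is only an illustration of the technique, not diskus's actual implementation:

```rust
// Illustrative parallel directory-size walk (not the diskus implementation).
// Uses rayon; symlinks are not followed because DirEntry::metadata does not
// traverse them.
use rayon::prelude::*;
use std::{fs, path::Path};

fn dir_size(path: &Path) -> u64 {
    let entries: Vec<_> = match fs::read_dir(path) {
        Ok(rd) => rd.filter_map(Result::ok).collect(),
        Err(_) => return 0, // unreadable directories contribute nothing
    };

    entries
        .par_iter()
        .map(|entry| match entry.metadata() {
            Ok(md) if md.is_dir() => dir_size(&entry.path()),
            Ok(md) => md.len(), // apparent file size, similar to `du -b`
            Err(_) => 0,
        })
        .sum()
}

fn main() {
    // Cap the pool size explicitly, mirroring the `-j` discussion above.
    rayon::ThreadPoolBuilder::new()
        .num_threads(8) // assumed value for the sketch
        .build_global()
        .unwrap();

    println!("{} bytes", dir_size(Path::new(".")));
}
```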
> That might be one reason. But I would rather guess that we are simply limited by the sequential nature of the disk (and cache) itself. There is a certain benefit in bombarding the IO scheduler with lots of requests (that is why we use multiple threads in the first place), but at some point the synchronization/context-switching overhead is probably just too high.
There is sysinfo with disk-type detection. At least you could adapt to HDD and SSD speeds, but I am not sure whether it is worth it. Sadly, it provides no method to obtain read/write speeds for the disk and caches, presumably because only around a 5% speedup from using the exact block size is to be expected. To my knowledge there exists no simple CPU model to estimate context switches, synchronization (on cache invalidations), etc., which is a shame, but expected in light of Spectre and similar issues. If you know otherwise, please tell me.
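If it helps, here is a minimal sketch of reading the disk kind via the sysinfo crate. It assumes a reasonably recent sysinfo release where Disks::new_with_refreshed_list() and Disk::kind() exist; older versions expose this information through a different API:

```rust
// Sketch: query disk kinds (HDD/SSD/unknown) via the sysinfo crate.
// API names assume a recent sysinfo release; older versions differ.
use sysinfo::Disks;

fn main() {
    let disks = Disks::new_with_refreshed_list();
    for disk in disks.list() {
        println!(
            "{:?} mounted at {:?}: {:?}",
            disk.name(),
            disk.mount_point(),
            disk.kind() // DiskKind::HDD, DiskKind::SSD, or DiskKind::Unknown(_)
        );
    }
}
```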
> There is no really solid basis for the 3 * nproc heuristic that diskus uses. It's just something that seemed to work fine for all the machines I tested on. Things are complicated by the fact that the optimal number of threads is different for warm-cache and cold-cache runs. The 3 * nproc value was a tradeoff between the two.
Thanks.
> Is this OS-restricted?

~~Yeah, likely, max open files is soft-limited to 1024~~. Forget what I said above, I ran the command on the wrong machine. Here is the new result:
% cat /proc/$$/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 579972 579972 processes
Max open files 1024 1048576 files
Max locked memory 67108864 67108864 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 579972 579972 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Maybe I was doing something wrong. The directory I measured is my clippy build directory.