Unless you're running on a system with an EXTREMELY slow CPU or using a secure hash for an application that does not need it, hashing is not a bottleneck. If using xxhash (jdupes does), there's definitely no reason to offload to the GPU. There may be some merit for secure hash algorithms, but again, the algorithm would have to be a bottleneck. On many modern systems, there are multiple threads and cores, even on low-end Atom and Celeron chips, so just going multi-threaded would be both easier and more useful. GPGPU is also a somewhat difficult thing to add to a program, often with OS-specific and possibly vendor-specific dependencies.
In my tests, both the hash algorithm I wrote and the xxhash algorithm are so fast that crunching multiple gigabytes of data that's already in the disk cache seems nearly instantaneous on the systems I usually test on (AMD Phenom II X6 1035T, AMD A8-9600, and better). The few tests I've run on weak machines are not much different. These fast non-secure hash algorithms are so small that they fit in the L1 I-cache of the CPU and they don't iterate over the data several times.
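For a sense of how little code is involved, this is roughly the shape of the incremental hashing loop with the xxHash streaming API (a simplified sketch, not the actual jdupes code path; the buffer size and error handling here are arbitrary):

#include <stdio.h>
#include <stdint.h>
#include "xxhash.h"   /* https://github.com/Cyan4973/xxHash */

/* Sketch: hash one file in 64 KiB chunks using the XXH64 streaming calls
 * (the same XXH64_reset/XXH64_update/XXH64_digest functions that show up
 * in the cachegrind output below). */
static uint64_t hash_one_file(const char *path)
{
    char buf[65536];
    size_t len;
    uint64_t hash = 0;
    XXH64_state_t *state = XXH64_createState();
    FILE *fp = fopen(path, "rb");

    if (state != NULL && fp != NULL) {
        XXH64_reset(state, 0);                 /* seed 0 */
        while ((len = fread(buf, 1, sizeof(buf), fp)) > 0)
            XXH64_update(state, buf, len);     /* one pass over the data */
        hash = XXH64_digest(state);
    }
    if (fp != NULL)
        fclose(fp);
    XXH64_freeState(state);
    return hash;
}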
I'm not going to do it for dupd, but here's a run of cachegrind for jdupes over a pile of git clones I've done, as an example; note that the XXH64 functions used about 4-5x the instructions of the memory allocator I wrote:
--------------------------------------------------------------------------------
I1 cache: 65536 B, 64 B, 2-way associative
D1 cache: 65536 B, 64 B, 2-way associative
LL cache: 6291456 B, 64 B, 48-way associative
Command: ./jdupes -nrq /omitted/git
Data file: cachegrind.out.17132
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: on
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
218,016,407 24,289 1,336 52,196,122 846,204 7,679 28,926,733 134,551 97,096 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
34,436,850 1 1 4,919,550 0 0 0 0 0 /omitted/jdupes/xxhash.c:XXH64_round
33,632,462 257 5 1,555,632 6 0 9,914,412 0 0 /omitted/jdupes/xxhash.c:XXH64_update
29,127,815 13 13 6,657,750 449,466 1,095 4,050,399 15,205 13,158 /omitted/jdupes/jdupes.c:checkmatch
27,365,839 15,386 368 8,170,625 114,447 2,532 7,068,565 58,038 27,130 ???:???
16,448,144 1,535 8 1,962,206 714 0 278,820 9,733 8,734 /omitted/jdupes/string_malloc.c:string_malloc
14,616,698 3 3 4,414,360 114,670 413 0 0 0 /omitted/jdupes/jdupes.c:check_conditions
9,667,156 1 1 9,667,156 20,151 428 0 0 0 /omitted/jdupes/xxhash.c:XXH_read64
5,913,170 17 17 2,634,297 47,206 1 353,263 16 1 /omitted/jdupes/jdupes.c:grokdir
2,657,256 8 8 1,006,142 53 0 422,892 0 0 ???:_IO_un_link
2,609,677 8 8 533,169 3,952 20 0 0 0 /omitted/jdupes/jody_sort.c:numeric_sort
2,458,072 9 9 872,219 3,639 1 475,756 0 0 ???:_IO_link_in
2,329,139 5 5 320,320 0 0 289,635 0 0 /omitted/jdupes/xxhash.c:XXH64_digest
2,155,768 5 5 630,978 430 0 157,745 258 0 ???:malloc
2,104,160 8 8 435,057 272 0 435,028 380 0 /omitted/jdupes/jdupes.c:get_filehash
1,955,820 7 7 713,610 199 0 264,300 0 0 ???:fclose
1,850,164 206 8 290,741 1,538 1 211,447 0 0 ???:_IO_file_fopen
1,829,618 4 4 256,903 529 0 18,837 0 0 /omitted/jdupes/string_malloc.c:string_free
1,688,828 9 9 566,196 297 1 316,354 0 0 /omitted/jdupes/jdupes.c:check_singlefile
1,646,592 5 5 571,862 487 0 210,686 0 0 ???:fread
1,642,741 33 33 561,374 3,332 15 354,072 141 0 ???:vfprintf
1,613,487 3 3 189,822 9 0 759,288 46,286 46,117 /omitted/jdupes/jdupes.c:init_newfile
1,608,733 4 4 665,546 4,660 0 254,868 20 0 ???:readdir
1,515,784 6 6 338,397 32 1 116,786 19 0 ???:_IO_file_xsputn
1,347,970 4 2 370,030 0 0 343,600 384 0 ???:_IO_setb
1,268,640 5 5 264,300 32 0 422,880 0 0 ???:_IO_file_close_it
1,110,113 4 4 237,881 6 0 158,586 921 2 ???:_IO_file_doallocate
945,692 0 0 171,944 0 0 171,944 0 0 /omitted/jdupes/xxhash.c:XXH64_mergeRound
892,086 3 3 137,244 0 0 457,480 165 0 /omitted/jdupes/xxhash.c:XXH64_reset
845,781 2 2 158,585 0 0 185,014 0 0 ???:_IO_file_open
807,967 35 35 303,350 3,779 78 148,068 65 2 /omitted/jdupes/jdupes.c:main
788,907 5,342 2 157,791 5,120 0 48 0 0 ???:free
776,464 124 5 291,039 1,196 0 194,200 0 0 ???:_IO_file_underflow
756,224 12 12 100,763 3,477 755 0 0 0 ???:strlen
426,922 4 3 194,050 0 0 38,798 0 0 ???:__underflow
388,100 2 2 135,835 0 0 116,430 0 0 ???:_IO_switch_to_get_mode
370,020 2 2 105,720 0 0 26,430 0 0 ???:_IO_default_finish
369,400 13 13 84,962 0 0 77,574 0 0 ???:_IO_file_seekoff
348,898 1 1 34,891 0 0 2 0 0 ???:__xstat
343,603 4 2 158,586 51 0 52,862 0 0 ???:_IO_doallocbuf
343,603 1 1 105,724 0 0 132,155 0 0 ???:_IO_file_init
316,360 1 1 0 0 0 63,272 4 0 /usr/include/sys/stat.h:check_singlefile
316,350 1 1 31,635 0 0 0 0 0 ???:__lxstat
264,310 2 2 26,431 0 0 0 0 0 ???:__fxstat
237,960 2 2 23,796 19 1 0 0 0 ???:strchrnul
236,099 6 2 67,456 0 0 0 0 0 ???:read
--- snip ---
For comparison, fdupes (from which jdupes was forked) still uses MD5, and this is the cachegrind for the same command:
--------------------------------------------------------------------------------
I1 cache: 65536 B, 64 B, 2-way associative
D1 cache: 65536 B, 64 B, 2-way associative
LL cache: 6291456 B, 64 B, 48-way associative
Command: ./fdupes -nrqm /omitted/git/
Data file: cachegrind.out.19190
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: on
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
546,014,351 24,074 1,451 78,006,790 1,146,918 16,632 33,567,351 155,541 93,820 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
380,748,368 86 31 34,613,488 35,454 128 8,653,372 33 0 /omitted/git/fdupes/md5/md5.c:md5_process
54,679,755 17,166 439 13,339,153 96,975 5,614 10,742,075 137,111 89,126 ???:???
22,159,448 12 12 6,469,232 168,193 360 4,555,443 243 0 /omitted/git/fdupes/fdupes.c:checkmatch
8,287,715 4 4 727,397 4 0 1,410,438 0 0 /omitted/git/fdupes/md5/md5.c:md5_append
7,675,492 18 18 4,855,109 2,364 1 583,192 4,398 4,217 /omitted/git/fdupes/fdupes.c:grokdir
7,520,576 2 2 752,060 0 0 4 0 0 ???:__xstat
6,249,067 2 2 3,259,506 614,384 731 0 0 0 /omitted/git/fdupes/fdupes.c:is_hardlink
6,086,394 5 5 1,781,370 300 0 445,343 60 1 ???:malloc
4,635,360 0 0 489,288 0 0 412,032 0 0 ???:md5_finish
3,940,056 2 2 231,768 0 0 309,024 0 0 /omitted/git/fdupes/md5/md5.c:md5_finish
3,178,727 2 2 1,271,488 1,203 0 0 0 0 /omitted/git/fdupes/fdupes.c:filesize
3,178,725 0 0 0 0 0 635,745 8,662 0 /usr/include/sys/stat.h:filesize
2,987,232 1 1 437,784 0 0 412,032 0 0 /omitted/git/fdupes/fdupes.c:md5copy
2,950,640 8 8 1,117,561 689 0 469,792 0 0 ???:_IO_un_link
2,730,666 9 9 968,946 19,052 1 528,516 1 0 ???:_IO_link_in
2,442,018 5,478 3 537,618 46,503 2,823 0 0 0 ???:free
2,172,788 7 7 792,774 462 0 293,620 0 0 ???:fclose
2,055,340 142 7 322,982 261 1 234,896 1 0 ???:_IO_file_fopen
1,847,454 5 5 642,409 291 0 236,677 0 0 ???:fread
1,618,090 4 4 669,395 3,231 0 256,352 62 0 ???:readdir
1,497,502 2 2 411,078 0 0 381,716 830 0 ???:_IO_setb
1,472,039 3 3 128,760 324 0 284,126 2 0 /omitted/git/fdupes/fdupes.c:getcrcsignatureuntil
1,409,376 5 5 293,620 32 0 469,792 0 0 ???:_IO_file_close_it
1,233,257 4 4 264,269 172 0 176,178 2,578 0 ???:_IO_file_doallocate
1,136,792 0 0 274,022 17,911 0 0 0 0 /omitted/git/fdupes/fdupes.c:md5cmp
939,584 2 2 176,172 1 0 205,534 0 0 ???:_IO_file_open
894,006 205 5 335,031 473 0 223,590 0 0 ???:_IO_file_underflow
877,321 21 21 323,465 68,598 1,556 182,190 1 1 /omitted/git/fdupes/fdupes.c:main
771,252 14 14 103,464 3,978 2,337 0 0 0 ???:strlen
--- snip ---
Just for fun, here's a time comparison with all those files now in the cache (on the Phenom II X6 1035T, so an older DDR2 RAM machine, none of this new fast fun stuff) and the total number of path elements:
$ find ~/git/ | wc -l
31800
$ time ./fdupes -nrqm ~/git/
1805 duplicate files (in 1105 sets), occupying 2.9 megabytes
real 0m0.919s
user 0m0.196s
sys 0m0.720s
$ time ./jdupes -nrqm ~/git/
1805 duplicate files (in 1105 sets), occupying 2 MB
real 0m0.308s
user 0m0.060s
sys 0m0.245s
Ultimately, I think that GPGPU acceleration of the hashing will not help, especially since these numbers are for single-threaded programs.
One last note: the fdupes performance shown is not typical, because most users won't silence the progress output with -q, which is the biggest bottleneck in the program if you're using SSH (as I am), due to printf()-family functions blocking and a call to printf() for every single item touched.
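To illustrate the point, here is a generic sketch (not jdupes' actual progress code) of the usual mitigation: rate-limit the progress line so a slow terminal sees at most ~10 updates per second instead of one printf() per file.

/* Generic rate-limited progress output; illustration only, not jdupes code. */
#include <stdio.h>
#include <time.h>

static void progress_update(unsigned long files_scanned)
{
    static struct timespec last;    /* zeroed before the first call */
    struct timespec now;
    long long elapsed_ms;

    clock_gettime(CLOCK_MONOTONIC, &now);
    elapsed_ms = (long long)(now.tv_sec - last.tv_sec) * 1000
               + (now.tv_nsec - last.tv_nsec) / 1000000;
    if (elapsed_ms < 100)
        return;                     /* less than 100 ms since last print: skip */
    last = now;
    fprintf(stderr, "\rScanned %lu files", files_scanned);
    fflush(stderr);
}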
Thanks for the references to the papers; I'll read them when I have some time.
I agree with @jbruchon. There might be some combination of file set, CPU, and storage media where hashing speed is a bottleneck, but that will be quite rare.
In the case where the file data is being read from disk, hashing speed is not a factor because nearly all the run time is spent on I/O.
In the case where file data is read from the memory cache, hashing speed becomes interesting. You can experiment with the -F flag using the supported hash algorithms (md5, sha1, sha512, xxhash).
If you run with sha512 and then compare to a run with md5, you'll generally see faster results with md5. That tells us sha512 is a bottleneck and that switching to a faster hash (md5) helped.
If you then compare runs with xxhash, it is generally a bit faster but not by a huge margin. According to https://github.com/Cyan4973/xxHash, xxhash is more than an order of magnitude faster than MD5, but the scan speedup is a small fraction of that. Therefore, switching to a hashing algorithm faster than xxhash is unlikely to produce any additional benefit.
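To put rough numbers on that intuition (the figures below are made up for illustration, not measured from dupd): by Amdahl's law, if hashing accounts for a fraction p of the scan time and the replacement hash is k times faster, the best possible overall speedup is 1 / ((1 - p) + p / k).

/* Toy Amdahl's-law calculation; p and k are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double p = 0.20;  /* assumed fraction of scan time spent hashing (e.g. MD5) */
    double k = 10.0;  /* assumed speedup of the replacement hash (e.g. xxhash)  */

    double speedup = 1.0 / ((1.0 - p) + p / k);
    double ceiling = 1.0 / (1.0 - p);            /* infinitely fast hash */

    printf("speedup with a 10x faster hash: %.2fx\n", speedup);  /* ~1.22x */
    printf("ceiling with an instant hash:   %.2fx\n", ceiling);  /* 1.25x  */
    return 0;
}

So even if hashing were a fifth of the cached-scan time, a 10x faster hash would buy less than a 25% overall improvement, which matches the modest gains seen when switching algorithms.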
Nonetheless, it might be fun to try just for the experiment. It may well be slower due to the overhead of having to copy the data to the GPU.
I don't own a suitable GPU though, so this isn't something I can test.
Thanks for the responses and analysis. It does seem that it would be ideal to offload hashing to a secondary thread, but with a purported xxhash rate of >5 GB/s I agree that there would be minimal benefit.
Darn.
@jbruchon, on a side note, due to my own ignorance about profiling I'm not sure how cachegrind lets us determine how much time the program spends waiting around for hashes to compute, or how long it spends simply waiting for disk.
I've replicated the experiment locally, and have done a little reading on cachegrind. I understand that effectively the rate of cache misses should have a calculable impact on runtime, but I'm not sure how to calculate appropriate coefficients or estimate distortion introduced by the instrumentation.
I'm currently gearing up to run dupd / jdupes in a sampling profiler (VTune), in the hope that the result will be clearer to interpret. This is my first 'serious' interaction with a C profiler and I'm excited AND intimidated! However, I'd love to learn more about interpreting cache miss statistics: could you comment on proper interpretation and whether we can effectively intuit the above items (perhaps in another forum)?
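My naive back-of-the-envelope attempt, plugging in the PROGRAM TOTALS from the jdupes run above, would look something like this; the miss penalties (roughly 10 cycles per L1 miss and 100 per last-level miss, on top of 1 per instruction) are rule-of-thumb guesses on my part, not anything cachegrind itself reports:

/* Rough cycle estimate from the jdupes cachegrind PROGRAM TOTALS above.
 * The penalty weights are assumed rules of thumb, not measured values. */
#include <stdio.h>

int main(void)
{
    double Ir  = 218016407.0;                    /* instructions executed */
    double L1m = 24289.0 + 846204.0 + 134551.0;  /* I1mr + D1mr + D1mw    */
    double LLm = 1336.0 + 7679.0 + 97096.0;      /* ILmr + DLmr + DLmw    */

    double est_cycles = Ir + 10.0 * L1m + 100.0 * LLm;
    printf("estimated cycles: %.3g\n", est_cycles);  /* ~2.4e8 cycles */
    return 0;
}

That works out to well under a second of CPU time on anything modern, but I have no idea whether those weights are reasonable.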
@maxsu Cachegrind tells us how many CPU instructions were spent and where they were spent (the Ir counts) and how many data accesses (read or write) were performed (the Dr counts). The proposition was to offload the work to a GPU to take advantage of the massive parallelism a GPU offers. For this to matter, some hidden assumptions must be valid.
We can't assume that a platform has a powerful CPU, but most dev machines do, so let's just assume the program is running on an Atom D525 or similarly wimpy CPU and that there is a CPU resource bottleneck. We use Cachegrind to find out where our CPU instructions are being used the most. (Forget the cache hit rates, because GPGPU is about performing massively more processing instructions per second, not about improving cache hit rates.) Cachegrind tells us that the XXH64_* stuff is, in aggregate, using most of the CPU that is being used. However, looking at other functions that have "high" usage in the program, my string_malloc allocator is right up there in the same pile, as is the heaviest function of the program, checkmatch(), which shuffles entire files' worth of data around. I wrote string_malloc to reduce malloc() call overhead for the huge number of small memory allocations that jdupes performs (largely because of fdupes code that already did those allocations, plus momentum, not because spamming malloc/free is a good idea for something that fits in a known-sized space and is likely to be discarded). I know that string_malloc is very lightweight code. If it's only using 1/4 the CPU instructions compared to the core XXH64_* functions combined, then those must not be very heavy either. Indeed, much of my jdupes work has been on streamlining the code to reduce bottlenecks and unnecessary calls and functions.
For the same data set, fdupes spends 5+ times more instructions in md5_process() alone than jdupes spends in the two core XXH64_* calls combined. fdupes is far more likely to suffer a serious CPU bottleneck because of the unnecessarily massive hash function, and the numbers illustrate that quite well. It's possible that fdupes would benefit from a parallel MD5, but jdupes executes so few instructions that there's no way it would benefit. Keep in mind that even fdupes ran in less than a second on the test system for a data set of 31K files and 1.8K duplicates. With caches dropped, runtime explodes beyond a minute due to disk thrashing. I/O overhead beats CPU overhead by orders of magnitude. Cachegrind really isn't needed to see this, either.
tl;dr: no need to be precise or do math because simple heuristics tell us there is probably very little room for overall program run time improvement within the hashing department.
@maxsu - dupd does use separate threads for I/O and hashing. Still, I/O is most often the limiting factor.
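In outline, the split looks something like the sketch below (a heavily simplified illustration, not dupd's actual code): one thread reads blocks into a small bounded queue, and a second thread drains the queue and hashes. The "hash" here is a trivial byte sum just to keep the example self-contained; real code would feed the blocks to xxhash/md5/etc.

/* Simplified producer/consumer sketch of splitting I/O and hashing across
 * two threads. Illustration only; not dupd's implementation.
 * Build with: cc -pthread sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define QUEUE_SLOTS 4
#define BLOCK_SIZE  65536

struct block { size_t len; char data[BLOCK_SIZE]; };

static struct block queue[QUEUE_SLOTS];
static int q_head, q_tail, q_count, done;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void *reader_thread(void *arg)           /* I/O side */
{
    struct block local;
    FILE *fp = fopen((const char *)arg, "rb");
    if (fp == NULL) { perror("fopen"); exit(EXIT_FAILURE); }
    do {
        local.len = fread(local.data, 1, BLOCK_SIZE, fp);  /* read outside the lock */
        pthread_mutex_lock(&lock);
        if (local.len > 0) {
            while (q_count == QUEUE_SLOTS)
                pthread_cond_wait(&not_full, &lock);
            queue[q_tail] = local;
            q_tail = (q_tail + 1) % QUEUE_SLOTS;
            q_count++;
        } else {
            done = 1;                           /* EOF: tell the hasher to finish */
        }
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    } while (local.len > 0);
    fclose(fp);
    return NULL;
}

static void *hasher_thread(void *arg)           /* CPU side */
{
    unsigned long *sum = (unsigned long *)arg;
    struct block local;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (q_count == 0 && !done)
            pthread_cond_wait(&not_empty, &lock);
        if (q_count == 0 && done) { pthread_mutex_unlock(&lock); break; }
        local = queue[q_head];                  /* copy the block out */
        q_head = (q_head + 1) % QUEUE_SLOTS;
        q_count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        for (size_t i = 0; i < local.len; i++)  /* stand-in for a real hash */
            *sum += (unsigned char)local.data[i];
    }
    return NULL;
}

int main(int argc, char **argv)
{
    unsigned long sum = 0;
    pthread_t reader, hasher;

    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
    pthread_create(&reader, NULL, reader_thread, argv[1]);
    pthread_create(&hasher, NULL, hasher_thread, &sum);
    pthread_join(reader, NULL);
    pthread_join(hasher, NULL);
    printf("checksum: %lu\n", sum);
    return 0;
}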
I'll probably try GPU hashing at some point when I get my hands on suitable hardware, just for fun, but I don't expect any benefit.
Let me know if you observe any interesting results with vtune. I haven't used it.
This review of chunking and deduplication techniques mentions Shredder and StoreGPU, two proofs of concept that offload hashing onto the GPU.
I'm not sure the CPU is really the bottleneck in dupd: content-based chunking does a fair bit of heavy lifting outside hashing, whereas dupd's workflow is almost entirely hashing and may simply be I/O bound. I'm curious whether you've investigated this and what the numbers look like! Is there any room for GPU-based improvement?
Cc: @jbruchon