jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Feature Request: GPGPU acceleration #23

Closed maxsu closed 5 years ago

maxsu commented 5 years ago

This review of chunking and deduplication techniques mentions Shredder and StoreGPU - two proof of concepts that offload hashing onto the GPU.

I'm not sure the CPU is really the bottleneck in dupd - content-based chunking does a fair bit of heavy lifting outside of hashing, whereas dupd's workflow is almost entirely hashing and may well be I/O bound. I'm curious whether you've investigated this and what the numbers look like! Is there any room for GPU-based improvement?

Cc: @jbruchon

jbruchon commented 5 years ago

Unless you're running on a system with an EXTREMELY slow CPU or using a secure hash for an application that does not need it, hashing is not a bottleneck. If using xxhash (jdupes does), there's definitely no reason to offload to the GPU. There may be some merit for secure hash algorithms, but again, the algorithm would have to be a bottleneck. On many modern systems, there are multiple threads and cores, even on low-end Atom and Celeron chips, so just going multi-threaded would be both easier and more useful. GPGPU is also a somewhat difficult thing to add to a program, often with OS-specific and possibly vendor-specific dependencies.
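
As a rough illustration of what I mean by "just going multi-threaded", here's a minimal sketch (assuming POSIX threads and libxxhash; the file list comes from argv and the thread count is hard-coded - this is illustrative only, not jdupes or dupd code) where each worker claims the next unhashed file from a shared list:

/* Hypothetical sketch, not jdupes/dupd code: hash every file named on the
 * command line across a small pool of worker threads using XXH64. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include "xxhash.h"

#define NTHREADS 4
#define CHUNK    (1 << 20)   /* read files in 1 MiB pieces */

static char **files;         /* shared work list (argv) */
static int nfiles;
static int next_file;        /* index of the next unclaimed file */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    char *buf = malloc(CHUNK);
    XXH64_state_t *st = XXH64_createState();
    (void)arg;
    if (buf == NULL || st == NULL) return NULL;

    for (;;) {
        /* claim the next file under the lock, then hash it unlocked */
        pthread_mutex_lock(&lock);
        int i = (next_file < nfiles) ? next_file++ : -1;
        pthread_mutex_unlock(&lock);
        if (i < 0) break;

        FILE *fp = fopen(files[i], "rb");
        if (fp == NULL) continue;
        XXH64_reset(st, 0);
        size_t n;
        while ((n = fread(buf, 1, CHUNK, fp)) > 0)
            XXH64_update(st, buf, n);
        fclose(fp);
        printf("%016llx  %s\n",
               (unsigned long long)XXH64_digest(st), files[i]);
    }
    XXH64_freeState(st);
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tid[NTHREADS];
    files = argv + 1;
    nfiles = argc - 1;
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}

Build with something like gcc -O2 -pthread sketch.c -lxxhash. The point is that this is a far smaller lift than hauling in OpenCL/CUDA and copying buffers to the GPU.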

In my tests, both the hash algorithm I wrote and the xxhash algorithm are so fast that crunching multiple gigabytes of data that's already in the disk cache seems nearly instantaneous on the systems I usually test on (AMD Phenom II X6 1035T, AMD A8-9600, and better). The few tests I've run on weak machines are not much different. These fast non-secure hash algorithms are so small that they fit in the L1 I-cache of the CPU and they don't iterate over the data several times.

I'm not going to do it for dupd, but here's a run of cachegrind for jdupes over a pile of git clones I've done, as an example; note that the XXH64 functions used about 4-5x the instructions of the memory allocator I wrote:

--------------------------------------------------------------------------------
I1 cache:         65536 B, 64 B, 2-way associative
D1 cache:         65536 B, 64 B, 2-way associative
LL cache:         6291456 B, 64 B, 48-way associative
Command:          ./jdupes -nrq /omitted/git
Data file:        cachegrind.out.17132
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation:  on

--------------------------------------------------------------------------------
         Ir   I1mr  ILmr         Dr    D1mr  DLmr         Dw    D1mw   DLmw
--------------------------------------------------------------------------------
218,016,407 24,289 1,336 52,196,122 846,204 7,679 28,926,733 134,551 97,096  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Ir   I1mr ILmr        Dr    D1mr  DLmr        Dw   D1mw   DLmw  file:function
--------------------------------------------------------------------------------
34,436,850      1    1 4,919,550       0     0         0      0      0  /omitted/jdupes/xxhash.c:XXH64_round
33,632,462    257    5 1,555,632       6     0 9,914,412      0      0  /omitted/jdupes/xxhash.c:XXH64_update
29,127,815     13   13 6,657,750 449,466 1,095 4,050,399 15,205 13,158  /omitted/jdupes/jdupes.c:checkmatch
27,365,839 15,386  368 8,170,625 114,447 2,532 7,068,565 58,038 27,130  ???:???
16,448,144  1,535    8 1,962,206     714     0   278,820  9,733  8,734  /omitted/jdupes/string_malloc.c:string_malloc
14,616,698      3    3 4,414,360 114,670   413         0      0      0  /omitted/jdupes/jdupes.c:check_conditions
 9,667,156      1    1 9,667,156  20,151   428         0      0      0  /omitted/jdupes/xxhash.c:XXH_read64
 5,913,170     17   17 2,634,297  47,206     1   353,263     16      1  /omitted/jdupes/jdupes.c:grokdir
 2,657,256      8    8 1,006,142      53     0   422,892      0      0  ???:_IO_un_link
 2,609,677      8    8   533,169   3,952    20         0      0      0  /omitted/jdupes/jody_sort.c:numeric_sort
 2,458,072      9    9   872,219   3,639     1   475,756      0      0  ???:_IO_link_in
 2,329,139      5    5   320,320       0     0   289,635      0      0  /omitted/jdupes/xxhash.c:XXH64_digest
 2,155,768      5    5   630,978     430     0   157,745    258      0  ???:malloc
 2,104,160      8    8   435,057     272     0   435,028    380      0  /omitted/jdupes/jdupes.c:get_filehash
 1,955,820      7    7   713,610     199     0   264,300      0      0  ???:fclose
 1,850,164    206    8   290,741   1,538     1   211,447      0      0  ???:_IO_file_fopen
 1,829,618      4    4   256,903     529     0    18,837      0      0  /omitted/jdupes/string_malloc.c:string_free
 1,688,828      9    9   566,196     297     1   316,354      0      0  /omitted/jdupes/jdupes.c:check_singlefile
 1,646,592      5    5   571,862     487     0   210,686      0      0  ???:fread
 1,642,741     33   33   561,374   3,332    15   354,072    141      0  ???:vfprintf
 1,613,487      3    3   189,822       9     0   759,288 46,286 46,117  /omitted/jdupes/jdupes.c:init_newfile
 1,608,733      4    4   665,546   4,660     0   254,868     20      0  ???:readdir
 1,515,784      6    6   338,397      32     1   116,786     19      0  ???:_IO_file_xsputn
 1,347,970      4    2   370,030       0     0   343,600    384      0  ???:_IO_setb
 1,268,640      5    5   264,300      32     0   422,880      0      0  ???:_IO_file_close_it
 1,110,113      4    4   237,881       6     0   158,586    921      2  ???:_IO_file_doallocate
   945,692      0    0   171,944       0     0   171,944      0      0  /omitted/jdupes/xxhash.c:XXH64_mergeRound
   892,086      3    3   137,244       0     0   457,480    165      0  /omitted/jdupes/xxhash.c:XXH64_reset
   845,781      2    2   158,585       0     0   185,014      0      0  ???:_IO_file_open
   807,967     35   35   303,350   3,779    78   148,068     65      2  /omitted/jdupes/jdupes.c:main
   788,907  5,342    2   157,791   5,120     0        48      0      0  ???:free
   776,464    124    5   291,039   1,196     0   194,200      0      0  ???:_IO_file_underflow
   756,224     12   12   100,763   3,477   755         0      0      0  ???:strlen
   426,922      4    3   194,050       0     0    38,798      0      0  ???:__underflow
   388,100      2    2   135,835       0     0   116,430      0      0  ???:_IO_switch_to_get_mode
   370,020      2    2   105,720       0     0    26,430      0      0  ???:_IO_default_finish
   369,400     13   13    84,962       0     0    77,574      0      0  ???:_IO_file_seekoff
   348,898      1    1    34,891       0     0         2      0      0  ???:__xstat
   343,603      4    2   158,586      51     0    52,862      0      0  ???:_IO_doallocbuf
   343,603      1    1   105,724       0     0   132,155      0      0  ???:_IO_file_init
   316,360      1    1         0       0     0    63,272      4      0  /usr/include/sys/stat.h:check_singlefile
   316,350      1    1    31,635       0     0         0      0      0  ???:__lxstat
   264,310      2    2    26,431       0     0         0      0      0  ???:__fxstat
   237,960      2    2    23,796      19     1         0      0      0  ???:strchrnul
   236,099      6    2    67,456       0     0         0      0      0  ???:read
--- snip ---

For comparison, fdupes (from which jdupes was forked) still uses MD5, and this is the cachegrind for the same command:

--------------------------------------------------------------------------------
I1 cache:         65536 B, 64 B, 2-way associative
D1 cache:         65536 B, 64 B, 2-way associative
LL cache:         6291456 B, 64 B, 48-way associative
Command:          ./fdupes -nrqm /omitted/git/
Data file:        cachegrind.out.19190
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation:  on

--------------------------------------------------------------------------------
         Ir   I1mr  ILmr         Dr      D1mr   DLmr         Dw    D1mw   DLmw
--------------------------------------------------------------------------------
546,014,351 24,074 1,451 78,006,790 1,146,918 16,632 33,567,351 155,541 93,820  PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir   I1mr ILmr         Dr    D1mr  DLmr         Dw    D1mw   DLmw  file:function
--------------------------------------------------------------------------------
380,748,368     86   31 34,613,488  35,454   128  8,653,372      33      0  /omitted/git/fdupes/md5/md5.c:md5_process
 54,679,755 17,166  439 13,339,153  96,975 5,614 10,742,075 137,111 89,126  ???:???
 22,159,448     12   12  6,469,232 168,193   360  4,555,443     243      0  /omitted/git/fdupes/fdupes.c:checkmatch
  8,287,715      4    4    727,397       4     0  1,410,438       0      0  /omitted/git/fdupes/md5/md5.c:md5_append
  7,675,492     18   18  4,855,109   2,364     1    583,192   4,398  4,217  /omitted/git/fdupes/fdupes.c:grokdir
  7,520,576      2    2    752,060       0     0          4       0      0  ???:__xstat
  6,249,067      2    2  3,259,506 614,384   731          0       0      0  /omitted/git/fdupes/fdupes.c:is_hardlink
  6,086,394      5    5  1,781,370     300     0    445,343      60      1  ???:malloc
  4,635,360      0    0    489,288       0     0    412,032       0      0  ???:md5_finish
  3,940,056      2    2    231,768       0     0    309,024       0      0  /omitted/git/fdupes/md5/md5.c:md5_finish
  3,178,727      2    2  1,271,488   1,203     0          0       0      0  /omitted/git/fdupes/fdupes.c:filesize
  3,178,725      0    0          0       0     0    635,745   8,662      0  /usr/include/sys/stat.h:filesize
  2,987,232      1    1    437,784       0     0    412,032       0      0  /omitted/git/fdupes/fdupes.c:md5copy
  2,950,640      8    8  1,117,561     689     0    469,792       0      0  ???:_IO_un_link
  2,730,666      9    9    968,946  19,052     1    528,516       1      0  ???:_IO_link_in
  2,442,018  5,478    3    537,618  46,503 2,823          0       0      0  ???:free
  2,172,788      7    7    792,774     462     0    293,620       0      0  ???:fclose
  2,055,340    142    7    322,982     261     1    234,896       1      0  ???:_IO_file_fopen
  1,847,454      5    5    642,409     291     0    236,677       0      0  ???:fread
  1,618,090      4    4    669,395   3,231     0    256,352      62      0  ???:readdir
  1,497,502      2    2    411,078       0     0    381,716     830      0  ???:_IO_setb
  1,472,039      3    3    128,760     324     0    284,126       2      0  /omitted/git/fdupes/fdupes.c:getcrcsignatureuntil
  1,409,376      5    5    293,620      32     0    469,792       0      0  ???:_IO_file_close_it
  1,233,257      4    4    264,269     172     0    176,178   2,578      0  ???:_IO_file_doallocate
  1,136,792      0    0    274,022  17,911     0          0       0      0  /omitted/git/fdupes/fdupes.c:md5cmp
    939,584      2    2    176,172       1     0    205,534       0      0  ???:_IO_file_open
    894,006    205    5    335,031     473     0    223,590       0      0  ???:_IO_file_underflow
    877,321     21   21    323,465  68,598 1,556    182,190       1      1  /omitted/git/fdupes/fdupes.c:main
    771,252     14   14    103,464   3,978 2,337          0       0      0  ???:strlen
--- snip ---

Just for fun, here's a time comparison with all those files now in the cache (on the Phenom II X6 1035T, so an older DDR2 RAM machine, none of this new fast fun stuff) and the total number of path elements:

$ find ~/git/ | wc -l
31800

$ time ./fdupes -nrqm ~/git/
1805 duplicate files (in 1105 sets), occupying 2.9 megabytes

real    0m0.919s
user    0m0.196s
sys     0m0.720s

$ time ./jdupes -nrqm ~/git/
1805 duplicate files (in 1105 sets), occupying 2 MB

real    0m0.308s
user    0m0.060s
sys     0m0.245s

Ultimately, I think that GPGPU acceleration of the hashing will not help, especially since these numbers are for single-threaded programs.

One last note: the fdupes performance shown here is not typical, because most users won't silence the progress indicator with -q. Over SSH (as I'm running here), that progress output is the biggest bottleneck in the program, since the printf() family functions block and printf() is called for every single item touched.

jvirkki commented 5 years ago

Thanks for the references to the papers, I'll read them when I have some time.

I agree with @jbruchon. There might be some combination of file set, CPU and storage media where hashing speed is a bottleneck but that will be quite rare.

In the case where the file data is being read from disk, hashing speed is not a factor because nearly all the run time is spent on I/O.

In the case where file data is read from the memory cache, hashing speed becomes interesting. You can experiment with the -F flag to try the supported hash algorithms (md5, sha1, sha512, xxhash).

If you run with sha512 and then compare against a run with md5, the md5 run is generally faster. That tells us sha512 was a bottleneck and that going with a faster hash (md5) helped.

If you then compare runs with xxhash, they are generally a bit faster still, but not by a huge margin. According to https://github.com/Cyan4973/xxHash, xxhash is more than an order of magnitude faster than MD5, yet the scan speedup is only a small fraction of that. Therefore, switching to a hashing algorithm even faster than xxhash is unlikely to produce any additional benefit.
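
If you want to take file I/O and the rest of the scan out of the picture entirely, a standalone micro-benchmark along these lines shows the raw per-algorithm gap (a sketch only, assuming libxxhash and OpenSSL's libcrypto are installed - this is not dupd code, and the buffer size is arbitrary):

/* Hypothetical micro-benchmark: raw hash throughput over a buffer that is
 * already in memory, i.e. the "everything in cache" case. Not dupd code. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <openssl/md5.h>    /* MD5() used only for a ballpark comparison */
#include "xxhash.h"

#define SZ (256u << 20)     /* 256 MiB test buffer */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    unsigned char *buf = malloc(SZ);
    unsigned char md[MD5_DIGEST_LENGTH];
    if (buf == NULL) return 1;
    for (size_t i = 0; i < SZ; i++) buf[i] = (unsigned char)i;

    double t0 = now();
    unsigned long long h = XXH64(buf, SZ, 0);   /* one-shot xxhash64 */
    double t1 = now();
    MD5(buf, SZ, md);                           /* one-shot MD5 */
    double t2 = now();

    printf("xxh64: %.2f GB/s (%016llx)\n", SZ / (t1 - t0) / 1e9, h);
    printf("md5:   %.2f GB/s\n", SZ / (t2 - t1) / 1e9);
    free(buf);
    return 0;
}

Link with something like -lxxhash -lcrypto. On anything reasonably modern the xxhash number dwarfs the MD5 number, yet the end-to-end scan difference is much smaller - which is exactly the sign that the scan time is dominated by everything other than the hash itself.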

Nonetheless, it might be fun to try just for the experiment. It may well be slower due to the overhead of having to copy the data to the GPU.

I don't own a suitable GPU though, so this isn't something I can test.

maxsu commented 5 years ago

Thanks for the responses and analysis. It does seem it would be ideal to offload hashing to a secondary thread, but with a purported xxhash rate of >5 GB/s I agree that there would be minimal benefit.

Darn.

@jbruchon, on a side note: due to my own ignorance about profiling, I'm not sure how cachegrind lets us determine whether hashing is actually the CPU bottleneck here.

I've replicated the experiment locally and have done a little reading on cachegrind. I understand that the cache miss rates should, in principle, have a calculable impact on runtime, but I'm not sure how to choose appropriate coefficients or how to estimate the distortion introduced by the instrumentation.
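
For what it's worth, one rough rule of thumb I've run across is to cost a first-level miss at roughly 10 cycles and a last-level miss at roughly 100 (I believe KCachegrind's cycle estimate uses similar weights). Plugging in the jdupes program totals from above:

Ir                                                   218,016,407
L1 misses   (24,289 + 846,204 + 134,551) x  10   =    10,050,440
LL misses   ( 1,336 +   7,679 +  97,096) x 100   =    10,611,100
                                                     -----------
estimated cycles                                   ~ 238,677,947

So even with fairly generous coefficients, the misses only add about 10% on top of the raw instruction count, which suggests the instruction counts themselves tell most of the story here.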

I'm currently gearing up to run dupd / jdupes under a sampling profiler (VTune), in the hope that the results will be clearer to interpret - this is my first 'serious' interaction with a C profiler and I'm excited AND intimidated! I'd still love to learn more about interpreting cache miss statistics, though - could you comment on how to read them properly and whether we can effectively intuit the above (perhaps in another forum)?

jbruchon commented 5 years ago

@maxsu Cachegrind tells us how many CPU instructions were executed and where (the Ir counts) and how many data reads and writes were performed (the Dr and Dw counts). The proposition was to offload the work to a GPU to take advantage of the massive parallelism a GPU offers. For that to matter, some hidden assumptions must hold:

We can't assume that a platform has a powerful CPU, but most dev machines do, so let's assume the program is running on an Atom D525 or similarly wimpy CPU and that there is a CPU resource bottleneck. We use cachegrind to find out where our CPU instructions are being spent. (Forget the cache hit rates: GPGPU is about executing massively more instructions per second, not about improving cache hit rates.) Cachegrind tells us that the XXH64_* functions are, in aggregate, using most of the CPU that is being used.

However, looking at the other functions with "high" usage in the program, my string_malloc allocator is right up there in the same pile, as is the heaviest function in the program, checkmatch(), which shuffles entire files' worth of data around. I wrote string_malloc to reduce malloc() call overhead for the huge number of small memory allocations that jdupes performs (largely because of inherited fdupes code that already did those allocations, plus momentum - not because spamming malloc/free is a good idea for data that fits in a known-sized space and is likely to be discarded). I know that string_malloc is very lightweight code. If it's only using about 1/4 of the CPU instructions that the core XXH64_* functions use combined, then those can't be very heavy either. Indeed, much of my jdupes work has been on streamlining the code to remove bottlenecks and unnecessary calls and functions.

For the same data set, fdupes spends over 5 times as many instructions in md5_process() alone as jdupes spends in the two core XXH64_* functions combined (380,748,368 vs. 34,436,850 + 33,632,462 ≈ 68 million). fdupes is far more likely to suffer a serious CPU bottleneck because of the unnecessarily massive hash function, and the numbers illustrate that quite well. It's possible that fdupes would benefit from a parallel MD5, but jdupes executes so few instructions that there's no way it would benefit. Keep in mind that even fdupes ran in less than a second on the test system for a data set of 31K files and 1.8K duplicates. With caches dropped, runtime explodes beyond a minute due to disk thrashing. I/O overhead beats CPU overhead by orders of magnitude. Cachegrind really isn't needed to see this, either.

tl;dr: no need to be precise or do math because simple heuristics tell us there is probably very little room for overall program run time improvement within the hashing department.

jvirkki commented 5 years ago

@maxsu - dupd does use separate threads for I/O and the hashing. Still, I/O is most often the limiting factor.

I'll probably try GPU hashing at some point when I get my hands on suitable hardware, just for fun, but I don't expect any benefit.

Let me know if you observe any interesting results with vtune. I haven't used it.