jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Increase read block size for improved performance #8

Closed jbruchon closed 8 years ago

jbruchon commented 8 years ago

Here's an easy performance boost: in filecompare.c, increase FC_BLOCK_SIZE from 8192 to a much larger value like 1048576 (1 MiB). I have been heavily benchmarking jdupes against dupd, and on a data set of 900,000+ files with an average file size of ~527 KB, the larger read block size in jdupes makes a huge difference due to greatly reduced disk thrashing. jdupes processed those files in ~7800 seconds while dupd took ~9800 seconds, which is what ultimately led me to the read block size difference.

The obvious downsides of this change are increased memory usage and possibly some excess data read during file comparisons in cases where the smaller block size would have allowed earlier termination. Still, it's incredibly low-hanging fruit for a significant performance gain. It also substantially reduces the number of calls to read() and memcmp() and the associated call overhead, since far fewer read loop iterations are performed (128x fewer per MiB of file data). A sketch of the kind of loop in question is below.
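For context, here is a minimal sketch of a block-wise comparison loop like the one being discussed. This is not dupd's actual filecompare.c; the function name and error handling are illustrative, and short reads on regular files are glossed over. It shows where the block size enters: every iteration costs one read() per file plus one memcmp(), so the iteration count per MiB scales inversely with FC_BLOCK_SIZE.

    #include <string.h>
    #include <unistd.h>

    #define FC_BLOCK_SIZE (128 * 1024)  /* tunable; the value under discussion */

    /* Illustrative only: compare two open files block by block.
     * Returns 1 if identical, 0 if they differ, -1 on read error. */
    static int files_equal(int fd_a, int fd_b)
    {
      static char buf_a[FC_BLOCK_SIZE];
      static char buf_b[FC_BLOCK_SIZE];
      ssize_t na, nb;

      for (;;) {
        /* One read() per file per iteration: a bigger block means
         * fewer iterations and fewer syscalls for the same data. */
        na = read(fd_a, buf_a, FC_BLOCK_SIZE);
        nb = read(fd_b, buf_b, FC_BLOCK_SIZE);
        if (na < 0 || nb < 0)
          return -1;                              /* read error */
        if (na != nb || memcmp(buf_a, buf_b, (size_t)na) != 0)
          return 0;                               /* differ: stop early */
        if (na == 0)
          return 1;                               /* both at EOF, all blocks matched */
      }
    }

Note the early-termination trade-off mentioned above: with a 1 MiB block, files that differ within the first few KB still incur a full 1 MiB read per file before the memcmp() catches the mismatch.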

jvirkki commented 8 years ago

Added a --fileblocksize option to make it easier to tune and test various sizes.
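For example, a run with a 1 MiB read block might look like this (assuming the option takes a byte count; check the usage output for the exact argument format):

    dupd scan --fileblocksize 1048576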

jvirkki commented 8 years ago

The optimum size will likely vary based on hardware and file set. I did some testing with my primary test set of files, and the sweet spot there appears to be 128K.

[benchmark chart: fileblocksize]

jvirkki commented 8 years ago

Changed the default to 128K based on the runs above. The value is now configurable, so it'll be easy to experiment further.