jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Improvement: Hashing Choices? #11

Closed danieldjewell closed 7 years ago

danieldjewell commented 7 years ago

After playing with the software a bit and reviewing some of the code, it appears that MD5 is used to compare files (in the cases where a direct byte-for-byte comparison is not used).

Although MD5 is considered pretty compromised in the security world, it's probably OK here.

What I found interesting is that SHA1 was actually faster than MD5 in single-threaded performance, according to "openssl speed" (see below).

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              66146.82k   192360.69k   422322.77k   607490.05k   693177.00k
sha1             76348.14k   213634.15k   464515.50k   665504.43k   768557.06k

OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) blowfish(idx)
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -g -O2 -fdebug-prefix-map=/build/openssl-wIGtVG/openssl-1.0.2g=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
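
For anyone who wants a rough cross-check of these numbers without "openssl speed", here is a minimal sketch using Python's hashlib. The function name throughput_mb_s, the 8192-byte block size, and the 32 MiB total are my own assumptions for illustration, and the results depend on how the local Python and OpenSSL were built, so treat the output as indicative only.

```python
import hashlib
import time


def throughput_mb_s(algorithm: str, block_size: int = 8192,
                    total_bytes: int = 32 * 1024 * 1024) -> float:
    """Hash total_bytes of zero-filled data in block_size chunks; return MB/s."""
    block = bytes(block_size)          # zero-filled buffer, reused every round
    rounds = total_bytes // block_size
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(rounds):
        hasher.update(block)
    hasher.digest()                    # finalize so the full cost is measured
    elapsed = time.perf_counter() - start
    return (rounds * block_size) / elapsed / 1e6


if __name__ == "__main__":
    # Single-threaded, in-memory only: no disk I/O is involved here.
    for name in ("md5", "sha1", "sha512"):
        print(f"{name:8s} {throughput_mb_s(name):10.1f} MB/s")
```

Like "openssl speed", this measures raw hashing throughput only, which is not necessarily what limits a real scan.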

The system tested was running Ubuntu 16.10 (4 cores):

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz
stepping        : 9
microcode       : 0xffffffff
cpu MHz         : 3243.183
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
bugs            :
bogomips        : 6486.36
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

jvirkki commented 7 years ago

Good timing, as just yesterday I committed a first step towards allowing different hash algorithms to be selected.

I've experimented with this a bit in the past, and the results ranged from no difference (with murmurhash2) to slower (with xxhash), so I don't particularly expect allowing different hash choices to yield an improvement. Most of the hashing is done in a separate thread and it's not the bottleneck, even on a very slow CPU (Atom).
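
To illustrate why hashing on a separate thread tends to keep it off the critical path, here is a minimal Python sketch. It is not dupd's actual code (dupd is written in C); the function name hash_files_in_background, the queue size, and the chunk size are all assumptions made for the example. One thread reads files while a worker thread hashes the chunks, so hash cost overlaps with I/O.

```python
import hashlib
import queue
import threading


def hash_files_in_background(paths, algorithm="md5", chunk_size=65536):
    """Read files on the calling thread; hash their chunks on a worker thread."""
    work = queue.Queue(maxsize=64)     # bounded, so the reader cannot run far ahead
    digests = {}

    def hasher_worker():
        current = hashlib.new(algorithm)
        while True:
            path, chunk = work.get()
            if path is None:           # sentinel: no more files
                return
            if chunk is None:          # end-of-file marker for this path
                digests[path] = current.hexdigest()
                current = hashlib.new(algorithm)
            else:
                current.update(chunk)

    worker = threading.Thread(target=hasher_worker)
    worker.start()
    try:
        for path in paths:             # files are read one at a time, in order
            with open(path, "rb") as f:
                while chunk := f.read(chunk_size):
                    work.put((path, chunk))
            work.put((path, None))
    finally:
        work.put((None, None))         # always stop the worker, even on error
        worker.join()
    return digests
```

With this shape, a slower hash only shows up in wall-clock time once the hasher thread falls behind the reader.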

Nonetheless, I'm planning on making it selectable just to make experimenting easier.

jvirkki commented 7 years ago

I've added support for sha1 and sha512 (in addition to the current md5). Some timings:

% repeat 5 time dupd scan -p $HOME -q -F md5
dupd scan -p $HOME -q -F md5  4.02s user 6.02s system 143% cpu 7.000 total
dupd scan -p $HOME -q -F md5  3.75s user 5.86s system 138% cpu 6.935 total
dupd scan -p $HOME -q -F md5  3.60s user 6.04s system 138% cpu 6.948 total
dupd scan -p $HOME -q -F md5  3.73s user 5.91s system 138% cpu 6.962 total
dupd scan -p $HOME -q -F md5  3.66s user 6.00s system 138% cpu 6.965 total

% repeat 5 time dupd scan -p $HOME -q -F sha1
dupd scan -p $HOME -q -F sha1  4.67s user 5.80s system 140% cpu 7.433 total
dupd scan -p $HOME -q -F sha1  4.60s user 5.88s system 141% cpu 7.429 total
dupd scan -p $HOME -q -F sha1  4.56s user 5.96s system 140% cpu 7.473 total
dupd scan -p $HOME -q -F sha1  5.24s user 6.00s system 147% cpu 7.615 total
dupd scan -p $HOME -q -F sha1  4.51s user 6.02s system 141% cpu 7.444 total

% repeat 5 time dupd scan -p $HOME -q -F sha512
dupd scan -p $HOME -q -F sha512  6.20s user 5.93s system 140% cpu 8.653 total
dupd scan -p $HOME -q -F sha512  6.17s user 5.96s system 140% cpu 8.616 total
dupd scan -p $HOME -q -F sha512  6.08s user 6.12s system 140% cpu 8.662 total
dupd scan -p $HOME -q -F sha512  6.06s user 6.10s system 140% cpu 8.627 total
dupd scan -p $HOME -q -F sha512  6.06s user 6.05s system 140% cpu 8.626 total
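
To make the three sets of runs easier to compare, here is a small Python sketch that averages the wall-clock "total" column and expresses each algorithm relative to md5. The numbers are copied from the runs above; the summarize helper is just an illustrative name, not part of dupd.

```python
from statistics import mean

# Wall-clock "total" seconds copied from the five runs per algorithm above.
totals = {
    "md5": [7.000, 6.935, 6.948, 6.962, 6.965],
    "sha1": [7.433, 7.429, 7.473, 7.615, 7.444],
    "sha512": [8.653, 8.616, 8.662, 8.627, 8.626],
}


def summarize(totals, baseline="md5"):
    """Return {algorithm: (mean seconds, percent slower than baseline)}."""
    base = mean(totals[baseline])
    return {
        name: (mean(runs), 100.0 * (mean(runs) / base - 1.0))
        for name, runs in totals.items()
    }


if __name__ == "__main__":
    for name, (avg, pct) in summarize(totals).items():
        print(f"{name:7s} mean {avg:.3f}s  {pct:+.1f}% vs md5")
```

On these runs, sha1 comes out roughly 7% slower than md5 in wall-clock time and sha512 roughly 24% slower. System time is nearly identical across all three, so the difference is in user CPU time.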