jmacd / xdelta

open-source binary diff, delta/differential compression tools, VCDIFF/RFC 3284 delta compression
http://xdelta.org

[request] xdelta as a file comparer #148

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Since xdelta already computes the differences between two files, I'd like to
propose a new feature:
- when a specific command-line argument is present, xdelta should run the same
comparison but, instead of writing a patch file, just report something like
"the similarity of the two files is about NN.N%".
Sometimes all you need to know is how similar two files are. For example, this
would be useful for two identical photos with different embedded descriptions,
or for two .mp3 files with identical audio content but different ID3 tags.

I've tried to get this result by analyzing xdelta's output in an awk script.
However, in addition to the details of its comparison progress, xdelta seems to
add some "header" information (or whatever it is) to its output, which makes
the calculated percentage of similarity (or difference) inaccurate, especially
for small files.

Original issue reported on code.google.com by dv...@ukr.net on 15 Oct 2012 at 7:42
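
To illustrate why a fixed amount of non-delta data in the output would hurt small files most, here is a minimal Python sketch. It assumes similarity is estimated as 1 - (patch size / target size); the `similarity` function, the `overhead` parameter, and all byte counts below are hypothetical and are not taken from xdelta.

```python
def similarity(patch_size: int, target_size: int, overhead: int = 0) -> float:
    """Estimate similarity as 1 - (patch payload / target size), clamped to [0, 1].

    `overhead` is a hypothetical count of patch bytes that are not delta data
    (headers and the like); it is subtracted before computing the ratio.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    payload = max(patch_size - overhead, 0)
    return min(max(1.0 - payload / target_size, 0.0), 1.0)


if __name__ == "__main__":
    # Hypothetical sizes: both patches carry the same 60 bytes of overhead.
    for target_size, patch_size in [(200, 80), (2_000_000, 200_060)]:
        raw = similarity(patch_size, target_size)
        adjusted = similarity(patch_size, target_size, overhead=60)
        print(f"{target_size:>9} bytes: raw {raw:.1%}, adjusted {adjusted:.1%}")
```

With these made-up numbers the same 60 bytes shift the estimate for the 200-byte file by a large margin, while the estimate for the 2 MB file barely moves.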

GoogleCodeExporter commented 9 years ago
I've attached the .awk script, in case you are interested.

Original comment by dv...@ukr.net on 16 Oct 2012 at 5:36

GoogleCodeExporter commented 9 years ago
Any comments or advice, please?..

Original comment by dv...@ukr.net on 23 Jan 2013 at 10:27

GoogleCodeExporter commented 9 years ago
Any comments or advice, please?..

Original comment by dv...@ukr.net on 12 Jun 2013 at 11:48

GoogleCodeExporter commented 9 years ago
You should use a fuzzy hash (search for ssdeep) for binary and text files.

Comparing sound or image files is a different story.

Original comment by mgr.inz....@gmail.com on 22 Jul 2013 at 10:53

GoogleCodeExporter commented 9 years ago
Thanks, but ssdeep does not seem to do what I expect. As an example, take
autoruns.exe (649864 bytes) and autorunsc.exe (567944 bytes) from SysInternals;
these are the GUI and console versions of the same program. Running "jdiff.exe
autoruns.exe autorunsc.exe a.patch" (JDIFF - Jojo's binary diff) produces
'a.patch' with a size of 190457 bytes. So the raw similarity of these two files
can be calculated as (1 - 190457/567944) = 0.66466 = 66.466%.
At the same time, ssdeep shows 0. I think that's because ssdeep does not detect
moved blocks, whereas both jdiff and xdelta do.

Original comment by dv...@ukr.net on 13 Aug 2013 at 9:09
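
The moved-block point can be illustrated with a minimal Python sketch of an order-independent similarity measure. This is not how xdelta, jdiff, or ssdeep work internally; `copy_coverage` and the window size `k` are hypothetical names chosen for illustration. The idea is to count how many bytes of the target fall inside some k-byte window that also occurs anywhere in the source.

```python
def copy_coverage(source: bytes, target: bytes, k: int = 8) -> float:
    """Fraction of target bytes covered by a k-byte window found anywhere in source."""
    if len(source) < k or len(target) < k:
        return 0.0
    # Index every k-byte window of the source, regardless of position.
    windows = {source[i:i + k] for i in range(len(source) - k + 1)}
    covered = bytearray(len(target))
    for i in range(len(target) - k + 1):
        if target[i:i + k] in windows:
            # Mark all bytes of the matching window as reproducible from source.
            covered[i:i + k] = b"\x01" * k
    return sum(covered) / len(target)


if __name__ == "__main__":
    first = bytes(range(0, 100))
    second = bytes(range(100, 200))
    # Same two blocks in swapped order: the content is unchanged, only moved.
    print(copy_coverage(first + second, second + first))
    # No shared content at all.
    print(copy_coverage(first, second))
```

Because the windows are looked up in a set rather than compared position by position, swapping the two blocks does not lower the score. The set holds one entry per source offset, so memory use grows with the source size.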