Simple false negative - Githubissues

ole-tange commented 6 years ago

seq 10000 > 1-col
par < 1-col > multicol
seq 10010 > 1-col-ver2

The files 1-col, multicol, and 1-col-ver2 are more than 70% identical, but ssdeep sees multicol and 1-col as 0% identical.

I have the feeling it is due to the fuzzy hashing looking at too big a chunk.
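A toy illustration of that suspicion, under loud assumptions: it hashes fixed 64-byte chunks, which is not ssdeep's actual algorithm (ssdeep derives chunk boundaries from the content), and it replaces newlines with spaces as a crude stand-in for `par`. The names `chunk_digests` and `overlap` are hypothetical. The point is only that once every chunk contains at least one changed separator, no chunk hash can match, however similar the text looks to a human.

```python
import hashlib


def chunk_digests(data: bytes, chunk_size: int) -> set:
    """Hash each fixed-size chunk of data and return the set of digests."""
    return {
        hashlib.md5(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    }


def overlap(a: bytes, b: bytes, chunk_size: int) -> float:
    """Fraction of chunk digests shared by a and b (Jaccard index)."""
    da, db = chunk_digests(a, chunk_size), chunk_digests(b, chunk_size)
    return len(da & db) / max(len(da | db), 1)


# Equivalent of `seq 10000` and `seq 10010`: one number per line.
one_col = "".join(f"{i}\n" for i in range(1, 10001)).encode()
one_col_ver2 = "".join(f"{i}\n" for i in range(1, 10011)).encode()

# Crude stand-in for the `par` output: same numbers, separated by spaces.
multi_col = one_col.replace(b"\n", b" ")

# Every 64-byte chunk of one_col contains a newline and multi_col has none,
# so no chunk digest is shared.
print(overlap(one_col, multi_col, 64))     # 0.0
# Appending ten lines leaves almost every chunk untouched.
print(overlap(one_col, one_col_ver2, 64))  # close to 1
```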

I hit this problem when comparing articles. Article 1 had line numbers and 80 chars per line; article 2 had no line numbers and 60 chars per line. Apart from this, the two articles were identical.

a4lg commented 6 years ago

Unfortunately, this is not a bug.

The engine of ssdeep is a hash algorithm, and it does not look at the structure of the text directly. ssdeep works best when comparing similar files with minor modifications (such as header/footer changes or simple insertions/removals). If you want to use it to compare text (with column variations etc.), how about normalizing the text before passing it to ssdeep?

I think you are correct. However, we are not allowed to "improve" the algorithm itself, because we are responsible for millions of hashes that have already been generated (VirusTotal, for example). If text normalization doesn't work for your workload, I recommend using another LSH (locality-sensitive hashing) algorithm.
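A minimal sketch of what such normalization might look like for the cases in this thread. It assumes that line breaks and column widths carry no meaning, and that line numbers appear as a leading integer followed by whitespace; both are assumptions about the input, and `normalize_text` and `strip_line_numbers` are hypothetical helper names, not part of ssdeep.

```python
import re


def strip_line_numbers(text: str) -> str:
    """Drop a leading integer (and the whitespace after it) from each line.

    Assumed line-number format. Note that this also strips a genuine number
    that happens to start a line, so it is unsuitable for purely numeric
    data such as the `seq` output above.
    """
    return re.sub(r"(?m)^[ \t]*\d+[ \t]+", "", text)


def normalize_text(text: str) -> str:
    """Collapse every run of whitespace, including newlines, to one space."""
    return " ".join(text.split())


# The seq/par case: only the whitespace layout differs.
one_col = "".join(f"{i}\n" for i in range(1, 11))
multi_col = "1 2 3 4\n5 6 7 8\n9 10\n"
assert normalize_text(one_col) == normalize_text(multi_col)

# The article case: line numbers and a different line width.
numbered = "1 The quick brown fox\n2 jumps over the lazy dog\n"
plain = "The quick brown\nfox jumps over\nthe lazy dog\n"
assert normalize_text(strip_line_numbers(numbered)) == normalize_text(plain)
```

The normalized text, rather than the original file, would then be written out and passed to ssdeep. Hashes of normalized text are only comparable with other hashes produced using the same normalization.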

ole-tange commented 6 years ago

May I suggest that you at the very least make this limitation clear in the documentation?

You probably even know how big the chunks have to be to be considered a match.

ssdeep-project / ssdeep

Simple false negative #9