treffynnon / lib_mysqludf_ssdeep

MySQL User Defined Functions for the ssdeep API -
libmysqludfssdeep.simonholywell.com

ssdeep version 2.5 produces different hash to ssdeep_fuzzy_hash #3

Closed mmaunder closed 13 years ago

mmaunder commented 13 years ago

Not sure if this is a bug, but I'd expect the ssdeep C binary to produce the same hash as the MySQL function ssdeep_fuzzy_hash. This is ssdeep version 2.5 on MySQL server version 5.1.54-1ubuntu4-log (Ubuntu).

mysql> select ssdeep_fuzzy_hash('The quick brown fox jumped over the lazy dog');
+-------------------------------------------------------------------+
| ssdeep_fuzzy_hash('The quick brown fox jumped over the lazy dog') |
+-------------------------------------------------------------------+
| 3:FJKKI6myFRct:FHIp+i                                             |
+-------------------------------------------------------------------+
1 row in set (0.01 sec)

Running ssdeep on test.txt, which contains the same string with no CR/LF, produces the output:

ssdeep,1.0--blocksize:hash:hash,filename
3:FJKKI6myFRcdn:FHIp+M,"/root/tmp/test.txt"
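To see exactly where the two signatures diverge, it can help to split them into the fields named in the header line (blocksize:hash:hash,filename). The following is a minimal Python sketch, not part of ssdeep or of this extension; `Signature`, `parse_signature` and `common_prefix` are hypothetical helper names introduced here for illustration only.

```python
from typing import NamedTuple, Optional


class Signature(NamedTuple):
    block_size: int
    hash1: str
    hash2: str
    filename: Optional[str]


def parse_signature(line: str) -> Signature:
    """Split 'blocksize:hash:hash' (optionally followed by ',"filename"')."""
    sig, _, rest = line.strip().partition(",")
    block_size, hash1, hash2 = sig.split(":", 2)
    filename = rest.strip('"') if rest else None
    return Signature(int(block_size), hash1, hash2, filename)


def common_prefix(a: str, b: str) -> str:
    """Return the leading characters shared by both strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


# The two signatures reported above (hypothetical variable names).
udf = parse_signature("3:FJKKI6myFRct:FHIp+i")
cli = parse_signature('3:FJKKI6myFRcdn:FHIp+M,"/root/tmp/test.txt"')

print(udf.block_size == cli.block_size)     # True: both use block size 3
print(common_prefix(udf.hash1, cli.hash1))  # FJKKI6myFRc
print(common_prefix(udf.hash2, cli.hash2))  # FHIp+
```

Both signatures share the block size and a long common prefix; they differ only in the final characters of each hash part.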

treffynnon commented 13 years ago

Whilst I do not know how the internals of the upstream ssdeep library would handle this, I would hazard a guess that it deals with files slightly differently and perhaps includes some other (meta?) information in the hash.

I also see the discrepancy between the two when using ssdeep_fuzzy_hash() and ssdeep_fuzzy_hash_filename() in both my PHP and MySQL extensions. So it is not just between the extension functions and the upstream ssdeep package binary.

Would you be able to do a comparison of ssdeep_fuzzy_hash() and ssdeep_fuzzy_hash_filename() on your installation of lib_mysqludf_ssdeep? Perhaps this is something that I can bring up with the upstream author and he can possibly shed some light on the difference between the two in his code/API.

treffynnon commented 13 years ago

I have heard the following back from Jesse:

Sorry you're having trouble with ssdeep. I don't know why the program is generating different results for these two functions. The fuzzy hashing algorithm is NOT designed to work with small inputs, however. You need to process about 4KB of input data to get meaningful results.

So are you able to test it with larger chunks of text information as mentioned above and in Issue #1?
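One way to build a larger test input is to repeat the sample sentence until it reaches the suggested size. This is only a sketch: it assumes "about 4KB" means at least 4096 bytes of UTF-8 text, and `repeat_to_min_size` is a hypothetical helper, not part of ssdeep or the extension.

```python
def repeat_to_min_size(text: str, min_bytes: int = 4096, sep: str = " ") -> str:
    """Repeat text (joined by sep) until its UTF-8 encoding is at least min_bytes long."""
    if not text:
        raise ValueError("text must not be empty")
    parts = [text]
    # Keep appending copies until the joined string is long enough.
    while len(sep.join(parts).encode("utf-8")) < min_bytes:
        parts.append(text)
    return sep.join(parts)


sentence = "The quick brown fox jumped over the lazy dog"
padded = repeat_to_min_size(sentence)
print(len(padded.encode("utf-8")))  # 4139
print(padded.count(sentence))       # 92
```

The resulting string can be passed to ssdeep_fuzzy_hash() and also written to a file (without a trailing newline) for the ssdeep binary, so that both see the same bytes.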

mmaunder commented 13 years ago

Yes, it does seem to improve. I get a score of 99 for the following, where the signature tested against is what the ssdeep binary output when hashing a file containing the same string:

select ssdeep_fuzzy_compare('6:FHIp+aNIp+aNIp+aNIp+aNIp+aNIp+aNIp+aNIp+aNIp+aNIp+aNIp+aNIp+aNID:FWYYYYYYYYYYYYYYYYYYYYYYYG', ssdeep_fuzzy_hash('The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog'));

mmaunder commented 13 years ago

So I did a little more research. I generated random strings from 1 char to 500 chars long and used the ssdeep binary to generate an ssdeep hash and got mysql to gen a hash from the same string. Then I got mysql to do a comparison and get a score. This uses ssdeep 2.5 with the newest version of your plugin.

The number at the start of each row is the string length. I ran it 20 times for each string length with a different random string. The numbers after the string length are the scores for each random string.

As you can see it stabilizes after 400 chars in length with consistent results. This is probably very useful data to include in any ssdeep docs. The perl script I wrote to gen this is at the end.

1: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
6: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
11: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
16: 7, 0, 8, 0, 8, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0, 0, 7, 7, 8, 0,
21: 0, 0, 0, 8, 10, 8, 13, 11, 8, 7, 0, 7, 0, 0, 0, 8, 0, 11, 7, 7,
26: 9, 10, 7, 0, 9, 10, 8, 7, 7, 12, 9, 9, 0, 9, 12, 10, 8, 14, 8, 10,
31: 15, 16, 16, 13, 11, 11, 9, 11, 0, 14, 7, 8, 10, 13, 9, 13, 8, 10, 8, 11,
36: 10, 9, 11, 11, 12, 15, 13, 13, 10, 18, 13, 15, 15, 16, 0, 17, 13, 12, 14, 10,
41: 14, 11, 13, 14, 15, 15, 18, 12, 19, 20, 15, 10, 14, 12, 16, 15, 15, 17, 13, 11,
46: 14, 17, 20, 13, 17, 16, 15, 16, 17, 9, 14, 11, 14, 23, 16, 20, 18, 14, 15, 15,
51: 21, 14, 15, 21, 17, 23, 22, 16, 23, 11, 20, 22, 15, 17, 21, 17, 18, 16, 16, 17,
56: 21, 15, 19, 19, 21, 21, 20, 19, 18, 21, 16, 23, 12, 18, 20, 25, 22, 22, 24, 15,
61: 18, 19, 25, 20, 13, 23, 17, 18, 21, 25, 23, 13, 22, 25, 27, 22, 22, 21, 26, 27,
66: 20, 23, 23, 29, 29, 20, 21, 20, 18, 24, 19, 25, 29, 19, 24, 18, 28, 20, 21, 29,
71: 18, 27, 30, 24, 22, 25, 18, 24, 22, 23, 25, 21, 22, 19, 25, 31, 23, 23, 25, 27,
76: 27, 28, 20, 24, 29, 24, 27, 24, 24, 28, 24, 31, 26, 26, 32, 25, 41, 33, 26, 25,
81: 30, 26, 23, 24, 28, 31, 32, 25, 20, 32, 30, 25, 24, 25, 29, 26, 32, 30, 25, 26,
86: 34, 19, 30, 31, 40, 33, 32, 28, 31, 34, 26, 26, 30, 27, 22, 25, 28, 23, 24, 26,
91: 36, 26, 39, 40, 34, 35, 38, 30, 33, 32, 30, 32, 33, 30, 30, 29, 22, 31, 25, 36,
96: 34, 35, 34, 34, 32, 35, 32, 37, 34, 25, 35, 40, 29, 20, 30, 37, 30, 37, 38, 31,
101: 36, 34, 37, 32, 35, 34, 35, 30, 36, 32, 43, 28, 33, 28, 23, 31, 38, 33, 35, 31,
106: 44, 47, 38, 38, 30, 39, 33, 45, 36, 43, 37, 33, 40, 38, 29, 35, 34, 40, 31, 37,
111: 35, 30, 34, 45, 37, 30, 41, 33, 28, 42, 28, 32, 37, 33, 44, 37, 41, 36, 39, 35,
116: 42, 43, 40, 38, 34, 36, 35, 30, 44, 34, 33, 38, 39, 30, 27, 45, 35, 39, 46, 46,
121: 38, 43, 52, 38, 41, 38, 48, 41, 42, 45, 38, 37, 42, 42, 40, 52, 40, 45, 40, 40,
126: 40, 35, 49, 40, 38, 46, 42, 49, 39, 40, 32, 45, 44, 44, 39, 40, 31, 52, 41, 32,
131: 56, 48, 50, 36, 51, 45, 40, 37, 48, 51, 48, 41, 53, 44, 48, 53, 41, 37, 38, 41,
136: 37, 46, 33, 46, 41, 47, 33, 39, 34, 46, 44, 46, 47, 44, 46, 49, 39, 41, 56, 33,
141: 50, 48, 39, 39, 43, 46, 47, 42, 45, 44, 47, 53, 53, 56, 52, 42, 42, 51, 32, 51,
146: 51, 50, 57, 55, 59, 61, 54, 58, 41, 54, 58, 38, 55, 57, 47, 45, 54, 48, 42, 46,
151: 55, 52, 45, 48, 57, 47, 59, 50, 50, 47, 51, 50, 47, 48, 51, 54, 49, 46, 40, 56,
156: 59, 53, 56, 44, 50, 49, 62, 47, 56, 54, 50, 59, 59, 53, 58, 51, 62, 43, 52, 59,
161: 56, 55, 49, 61, 49, 53, 56, 50, 56, 50, 49, 52, 55, 51, 59, 56, 48, 57, 55, 58,
166: 50, 60, 50, 63, 50, 64, 58, 64, 56, 59, 64, 53, 61, 60, 60, 54, 57, 60, 54, 50,
171: 55, 54, 59, 64, 53, 64, 54, 59, 63, 53, 55, 56, 62, 64, 64, 59, 51, 56, 55, 46,
176: 56, 43, 56, 64, 64, 54, 60, 55, 60, 58, 59, 58, 60, 61, 58, 64, 61, 59, 61, 52,
181: 59, 49, 64, 53, 56, 58, 53, 61, 58, 64, 64, 60, 53, 59, 55, 61, 58, 62, 62, 64,
186: 64, 60, 64, 47, 61, 63, 64, 64, 55, 54, 64, 64, 58, 64, 55, 58, 60, 60, 64, 62,
191: 64, 61, 58, 56, 62, 64, 64, 62, 57, 58, 64, 61, 57, 64, 52, 64, 64, 64, 62, 51,
196: 55, 61, 66, 64, 70, 64, 70, 64, 80, 68, 55, 64, 58, 64, 70, 80, 64, 78, 59, 64,
201: 90, 76, 64, 76, 54, 64, 58, 100, 59, 66, 64, 64, 64, 70, 78, 80, 76, 68, 74, 68,
206: 68, 64, 72, 64, 82, 84, 72, 64, 72, 78, 62, 74, 63, 64, 66, 82, 70, 64, 100, 72,
211: 68, 82, 88, 64, 64, 64, 46, 64, 80, 92, 70, 63, 74, 66, 70, 74, 76, 64, 70, 64,
216: 84, 76, 96, 68, 72, 64, 76, 78, 86, 70, 80, 84, 58, 84, 66, 82, 94, 68, 80, 62,
221: 100, 88, 64, 78, 70, 74, 78, 70, 86, 68, 80, 84, 64, 68, 70, 72, 80, 96, 76, 92,
226: 74, 80, 64, 64, 64, 68, 100, 70, 68, 74, 94, 76, 96, 86, 64, 82, 64, 74, 94, 78,
231: 64, 74, 88, 64, 66, 90, 76, 68, 74, 64, 70, 74, 86, 72, 86, 82, 72, 76, 100, 92,
236: 68, 70, 72, 70, 66, 74, 78, 66, 72, 80, 78, 74, 90, 84, 78, 88, 78, 76, 76, 72,
241: 88, 88, 74, 100, 94, 100, 68, 88, 76, 88, 80, 88, 88, 98, 88, 76, 72, 82, 76, 80,
246: 78, 86, 70, 100, 74, 92, 94, 66, 90, 86, 94, 86, 88, 92, 68, 74, 86, 72, 68, 68,
251: 76, 78, 74, 74, 100, 66, 82, 96, 84, 90, 94, 72, 70, 82, 80, 72, 96, 94, 68, 98,
256: 76, 64, 96, 92, 90, 100, 100, 82, 100, 70, 82, 88, 98, 80, 96, 88, 100, 100, 84, 80,
261: 88, 76, 76, 86, 96, 86, 94, 80, 68, 100, 90, 100, 88, 100, 88, 96, 70, 96, 84, 94,
266: 70, 94, 82, 94, 82, 94, 100, 100, 90, 98, 88, 94, 86, 100, 74, 100, 94, 68, 100, 88,
271: 100, 68, 100, 74, 74, 100, 64, 100, 88, 100, 76, 98, 100, 100, 100, 86, 92, 94, 84, 100,
276: 100, 100, 76, 78, 100, 98, 100, 92, 88, 100, 88, 96, 100, 74, 68, 90, 90, 98, 86, 90,
281: 84, 98, 88, 80, 100, 100, 100, 100, 74, 100, 76, 100, 94, 90, 98, 90, 90, 100, 100, 82,
286: 100, 84, 100, 96, 92, 100, 100, 92, 98, 100, 82, 88, 100, 100, 94, 92, 96, 100, 84, 76,
291: 100, 100, 94, 100, 94, 100, 94, 82, 90, 100, 84, 82, 70, 74, 100, 100, 94, 74, 94, 98,
296: 100, 100, 98, 92, 100, 100, 100, 100, 100, 100, 86, 92, 100, 90, 80, 100, 92, 98, 92, 84,
301: 96, 100, 100, 100, 100, 94, 96, 94, 100, 100, 100, 100, 100, 100, 88, 100, 100, 100, 96, 100,
306: 100, 84, 96, 100, 100, 100, 90, 100, 86, 100, 90, 100, 96, 100, 96, 100, 100, 100, 100, 100,
311: 100, 96, 94, 88, 100, 92, 100, 100, 86, 94, 100, 82, 100, 100, 100, 100, 74, 100, 100, 92,
316: 100, 100, 100, 100, 94, 100, 100, 100, 100, 100, 96, 100, 100, 100, 100, 100, 100, 100, 100, 86,
321: 100, 100, 92, 100, 100, 100, 100, 100, 100, 96, 100, 100, 100, 84, 100, 100, 94, 100, 100, 100,
326: 100, 100, 100, 100, 100, 96, 80, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
331: 100, 86, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
336: 100, 100, 92, 100, 100, 100, 100, 100, 100, 100, 100, 98, 100, 90, 100, 94, 100, 100, 100, 100,
341: 100, 96, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 94, 100, 100, 100, 100, 100, 100, 100,
346: 86, 100, 100, 100, 100, 86, 100, 100, 100, 100, 100, 100, 100, 100, 100, 88, 100, 100, 100, 100,
351: 100, 100, 100, 100, 100, 100, 90, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
356: 100, 98, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
361: 100, 100, 100, 100, 100, 100, 100, 100, 94, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
366: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
371: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
376: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 98, 100, 100, 100,
381: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 98, 100, 100, 100, 100, 100, 100, 100,
386: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 90, 100, 100, 100, 100, 100,
391: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
396: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
401: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
406: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
411: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
416: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
421: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
426: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
431: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
436: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
441: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
446: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
451: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
456: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
461: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
466: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
471: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
476: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
481: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
486: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
491: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
496: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
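The results can also be summarised programmatically. The sketch below is a rough Python illustration, not something from the thread: `parse_scores` and `first_stable_length` are hypothetical helpers, and `SAMPLE` holds only a handful of rows copied from the output above. It reads "length: score, score, ..." rows (even when a row wraps across lines) and reports the shortest length from which every tested string scored 100.

```python
import re
from statistics import mean

# A few rows copied from the results above; paste the full output to analyse all of it.
SAMPLE = """
16: 7, 0, 8, 0, 8, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0, 0, 7, 7, 8, 0,
201: 90, 76, 64, 76, 54, 64, 58, 100, 59, 66, 64, 64, 64, 70, 78, 80, 76, 68, 74, 68,
386: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 90, 100, 100, 100, 100, 100,
391: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
396: 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
"""


def parse_scores(text):
    """Map each string length to its list of scores; rows may wrap across lines."""
    rows = {}
    current = None
    for token in re.findall(r"\d+:?", text):
        if token.endswith(":"):      # "201:" starts a new row
            current = int(token[:-1])
            rows[current] = []
        elif current is not None:    # a bare number is a score for the current row
            rows[current].append(int(token))
    return rows


def first_stable_length(rows, target=100):
    """Smallest length from which every row consists only of `target` scores."""
    stable = None
    for length in sorted(rows, reverse=True):
        if all(score == target for score in rows[length]):
            stable = length
        else:
            break
    return stable


rows = parse_scores(SAMPLE)
for length, scores in sorted(rows.items()):
    print(f"{length}: min={min(scores)} mean={mean(scores):.2f} max={max(scores)}")
print(first_stable_length(rows))  # 391 for this sample
```

Because each length was only tested with 20 random strings, the value this reports is an observation about this particular run rather than a guaranteed threshold.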

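#!/usr/bin/perl
# Compare hashes from the ssdeep binary with ssdeep_fuzzy_hash() in MySQL
# for random strings of increasing length.
use strict;
use warnings;
use DBI;
my $dbname = 'yourDBNameHere';
my $dbhost = 'localhost';
my $dbpass = 'yourPasswordHere';
my $dbh = DBI->connect("DBI:mysql:database=" . $dbname . ";host=" . $dbhost . ";port=3306;mysql_compression=0;mysql_client_found_rows=0", 'root', $dbpass, { AutoCommit => 1, PrintError => 1, RaiseError => 0, })
        or die $DBI::errstr;

# String lengths 1, 6, 11, ... 501; 20 random strings per length.
for(my $i = 1; $i <= 501; $i += 5){
        print "$i: ";
        for(1..20){
                my $str = randString($i);
                # Write the string with no trailing newline so the binary hashes the same bytes.
                open(my $fh, '>', 'tmp.txt') or die $!;
                print $fh $str;
                close($fh);
                my $sd = `ssdeep tmp.txt`;
                # Output line 0 is the header; line 1 is blocksize:hash:hash,"filename".
                my @arr = split(/[\r\n]/, $sd);
                my $sum = $arr[1] // '';
                chomp $sum;
                # Keep only the signature in front of the quoted absolute filename.
                if($sum =~ m/^(.+),"\//){
                        $sum = $1;
                        my $res = $dbh->selectrow_array("select ssdeep_fuzzy_compare(?, ssdeep_fuzzy_hash(?))", undef, $sum, $str);
                        print "$res, ";
                }
        }
        print "\n";
}

# Return a random string of exactly the requested length, with occasional spaces.
sub randString {
        my $length_of_randomstring=shift;

        my @chars=('a'..'z','A'..'Z','0'..'9','_');
        my $random_string;
        foreach (1..$length_of_randomstring) {
                $random_string.=$chars[rand @chars];
                # Roughly one time in ten, follow the character with a space.
                if(rand(1000) > 900){
                        $random_string .= ' ';
                }
        }
        # The extra spaces can overshoot the target length, so trim back to it.
        $random_string = substr($random_string, 0, $length_of_randomstring);
        return $random_string;
}

treffynnon commented 13 years ago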

Thanks for your investigative work on this, Mark. When I get a chance I will update the PHP and MySQL docs to reflect your findings on the bare minimum string length required and Jesse's recommended 4KB minimum.

I think that this satisfactorily closes this ticket and thanks once again for testing and reporting issues.