Request: Bloom Filter for ssdeep

ssdeep-project / ssdeep

Fuzzy hashing API and fuzzy hashing tool

https://ssdeep-project.github.io/ssdeep/index.html

GNU General Public License v2.0

671 stars 115 forks source link

Request: Bloom Filter for ssdeep #21

Open rjzak opened 5 years ago

rjzak commented 5 years ago

For times when processing several files for similarity, it would be helpful to have a bloom filter, but for SSDeep. Something where the algorithm could answer "have I seen a file like this before", but without having to store and check every single prior ssdeep hash.

a4lg commented 5 years ago

Hi, thanks for your feedback!

I am interested in bloom filter-like feature for ssdeep hashes in general but not quite interested in implementing it in the ssdeep software (due to various demands that depend on the scale).

To my mind, it seems that...

Implementing helper functions to libfuzzy library for better indexing and
Documenting some of ssdeep internals

is reasonable.

I, as a huge ssdeep fan, once made a ssdeep-compatible library ffuzzypp and parallel clustering tool fast-ssdeep-clus (now open sourced as https://github.com/a4lg/fast-ssdeep-clus) before I become a ssdeep maintainer. This is fairly simple (does not use any bloom filters) but can make full cluster list from 30M raw ssdeep hashes in only a few days (on a 2x10-core server) and adding some hashes to the cluster list would not take that time.

So, I think making features to help advanced users make indexing efficient is reasonable (and plausible). Could you let me know your thoughts about this option?

rjzak commented 5 years ago

I was thinking of a solution which wouldn't necessarily use bloom filters, but would replicate some of their functionality. Basically, having a low memory algorithm which could indicate with some degree of error if the provided ssdeep hash was similar to anything it's seen before, but without having to store all the prior examples to make this determination. To me, it seems such a solution would be able to process the 30M examples quicker than a few days (just a guess). The proposed clustering idea seems to still require storing all the previously observed hashes, and doesn't fit my use case. I'm also working with malware, but would like to de-duplicate the collection without having a runtime of O(N^N).

tohihuty commented 5 years ago

At least its is possible to decrease the huge number of hash comparisons by pre-filtering as depicted at https://www.virusbulletin.com/virusbulletin/2015/11/optimizing-ssdeep-use-scale/. My C++ implementation of a tailored hash table and comparison function processes 250k hashes in 2 seconds (1 thread). In the event that you need to compare some new hashes against a big database it is even faster (200 ms) and less memory-consuming, because the hash table is only filled with the n-grams of the new hashes.