hashlookup / hashlookup-forensic-analyser

Analyse a forensic target (such as a directory) to find and report files found and not found from CIRCL hashlookup public service - https://circl.lu/services/hashlookup/
https://hashlookup.github.io/hashlookup-forensic-analyser/

support BF with lower and/or upper case hashes #15

Closed Hu6li closed 11 months ago

Hu6li commented 12 months ago

On line 123, the hash value of a file is checked against a Bloom filter passed as an argument. If the Bloom filter was generated using upper-case characters, the result will be unknown even if the hash is inside the set.

My first approach was to use: if value.encode() in map(str.lower,bf['bf']):

Unfortunately, bf is not iterable, so I solved it using an or.
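As a rough illustration of the point above, here is a minimal sketch of why mapping over the filter fails and what an or-based check could look like. The `TinyBloom` class and the `known` helper are hypothetical stand-ins written for this example; they are not the Bloom filter library or the code used by hashlookup-forensic-analyser.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter (hypothetical): supports `in`, but cannot be iterated."""

    def __init__(self, size_bits=1 << 16, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        # Membership test only: the original items are not stored,
        # so there is nothing to iterate over or to lower-case.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def known(value: str, bf) -> bool:
    """Check both spellings of the hash against the filter (assumed approach)."""
    return value.upper().encode() in bf or value.lower().encode() in bf
```

Because a Bloom filter keeps only bit positions, the stored hashes cannot be normalised after the fact; the only thing that can be normalised is the value being looked up, which is why both spellings are tried.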

adulau commented 11 months ago

The default format of the hashlookup Bloom filter is SHA1 in upper-case. But it's indeed a good idea, if there are other sources using a different format. I'll update the PR to include it as an option to avoid doing the double check by default.

In the near future, we would like to create a hashlookup format definition which includes the type of encoding and canonicalization used in the Bloom filter.

adulau commented 11 months ago

Thank you for the pull request and the very good point. I fixed it by adding an option for the lower-case lookup.

https://github.com/hashlookup/hashlookup-forensic-analyser/commit/d0410cd6438fa8b8c9c0543979adf23d557aaacb
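For readers who do not want to open the commit, an opt-in lower-case lookup could be sketched roughly as follows. The flag name `--lowercase-lookup` and the `is_known` helper are hypothetical and chosen only for this example; the actual option name and implementation are in the commit linked above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Bloom filter lookup sketch")
    # Hypothetical flag name; the real option is defined in the linked commit.
    parser.add_argument(
        "--lowercase-lookup",
        action="store_true",
        help="also test the lower-case form of each hash against the filter",
    )
    return parser


def is_known(value: str, bf, lowercase_lookup: bool = False) -> bool:
    """Test the upper-case form first; try lower-case only when asked to."""
    if value.upper().encode() in bf:
        return True
    return lowercase_lookup and value.lower().encode() in bf
```

With this shape, the default path performs a single membership test per file, and the second test is paid for only when the user enables the option.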

If you see something else, let me know.

Hu6li commented 11 months ago

Perfect, thanks for your reply.

I was concerned about the performance as well and therefore gave it another thought. Maybe another approach could be to check for upper-case hashes first in the or-operation, since Python's logical or uses short-circuit evaluation:

In short-circuit evaluation, the second operand is only evaluated if the first operand does not determine the outcome of the entire expression.

This would mean that, by default, the lookup would be as fast as usual, but if a Bloom filter with lower-case values was supplied, the second operand would also be evaluated (and thus take longer).

Not sure which approach would be better, but the optional one has its own advantages as well.
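The short-circuit behaviour described above can be checked with a small sketch. `CountingFilter` is a hypothetical stand-in that counts membership tests (it is backed by a plain set, not a real Bloom filter), and `lookup` is an assumed form of the or-based check.

```python
class CountingFilter:
    """Hypothetical stand-in for a Bloom filter that counts membership tests."""

    def __init__(self, members):
        self._members = set(members)
        self.lookups = 0

    def __contains__(self, item):
        self.lookups += 1
        return item in self._members


def lookup(value: str, bf) -> bool:
    # The right-hand operand is evaluated only if the left-hand one is False.
    return value.upper().encode() in bf or value.lower().encode() in bf


upper_bf = CountingFilter([b"ABC123"])
lower_bf = CountingFilter([b"abc123"])

print(lookup("abc123", upper_bf), upper_bf.lookups)  # True 1
print(lookup("abc123", lower_bf), lower_bf.lookups)  # True 2
```

In this sketch, a hit on the upper-case form costs one membership test, while a lower-case filter (or a hash that is not present at all) costs two.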

Thanks for accepting and adding this fix.

adulau commented 11 months ago

I see. Thank you very much for the feedback. Maybe we should improve the Bloom filter selection at some point when there are multiple ones to choose from.