Closed Hu6li closed 11 months ago
The default format of the hashlookup Bloom filter is SHA1 in upper-case. But it's indeed a good idea if other sources use a different format. I'll update the PR to include it as an option to avoid doing the double check by default.
In the near future, we would like to create a hashlookup format definition which includes the type of encoding and canonicalization used in the Bloom filter.
Thank you for the pull request and the very good point. I fixed it by adding an option for the lower-case lookup.
If you see something else, let me know.
Perfect, thanks for your reply.
I was concerned about the performance as well and therefore gave it another thought. Maybe another approach could be to check for upper-case hashes first in the or-expression, since Python's logical `or` operator uses short-circuit evaluation:
In short-circuit evaluation, the second operand is only evaluated if the first operand does not determine the outcome of the entire expression.
This would mean that by default the lookup would be as fast as before, but if a Bloom filter with lower-case values were used, the second operand would be evaluated (and the lookup would take longer).
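A minimal sketch of the short-circuit idea, with a plain Python set standing in for the Bloom filter (the set, function name, and hash value here are illustrative assumptions, not the project's actual code):

```python
# A set stands in for the Bloom filter; membership tests behave the same way.
bf = {b"DA39A3EE5E6B4B0D3255BFEF95601890AFD80709"}  # upper-case SHA1 entries

def lookup(value: str) -> bool:
    # Check the (default) upper-case form first: with an upper-case filter
    # the first operand is already True, so the second is never evaluated.
    return value.upper().encode() in bf or value.lower().encode() in bf

print(lookup("da39a3ee5e6b4b0d3255bfef95601890afd80709"))  # True
```

With an upper-case filter the lower-case check is skipped entirely, so the common case pays no extra cost.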
Not sure which approach would be better, but the optional one has its own advantages as well.
Thanks for accepting and adding this fix.
I see. Thank you very much for the feedback. Maybe we should improve the Bloom filter selection at some point when there are multiple ones to choose from.
On line 123 the hash value of a file is checked against a Bloom filter passed via arguments. If the Bloom filter was generated using upper-case characters, the lookup will report the hash as unknown even if it is inside the set.
My first approach was to use: `if value.encode() in map(str.lower, bf['bf']):`
Unfortunately, bf is not iterable, so I solved it using an `or`.
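The workaround can be sketched as follows. A small stand-in class mimics the relevant property of the real Bloom filter (it supports membership tests but not iteration); the class, dictionary layout, and hash value are illustrative assumptions:

```python
class FakeBloomFilter:
    """Stand-in for the Bloom filter: supports `in` checks only."""
    def __init__(self, items):
        self._items = set(items)

    def __contains__(self, item):
        return item in self._items
    # No __iter__ or __getitem__, so iterating (e.g. via map()) fails.

bf = {'bf': FakeBloomFilter([b"DA39A3EE5E6B4B0D3255BFEF95601890AFD80709"])}
value = "da39a3ee5e6b4b0d3255bfef95601890afd80709"

# map(str.lower, bf['bf']) would raise TypeError because the filter is
# not iterable, so check both case variants with a short-circuiting `or`:
found = value.encode() in bf['bf'] or value.upper().encode() in bf['bf']
print(found)  # True
```

Since a Bloom filter only answers membership queries, per-element normalization of its contents is not possible; testing both case variants of the query is the practical alternative.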