kayzhu / LSHash

A fast Python implementation of locality sensitive hashing.
MIT License

query method inefficiently implemented? #13

Closed phdowling closed 8 years ago

phdowling commented 8 years ago

In its current form, it seems that `LSHash.query` simply computes a hash for the query vector and then manually computes the Hamming distance to every point in every hash table. That means we are doing num_hashtables * num_indexed_points distance computations for every query.

Specifically, the problem is here: https://github.com/kayzh/LSHash/blob/master/lshash/lshash.py#L237 We should not need to iterate through all keys in the index just to query one vector.

From what I understand, the whole point of LSH is to do hashed lookups rather than scanning all the data. Here we compute hashes but then don't use the one feature that makes them so fast: the fact that we can put them into an index and do lookups in O(1), or at least O(log n). O(n) lookups are exactly what LSH is supposed to replace. This implementation may still be fast enough for some use cases, since Hamming distance computations can be cheaper than dot products on large vectors, but it does not scale with the number of samples in the index, which I would see as a serious problem.
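To illustrate the hashed-lookup approach described above: a minimal sketch, not the library's actual API (the names `lsh_key`, `planes`, and `query` are hypothetical). Points are bucketed by the sign pattern of random hyperplane projections, so a query costs one dict lookup rather than a scan over all indexed points.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 8, 4
# Random hyperplanes for a single hash table (illustrative only).
planes = rng.normal(size=(n_bits, dim))

def lsh_key(v):
    # Project onto the hyperplanes; the sign pattern is the bucket key.
    return tuple((planes @ v > 0).astype(int))

# Index: bucket key -> list of points in that bucket.
index = {}
points = rng.normal(size=(100, dim))
for p in points:
    index.setdefault(lsh_key(p), []).append(p)

def query(v):
    # O(1) dict lookup into the matching bucket instead of
    # computing a distance to every indexed point.
    return index.get(lsh_key(v), [])
```

The bucket's candidates can then be re-ranked by exact distance, so the expensive comparisons happen only against a small candidate set.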

Any thoughts on this? Please correct me if I got anything wrong. I don't mean to complain; I enjoy using this library, but it seems that a core part of it could be made more efficient.

phdowling commented 8 years ago

Ah, it turns out I misread the code: the n-lookups-per-sample scenario is only triggered when the Hamming distance measure is used. I'm not sure that scenario can be easily improved. Sorry about this, closing.
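For context on why the Hamming-distance case is hard to avoid scanning: an exact bucket lookup only finds identical keys, so ranking by Hamming distance either compares the query key against every stored key, or explicitly probes nearby buckets. A hedged sketch under assumed string keys (the names `query_scan` and `query_probe` are made up for illustration, not part of LSHash):

```python
from itertools import combinations

# Toy index: binary hash key -> stored values.
index = {"0101": ["a"], "0111": ["b"], "1100": ["c"]}

def hamming(k1, k2):
    # Number of differing bit positions between two equal-length keys.
    return sum(c1 != c2 for c1, c2 in zip(k1, k2))

def query_scan(key, max_dist=1):
    # Linear scan: touches every stored key, O(number of keys).
    return [v for k, vals in index.items()
            if hamming(key, k) <= max_dist for v in vals]

def query_probe(key, max_dist=1):
    # Multi-probe alternative: enumerate all keys within max_dist bit
    # flips and look each one up directly; cost grows with the number
    # of probes, not with the size of the index.
    out = list(index.get(key, []))
    for r in range(1, max_dist + 1):
        for pos in combinations(range(len(key)), r):
            probed = list(key)
            for i in pos:
                probed[i] = "1" if probed[i] == "0" else "0"
            out.extend(index.get("".join(probed), []))
    return out
```

Both return the same candidates here (`query_scan("0101")` and `query_probe("0101")` each find "a" and "b"), but the probe count explodes combinatorially as `max_dist` grows, which is presumably why a plain scan is the simpler choice in the library.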