Hi @rr-, thanks for your interest in the project.
To your questions:
I wouldn't set it up this way in a relational database. A more efficient way is to look up on the individual word columns without the join. You can even iterate through them randomly. This is usually fast because matching images will match on more than one word, but you don't need to check them all -- a match is a match as soon as the computed distance is below your target threshold (one pitfall is to make sure to limit the results -- if a word returns many matches, it's not really useful for discriminating between images). Elasticsearch, being a full-text search engine, makes it easy to search over all the fields simultaneously, but the algorithm wasn't designed with this capability in mind.
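A minimal sketch of that per-word strategy (the wide `image_words` table, the column names, the threshold, and the row limit below are illustrative assumptions, not the project's actual schema; `conn` is any DB-API connection, e.g. from `psycopg2`):

```python
import random
import numpy as np

N_WORDS = 63            # number of word columns (the project's default N)
DIST_THRESHOLD = 0.4    # assumed target threshold on the normalized distance
PER_WORD_LIMIT = 1000   # a word matching this many rows isn't discriminating

def normalized_distance(a, b):
    """Distance between two signatures, normalized by their magnitudes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

def find_match(conn, query_signature, query_words):
    """Probe one word column at a time; stop as soon as a signature is close enough."""
    positions = list(range(N_WORDS))
    random.shuffle(positions)                  # iterate through the words randomly
    with conn.cursor() as cur:
        for pos in positions:
            # hypothetical wide table: one indexed column per word plus the raw signature
            cur.execute(
                f"SELECT image_id, signature FROM image_words "
                f"WHERE word_{pos} = %s LIMIT %s",
                (query_words[pos], PER_WORD_LIMIT),
            )
            rows = cur.fetchall()
            if len(rows) >= PER_WORD_LIMIT:    # too many hits: skip this word
                continue
            for image_id, signature in rows:
                if normalized_distance(query_signature, signature) < DIST_THRESHOLD:
                    return image_id            # a match is a match -- stop here
    return None
```

The point of the single-column lookups is that each one can be answered from one index, and the loop can bail out long before all the words have been checked.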
We never did a lot of work on optimizing `N` and `k` since we weren't space constrained, but I know the algorithm as it is is pretty space-inefficient (this also means varying `N` and `k` isn't well-tested, so be careful). The original value of `N=63` comes from a limitation of MongoDB (our original backend), and is arbitrary. It's very possible you might get similar performance by lowering either. See also #2.
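As a back-of-envelope illustration of what lowering `N` buys in a one-row-per-word layout (the per-row byte figure is only derived from the 75 MB / ~600k-rows numbers quoted elsewhere in this thread; whether matching quality survives a lower `N` is exactly the untested part):

```python
# With one lookup row per word, each image adds N rows:
# 10k images at N=63 -> 630k rows, in line with the ~600k rows reported in this thread.
BYTES_PER_LOOKUP_ROW = 75 * 1024 ** 2 / 630_000   # ~125 bytes/row, rough estimate

for n_words in (63, 32, 16):
    rows_per_10k = 10_000 * n_words
    size_mb = rows_per_10k * BYTES_PER_LOOKUP_ROW / 1024 ** 2
    print(f"N={n_words}: {rows_per_10k:,} lookup rows per 10k images, ~{size_mb:.0f} MB")
```

Storage scales linearly with `N`, so halving it roughly halves the lookup table; the open question is how far it can drop before near duplicates (JPEG recompressions and the like) stop sharing enough words.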
Please let me know if there's any additional help I can offer, and if you're OK with sharing your driver code, I'd be happy to take a look.
I'm asking this question since I think it might benefit other people.
I made a driver that stores the hashes in a PostgreSQL database. My test database contains 25k images. The query process looks roughly like this:

1. Compute the signature and lookup words of the query image (for a test image the words look like this: `[12895189, 2517912795, 72159172, 1275215791, ...]`).
2. Get IDs and signatures of relevant images, using the lookup words in a query similar to the sketch below (for a test image this returns about 7.5k images). `image.image_id`, `image_signature_lookup.image_id` and `image_signature_lookup.word` are all indexed.
3. Using a `ProcessPoolExecutor` and `numpy`, compute the distances and assemble the actual search results.

A test query made this way takes 1.2 s. What I'm worried about is scaling this solution - for every 10k images, the database grows by about 75 MB in size, and over 600 000 lookup records are created.
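A sketch of the step-2 query and the step-3 distance pass (the `signature` column, the exact SQL, and the 0.45 cutoff are illustrative assumptions rather than the original driver code; the `ANY(%s)` parameter style assumes `psycopg2`, which adapts Python lists to SQL arrays):

```python
import numpy as np

def candidate_rows(conn, lookup_words):
    """Step 2: fetch IDs and signatures of every image sharing at least one lookup word."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT DISTINCT image.image_id, image.signature "
            "FROM image "
            "JOIN image_signature_lookup "
            "  ON image_signature_lookup.image_id = image.image_id "
            "WHERE image_signature_lookup.word = ANY(%s)",
            (lookup_words,),
        )
        return cur.fetchall()

def rank_matches(query_signature, rows, cutoff=0.45):
    """Step 3: compute normalized distances with numpy and keep the close candidates."""
    query = np.asarray(query_signature, dtype=float)
    results = []
    for image_id, signature in rows:
        sig = np.asarray(signature, dtype=float)
        dist = np.linalg.norm(query - sig) / (np.linalg.norm(query) + np.linalg.norm(sig))
        if dist < cutoff:
            results.append((dist, image_id))
    return sorted(results)
```

The `ProcessPoolExecutor` part just splits the ~7.5k candidate rows into chunks and runs the distance computation on each chunk in a separate process.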
My questions are:

1. Is this a sensible way to use the lookup words with a relational database?
2. What `N` and `k` values, besides the default ones, could bring down the database size while still being useful for detecting near duplicates (± JPEG artifacts etc.)?