rhsimplex / image-match

🎇 Quickly search over billions of images

(Question) Storing hashes in RDBMS #50

Closed rr- closed 7 years ago

rr- commented 7 years ago

I'm asking this question since I think it might benefit other people.

I made a driver that stores the hashes in a PostgreSQL database. My test database contains 25k images. The query process looks roughly like this:

  1. Choose an image to query the database for
  2. Compute the image's signature and lookup words (let's say the words are [12895189, 2517912795, 72159172, 1275215791, ...])
  3. Get IDs and signatures of relevant images, using the lookup words in a query similar to this:

    SELECT DISTINCT image.image_id, image.image_signature
    FROM image
    INNER JOIN image_signature_lookup ON image.image_id = image_signature_lookup.image_id
    WHERE image_signature_lookup.word IN (12895189, 2517912795, 72159172, 1275215791, ...)

    (for a test image this returns about 7.5k images)

    image.image_id, image_signature_lookup.image_id and image_signature_lookup.word are all indexed.

  4. Compute the final distances with ProcessPoolExecutor and numpy, assemble the actual search results
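
For step 4, here's roughly what the distance computation looks like, vectorized with numpy over everything the query returns (the function names and the 0.45 cutoff are placeholders; the formula mirrors image-match's normalized distance, and I'm assuming the stored signature deserializes back into a 1-D array):

    import numpy as np

    def normalized_distances(query_sig, candidate_sigs):
        # Vectorized form of image-match's normalized distance:
        # ||a - b|| / (||a|| + ||b||)
        a = np.asarray(candidate_sigs, dtype=np.float64)  # shape (n_candidates, sig_len)
        b = np.asarray(query_sig, dtype=np.float64)       # shape (sig_len,)
        return np.linalg.norm(a - b, axis=1) / (np.linalg.norm(a, axis=1) + np.linalg.norm(b))

    def assemble_results(query_sig, rows, threshold=0.45):
        # rows: (image_id, signature) tuples fetched by the SQL query above
        ids, sigs = zip(*rows)
        dists = normalized_distances(query_sig, sigs)
        keep = dists < threshold
        return sorted(zip(np.asarray(ids)[keep].tolist(), dists[keep].tolist()),
                      key=lambda pair: pair[1])  # [(image_id, distance), ...] best first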

A test query made this way takes 1.2 s. What I'm worried about is scaling this solution: for every 10k images, the database grows by about 75 MB, and over 600,000 lookup records (roughly one row per word per image) are created.

My questions are:

  1. is this anywhere near as fast as it should be, or should I set up Elasticsearch after all?
  2. what N and k values, besides the defaults, could bring down the database size while still being useful for detecting near duplicates (allowing for JPEG artifacts etc.)?
rhsimplex commented 7 years ago

Hi @rr-, thanks for your interest in the project.

To your questions:

  1. I wouldn't set it up this way in a relational database. A more efficient way is to look up each word in its own column, without the join, and you can even iterate through the words in random order. This is usually fast because matching images will match on more than one word, but you don't need to check them all: a match is a match as soon as the computed distance falls below your target threshold. (One pitfall: make sure to limit the results per word; if a word returns many matches, it isn't really useful for discriminating between candidates.) Elasticsearch, being a full-text search engine, makes it easy to search over all the fields simultaneously, but the algorithm wasn't designed with this capability in mind. There's a rough sketch of the per-word lookup after this list.

  2. We never did much work on optimizing N and k since we weren't space-constrained, but I know the algorithm as implemented is pretty space-inefficient (this also means varying N and k isn't well tested, so be careful). The original value of N=63 comes from a limitation of MongoDB (our original backend) and is otherwise arbitrary. It's quite possible you'd get similar performance by lowering either. See also #2.
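
To make point 1 concrete, here is a rough sketch of the per-word lookup, assuming a single table that stores the signature plus one integer column per word (word_0 ... word_62) and a psycopg2-style cursor; the table/column names, the LIMIT, and the distance cutoff are placeholders, not part of image-match:

    import random
    import numpy as np

    WORD_LIMIT = 100     # placeholder: skip words that match too many images
    DIST_CUTOFF = 0.45   # placeholder threshold for "a match is a match"

    def normalized_distance(a, b):
        # image-match-style normalized distance: ||a - b|| / (||a|| + ||b||)
        a = np.asarray(a, dtype=np.float64)
        b = np.asarray(b, dtype=np.float64)
        return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

    def find_match(cursor, query_sig, query_words):
        # Check one word column at a time, in random order, and stop at the
        # first candidate whose distance is below the cutoff.
        indexed_words = list(enumerate(query_words))
        random.shuffle(indexed_words)
        for i, word in indexed_words:
            # Column names come from a fixed list (word_0 ... word_62), so
            # formatting them into the SQL string is safe here.
            cursor.execute(
                "SELECT image_id, signature FROM image_signature "
                "WHERE word_{} = %s LIMIT %s".format(i),
                (word, WORD_LIMIT + 1),
            )
            rows = cursor.fetchall()
            if len(rows) > WORD_LIMIT:
                continue             # too common to be discriminating
            for image_id, signature in rows:
                if normalized_distance(query_sig, signature) < DIST_CUTOFF:
                    return image_id  # stop at the first good-enough match
        return None

The per-word LIMIT is what keeps this fast: a word shared by thousands of images tells you almost nothing, so it's cheaper to skip it and move on to the next word.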

Please let me know if there's any additional help I can offer, and if you're OK with sharing your driver code I'd be happy to take a look.