rhsimplex / image-match

🎇 Quickly search over billions of images

Option to make ElasticSearch insertions synchronous / blocking? #94

Closed: duhaime closed this issue 6 years ago

duhaime commented 6 years ago

This morning I tried inserting some images into an ElasticSearch database then querying for the inserted images. I was surprised to see that every single image yielded no matches, as I expected queries to at least return the trivial match where an image matches itself:

from elasticsearch import Elasticsearch
from image_match.elasticsearch_driver import SignatureES
from glob import glob

es = Elasticsearch()
ses = SignatureES(es, timeout='1200s', distance_cutoff=0.4)

imgs = glob('images/*.jpg')
for i in imgs:
  ses.add_image(i)
  print(ses.search_image(i, all_orientations=True)) # this line prints [] results

Eventually I realized that my queries were returning no results because the query seemed to be executing before the image was indexed. To test this hypothesis, I slept a bit between insertion and query, and got the matches I expected.

This behavior surprised me, probably because I'm new to ElasticSearch. To help others who expect synchronous behavior, would there be any interest in adding an optional parameter to the ElasticSearch driver's insert_single_record method that lets users refresh the index (in the Python client, self.es.indices.refresh()) so that the newly inserted record becomes searchable right away? Evidently a refresh is what makes records inserted into an ES index searchable. ES team member Nik Everett says a refresh happens automatically once a second, but that's not fast enough for a simple loop like the one above.
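For anyone reading along, here is a rough sketch of the explicit-refresh idea, not the library's own implementation. It assumes the official Python client's `indices.refresh(index=...)` call and assumes the driver writes to an index named `images`. The helper name `add_images_then_refresh` is hypothetical.

```python
def add_images_then_refresh(ses, es, paths, index="images"):
    """Insert every image, force one refresh, then query each image.

    ses   -- a SignatureES-like object exposing add_image / search_image
    es    -- an Elasticsearch-like client exposing indices.refresh
    index -- index name; "images" is an assumption, adjust to your setup
    """
    for path in paths:
        ses.add_image(path)

    # A single refresh after the batch makes all pending inserts searchable.
    es.indices.refresh(index=index)

    # Each image should now at least match itself.
    return {path: ses.search_image(path, all_orientations=True) for path in paths}
```

Refreshing once per batch rather than once per document keeps the cost down, since every refresh creates a new segment.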

Right now I'm having each process on my host insert records into a distinct index, then enter a while loop that sleeps until the number of docs in the index equals the number that have been inserted. It would be great if this kind of synchronous behavior could be a part of image-match's ElasticSearch wrapper. I'm happy to submit a PR if that sounds like a reasonable feature!
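The polling workaround described above looks roughly like the sketch below. It assumes the Python client's `count(index=...)` call, which returns a dict with a `count` key. The function name `wait_until_indexed` and its parameters are made up for illustration.

```python
import time


def wait_until_indexed(es, index, expected, poll_interval=0.5, timeout=60.0):
    """Block until `index` reports at least `expected` searchable documents.

    Returns the observed count, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        count = es.count(index=index)["count"]
        if count >= expected:
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "only %d of %d documents visible in %r" % (count, expected, index)
            )
        time.sleep(poll_interval)
```

The timeout is there so a failed insert cannot leave a worker process spinning forever.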

duhaime commented 6 years ago

This feature was evidently reasonable enough to already be implemented! The elasticsearch driver's .insert_single_record() accepts a refresh_after argument; setting it to True accomplishes the intended behavior. Thanks again for this great work!
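Applied to the loop from the original report, that might look like the sketch below. It assumes that add_image forwards a refresh_after keyword to insert_single_record, which I have not verified against every release. The wrapper name `add_and_search` is hypothetical.

```python
def add_and_search(ses, paths):
    """Insert each image with an immediate refresh, then query it back.

    ses -- a SignatureES instance (or any object with the same two methods)
    """
    results = {}
    for path in paths:
        # Assumption: add_image passes refresh_after on to insert_single_record.
        ses.add_image(path, refresh_after=True)
        results[path] = ses.search_image(path, all_orientations=True)
    return results
```

With a real driver this would be called as add_and_search(ses, glob('images/*.jpg')). Refreshing on every insert is convenient for small scripts but slows down bulk loads.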