rhsimplex / image-match

🎇 Quickly search over billions of images
2.94k stars 405 forks source link

Words flat #83

Open miqwit opened 7 years ago

miqwit commented 7 years ago

Hello. Thanks for your great work on image-match.

This branch is a suggestion to store in ES the document in a "flat" way, rather than a field per word, i.e. "simple_words": "123 456 789" rather than {"simple_word_01": 123, "simple_word_02": 456, "simple_word_03": 789}. The results are comparable. The correct image is always the one to be found.

In more details (from the test run test_elasticsearch_driver_speed.py):

The Caltech Object Categories dataset is to be found here: http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz

To run the test:

  1. Install dependancies in a virtualenv virtualenv ~/.env/image-match-words-flat/ source ~/.env/image-match-words-flat/bin/activate pip install -r requirements.txt

  2. Launch an elasticsearch via docker: docker run -d --rm --name=es -p 9200:9200 -p 9300:9300 elasticsearch:latest

  3. Run the test (around 5 minutes run): python tests/test_elasticsearch_driver_speed.py

rhsimplex commented 7 years ago

Hi @miqwit thanks for the PR! I'll be away at a conference this week, but I'll try to look over this ASAP.

SimonSteinberger commented 6 years ago

I think this approach is a very good idea. But isn't the positional precision of the individual words lost here? I mean, with having separate fields, input word1 is compared exclusively to other word1 values in the ES storage. With the flat approach, word1 is matched against all words simultaneously.

If so, this approach certainly works for a few hundreds of thousands of images just as well, or even better than the original approach. But at some point (maybe a billion images), flat search might be less efficient, because of many false positives matches due to the lost positional precision...?

Cheers, Simon

SimonSteinberger commented 6 years ago

I guess if enough words are used, it's not a problem. Another though: How about using an array of integers for this in ES? Maybe an array may even be faster than the concatenated string...

miqwit commented 6 years ago

Hi Simon.

Thanks for taking time to review this.

Yes, you are right. The precision is lost, and in case of billions of images it's better to increase the number of words to match. I could not come up with a formal proof to find the number of words needed to obtain a similar precision from the two techniques. I used tests to prove results were comparable.

Your thought is interesting, maybe an array is faster than a list of string. I will investigate on this. Thanks for the tip. I'll let you know.

My idea was not to replace the way it's done today, but to give another option (understanding the differences with the first option) which is faster and can be needed in more specific cases.

Mickaël.

2017-12-12 18:18 GMT+01:00 Simon Steinberger notifications@github.com:

I guess if enough words are used, it's not a problem. Another though: How about using an array of integers for this in ES? Maybe an array may even be faster than the concatenated string...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ascribe/image-match/pull/83#issuecomment-351120894, or mute the thread https://github.com/notifications/unsubscribe-auth/AG5G4Gv_WYcGF6LE7qxnvxYyqUVkBXy7ks5s_rVYgaJpZM4PuynC .

-- Mickaël. R&D Software Engineer

miqwit commented 6 years ago

Hi. I added a speed test with a "flatint" driver where words are stored as an array of lon ints. It is not as fast as the text array. From my test suite (3000 searches) I have the following results:

111.71158003807068 to search fields documents 71.15072679519653 to search flat documents 91.89284920692444 to search flatint documents

(cumulative time, in seconds)

ps: what is the proper way to access an elasticsearch from Travis to run my tests properly?

taylorjdawson commented 5 years ago

@miqwit are you still pursuing this? Also have you tried testing on a very high number of images? I would be curious to see a graph of the stats based on number of images.

miqwit commented 5 years ago

@miqwit are you still pursuing this? Also have you tried testing on a very high number of images? I would be curious to see a graph of the stats based on number of images.

Yes, I tested with this dataset: http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz, which is 9100 images. I guess that would not count as a "very high number of images". What volume would be appropriate to convince you? Do you suggest a dataset?

You can see my speed test in this file already: https://github.com/miqwit/image-match/blob/words-flat/tests/test_elasticsearch_driver_speed.py

I will work on a graph display, which will be indeed interesting.

taylorjdawson commented 5 years ago

I need to index about 8,530,641 images so I would consider that a high number! 😆

miqwit commented 5 years ago

Yes, well, I won't test it on that amount :) I'll see if I can raise it to a more significant number. At least I'll pull some statistical data.

miqwit commented 5 years ago

Hi @taylorjdawson. I improved my script by generating graphs (with matplotlib) about various performances. I am in the process of validating the concept with some probabilistic approach (this should answer @SimonSteinberger thoughts in a more accurate fashion).

Still, you can run my last benchmark by launching an ES with: docker run -d -p 9200:9200 -p 9300:9300 elasticsearch:5.5.2

and running the test with: python test_elasticsearch_driver_speed.py --delete-indices=True --populate-indices=True

This will generate png images starting with plot_*

miqwit commented 5 years ago

You can find generated plots on my machine here

Here is the query time one:

This needs more explanation about all of them, I plan to do this soon (here, or in a public post somewhere).

heipei commented 2 years ago

Hey @miqwit did you ever figure out why the flat_txt is so much faster than the fields query?

miqwit commented 2 years ago

Hello @heipei. No, not really... This would need to dive in the ES engine. It is based on Lucene and open source, so I guess it's feasible, but I don't plan to do it. This work was just to show it.

A gut feeling would be that it's not testing the position of the words, as stated above in a previous comment from @SimonSteinberger. By testing field by field, I can conceive that it takes more time, providing more "accuracy". But really, I don't know...!