miqwit opened this pull request 7 years ago
Hi @miqwit thanks for the PR! I'll be away at a conference this week, but I'll try to look over this ASAP.
I think this approach is a very good idea. But isn't the positional precision of the individual words lost here? I mean, with separate fields, the input word1 is compared exclusively to the other word1 values in the ES storage, whereas with the flat approach, word1 is matched against all words simultaneously.
If so, this approach certainly works just as well as, or even better than, the original approach for a few hundred thousand images. But at some point (maybe a billion images), the flat search might become less efficient because of many false-positive matches due to the lost positional precision...?
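To make the concern concrete, here is a minimal sketch (word values and field names are hypothetical) of a case where the flat search would match but the positional search would not:

```python
# Minimal illustration of the positional precision lost by the flat
# approach: the stored image has the same words as the query, but in
# swapped positions.
query  = {"simple_word_1": 123, "simple_word_2": 456}
stored = {"simple_word_1": 456, "simple_word_2": 123}

# Field-per-word search: each query word is compared only to the
# stored word in the same position -- here, nothing matches.
fields_matches = sum(query[f] == stored[f] for f in query)

# Flat search: every query word is compared to every stored word --
# here, both words match.
flat_matches = len(set(query.values()) & set(stored.values()))

print(fields_matches, flat_matches)  # 0 2
```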
Cheers, Simon
I guess if enough words are used, it's not a problem. Another thought: how about using an array of integers for this in ES? An array might even be faster than the concatenated string...
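A minimal sketch of what that mapping could look like (the index name, doc type, and field name here are hypothetical, not the driver's actual names):

```python
# Hedged sketch of the integer-array idea: in Elasticsearch, any field
# can hold an array of values of its declared type, so a "long" field
# can store the whole word list as one JSON array.
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index="images", body={
    "mappings": {
        "image": {
            "properties": {
                "simple_words": {"type": "long"}
            }
        }
    }
})

# A document would then carry the words as a plain integer array.
es.index(index="images", doc_type="image",
         body={"simple_words": [123, 456, 789]})
```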
Hi Simon.
Thanks for taking time to review this.
Yes, you are right: the precision is lost, and with billions of images it would be better to increase the number of words to match. I could not come up with a formal proof for the number of words needed to obtain similar precision from the two techniques, so I used tests to show that the results are comparable.
Your thought is interesting; maybe an array is faster than a string of words. I will look into this, thanks for the tip. I'll let you know.
My idea was not to replace the way it's done today, but to offer another option (with its differences from the first one understood) that is faster and can be useful in more specific cases.
Mickaël.
Hi. I added a speed test with a "flatint" driver where the words are stored as an array of long ints. It is not as fast as the flat text approach. From my test suite (3000 searches), I get the following cumulative times, in seconds:
- 111.71158003807068 to search the fields documents
- 71.15072679519653 to search the flat documents
- 91.89284920692444 to search the flatint documents
PS: what is the proper way to access an Elasticsearch instance from Travis so my tests can run properly?
@miqwit are you still pursuing this? Also have you tried testing on a very high number of images? I would be curious to see a graph of the stats based on number of images.
Yes, I tested with this dataset: http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz, which contains 9,100 images. I guess that does not count as a "very high number of images". What volume would convince you? Can you suggest a dataset?
You can see my speed test in this file already: https://github.com/miqwit/image-match/blob/words-flat/tests/test_elasticsearch_driver_speed.py
I will work on a graph display, which will indeed be interesting.
I need to index about 8,530,641 images so I would consider that a high number! 😆
Yes, well, I won't test it on that amount :) I'll see if I can raise it to a more significant number. At least I'll pull some statistical data.
Hi @taylorjdawson. I improved my script to generate graphs (with matplotlib) of various performance metrics. I am in the process of validating the concept with a probabilistic approach (this should address @SimonSteinberger's thoughts in a more accurate fashion).
Still, you can run my latest benchmark by launching an ES with:

    docker run -d -p 9200:9200 -p 9300:9300 elasticsearch:5.5.2

and running the test with:

    python test_elasticsearch_driver_speed.py --delete-indices=True --populate-indices=True

This will generate PNG images with names starting with plot_*
You can find generated plots on my machine here
Here is the query time one:
All of these need more explanation; I plan to provide it soon (here, or in a public post somewhere).
Hey @miqwit did you ever figure out why the flat_txt is so much faster than the fields query?
Hello @heipei. No, not really... That would require diving into the ES engine. It is based on Lucene and open source, so I guess it's feasible, but I don't plan to do it. This work was just meant to demonstrate the behaviour.
My gut feeling is that the flat query does not test the position of the words, as @SimonSteinberger noted in an earlier comment. Testing field by field, I can imagine, takes more time while providing more "accuracy". But really, I don't know...!
Hello. Thanks for your great work on image-match.
This branch is a suggestion to store the document in ES in a "flat" way, i.e. "simple_words": "123 456 789", rather than with one field per word, i.e. {"simple_word_01": 123, "simple_word_02": 456, "simple_word_03": 789}. The results are comparable, and the correct image is always found.
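To make the two shapes concrete, here is a minimal sketch of the documents built from the same word list (word values are hypothetical, field names follow the description above):

```python
# Sketch of the two document shapes built from the same word list.
words = [123, 456, 789]

# One field per word -- the current storage scheme.
fields_doc = {"simple_word_%02d" % (i + 1): w for i, w in enumerate(words)}
# -> {'simple_word_01': 123, 'simple_word_02': 456, 'simple_word_03': 789}

# Flat variant proposed here: a single space-joined text field.
flat_doc = {"simple_words": " ".join(str(w) for w in words)}
# -> {'simple_words': '123 456 789'}
```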
In more detail (from the test run test_elasticsearch_driver_speed.py):
- The first result is always the correct image.
- In some cases (2.23%), the flat search returns fewer results (tested with 6 minimum_should_match values), probably due to TF/IDF scoring (see the query sketch after this list).
- In some cases (7.37%), the flat search returns more results than the regular search.
- 47.07% of the results from the flat search are the same as those of the fields search (the result tail).
- Out of 3000 searches, the flat search is 55.81% faster than the fields search (66.89 seconds instead of 104.24 seconds).
- For 9145 documents (the Caltech Object Categories dataset), the flat index is 10% smaller than the fields index (39.8 MB instead of 43.9 MB).
- For 9145 documents, it took 1.5% more time to ingest them into the flat index (217.34 seconds against 214.21 seconds).
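As referenced in the minimum_should_match item above, here is a hedged sketch of what the flat query could look like in the Elasticsearch query DSL (index and field names as in the sketches above, word values hypothetical):

```python
# Sketch of the flat search: a single match query on the text field,
# with minimum_should_match controlling how many of the query words
# must appear in a stored document for it to be returned.
from elasticsearch import Elasticsearch

es = Elasticsearch()
res = es.search(index="images", body={
    "query": {
        "match": {
            "simple_words": {
                "query": "123 456 789",
                # require at least 2 of the 3 words to match
                "minimum_should_match": "2"
            }
        }
    }
})
print(res["hits"]["total"])
```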
The Caltech Object Categories dataset can be found here: http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz
To run the test:
Install dependencies in a virtualenv:

    virtualenv ~/.env/image-match-words-flat/
    source ~/.env/image-match-words-flat/bin/activate
    pip install -r requirements.txt

Launch an Elasticsearch via docker:

    docker run -d --rm --name=es -p 9200:9200 -p 9300:9300 elasticsearch:latest

Run the test (around a 5 minute run):

    python tests/test_elasticsearch_driver_speed.py