Add In-Index Similarity Search

jeresig commented 8 years ago

Thank you so much for creating the amazing Pastec project, @magwyz! This pull request adds a new API endpoint:

$ curl http://127.0.0.1:4212/index/images/1221010341
{"bounding_rects":[{"height":356,"width":291,"x":34,"y":31},{"height":315,"width":290,"x":34,"y":71}],"image_ids":[1221010341,2694417911],"scores":[577,78],"type":"SEARCH_RESULTS"}

When you access it with the ID of an image that is already in the index, it returns the set of similar images that are also in the index. This means you no longer have to re-upload an image to see which images are similar to it. Depending on network latency and the size of the image, this can improve performance.
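As an illustration of how a client might consume this response, here is a minimal Python sketch. It parses the sample JSON from the curl example above and pairs each image ID with its score and bounding rectangle. The function names (parse_search_results, similar_images) are hypothetical and not part of Pastec. Filtering out the queried image is an assumption based on the sample, where the queried ID comes back as its own first match.

```python
import json

# Sample response copied from the curl example in this pull request.
RESPONSE = (
    '{"bounding_rects":[{"height":356,"width":291,"x":34,"y":31},'
    '{"height":315,"width":290,"x":34,"y":71}],'
    '"image_ids":[1221010341,2694417911],"scores":[577,78],'
    '"type":"SEARCH_RESULTS"}'
)


def parse_search_results(body):
    """Pair each matched image id with its score and bounding rectangle."""
    data = json.loads(body)
    if data.get("type") != "SEARCH_RESULTS":
        raise ValueError("unexpected response type: %r" % data.get("type"))
    return [
        {"image_id": image_id, "score": score, "rect": rect}
        for image_id, score, rect in zip(
            data["image_ids"], data["scores"], data["bounding_rects"]
        )
    ]


def similar_images(body, query_id):
    """Drop the queried image itself (hypothetical helper, see lead-in)."""
    return [m for m in parse_search_results(body) if m["image_id"] != query_id]


matches = similar_images(RESPONSE, query_id=1221010341)
```

For the sample response, matches holds a single entry for image 2694417911 with score 78.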

Additionally, an optional image -> word cache is added (enabled via the command-line option --cache-words) to dramatically improve performance at the expense of memory usage.

Type	Time to Respond	Memory Usage
Cached In-Index Search	1.02-1.14s	877MB
Un-cached In-Index Search	2.40-2.71s	625MB
Image Upload Search	3.43-3.55s	n/a

This is with an index of 59,041 images at 419MB.
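To put the table in perspective, the following Python sketch works out the speedup and the extra memory per image from the figures above. Using the midpoint of each time range, and treating 1 MB as 1000 KB, are assumptions made here for illustration. They are not how the measurements were originally taken.

```python
# Figures copied from the table above (times in seconds, memory in MB).
cached = {"time": (1.02, 1.14), "memory_mb": 877}
uncached = {"time": (2.40, 2.71), "memory_mb": 625}
upload = {"time": (3.43, 3.55)}
indexed_images = 59041


def midpoint(time_range):
    """Middle of a (low, high) range; an assumed summary statistic."""
    low, high = time_range
    return (low + high) / 2


# How many times faster the cached search is than the alternatives.
speedup_vs_uncached = midpoint(uncached["time"]) / midpoint(cached["time"])
speedup_vs_upload = midpoint(upload["time"]) / midpoint(cached["time"])

# Memory cost of the cache, in total and per indexed image.
extra_memory_mb = cached["memory_mb"] - uncached["memory_mb"]
extra_kb_per_image = extra_memory_mb * 1000 / indexed_images
```

Under these assumptions the cached search is roughly 2.4 times faster than the un-cached search and roughly 3.2 times faster than an upload search. The cache costs 252 MB in total, or a little over 4 KB per image.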

I'm sure many improvements can be made to this code. This is my first time writing C++ in many years, so feedback is most appreciated! I'm planning on contributing a number of other pull requests as well, namely being able to configure the maximum number of occurrences for a word, set a string name for an image instead of a number, and set a default index location.

(This branch unfortunately includes @ryanfb's Mac-platform pull request #21, as I needed it to get it to build on my copy of OSX.)

magwyz commented 8 years ago

Hello John,

Thank you very much for your interesting contribution. Being a daily user of jQuery, I am honored to receive patches from you! I will look carefully at your patches and come back with comments. For your information, I am myself working on a tag system that allows a string to be associated with each image in Pastec.

magwyz commented 8 years ago

Hello John,

I just have general comments.

With big indexes (around 1 million images), searching by image ID will be very slow if --cache-words is not used. If it is used, Pastec will consume a lot more memory. This is exactly what your figures show, but it should be documented. As the documentation is not currently on GitHub, I will take care of that.
With the --cache-words option, you are actually keeping the forward index in memory (see https://en.wikipedia.org/wiki/Search_engine_indexing#The_forward_index). What do you think about renaming the option to --forward-index?
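The trade-off described in these two comments can be sketched with a toy index in Python. This is not Pastec's implementation. The class TinyIndex, its methods, and the scoring by shared word count are all hypothetical and only illustrate the idea. Without a forward index, recovering an image's words means scanning every posting list in the inverted index. With one, it is a single lookup, paid for with extra memory.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index with an optional in-memory forward index."""

    def __init__(self, keep_forward_index=False):
        self.inverted = defaultdict(set)  # word id -> image ids
        # image id -> word ids, only kept when requested
        self.forward = {} if keep_forward_index else None

    def add_image(self, image_id, words):
        for word in words:
            self.inverted[word].add(image_id)
        if self.forward is not None:
            self.forward[image_id] = set(words)

    def words_of(self, image_id):
        if self.forward is not None:
            # One dictionary lookup, paid for with extra memory.
            return self.forward.get(image_id, set())
        # No forward index: scan every posting list to recover the words.
        return {w for w, ids in self.inverted.items() if image_id in ids}

    def search_by_id(self, image_id):
        """Rank other images by the number of words shared with image_id."""
        scores = defaultdict(int)
        for word in self.words_of(image_id):
            for other in self.inverted[word]:
                if other != image_id:  # the query image itself is skipped here
                    scores[other] += 1
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


for keep in (False, True):
    index = TinyIndex(keep_forward_index=keep)
    index.add_image(1, [10, 11, 12])
    index.add_image(2, [11, 12, 13])
    index.add_image(3, [12, 99])
    print(keep, index.search_by_id(1))
```

Both configurations print the same ranking, [(2, 2), (3, 1)]. Only the cost of words_of differs, and that difference grows with the size of the index.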

Thanks!

jeresig commented 8 years ago

@magwyz Great call about the name and documentation. I've re-named the option to be --forward-index and have re-named a number of the variables and method arguments, as well. I've also merged with master to make sure I'm current.

Let me know if I can help with the documentation at all. Would moving a Markdown copy to GitHub be useful?

magwyz commented 8 years ago

Thank you, John, for your contribution! Moving a Markdown copy of the documentation to GitHub would definitely make sense. It is on my TODO list.

jeresig commented 8 years ago

@magwyz Thank you so much for merging this -- I'm very happy to contribute! I will be sending some more pull requests your way quite soon.

magwyz / pastec

Add In-Index Similarity Search #23