jeresig closed 8 years ago
Hello John,
Thank you very much for your interesting contribution. Being a daily user of jQuery, I am honored to receive patches from you! I will look carefully at your patches and come back with comments. For your information, I am myself working on a tag system that lets Pastec associate a string with each image.
Hello John,
I just have general comments.
If --cache-words is not used, the search by image id will be very slow. If it is used, Pastec will consume a lot more memory. This is exactly what your figures show, but this should be documented. As the documentation is not currently on GitHub, I will take care of that.
With the --cache-words option, you are actually keeping the forward index in memory:
https://en.wikipedia.org/wiki/Search_engine_indexing#The_forward_index
What do you think about renaming the option to --forward-index?
Thanks!
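To illustrate the trade-off discussed above, here is a minimal sketch of an inverted index with an optional forward index. It is not Pastec's actual code: the names ToyIndex, addImage, wordsOf, similarTo, and keepForward are hypothetical, and the similarity measure (counting shared visual words) is a simplification chosen only to make the example small.

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only; these are not Pastec's classes.
using ImageId = std::uint32_t;
using WordId = std::uint32_t;

struct ToyIndex {
    // Inverted index: visual word -> images containing that word.
    std::unordered_map<WordId, std::vector<ImageId>> inverted;
    // Forward index: image -> its visual words. Only filled when enabled,
    // which is where the extra memory goes.
    std::unordered_map<ImageId, std::vector<WordId>> forward;
    bool keepForward = false;

    void addImage(ImageId id, const std::vector<WordId>& words) {
        for (WordId w : words) inverted[w].push_back(id);
        if (keepForward) forward[id] = words;
    }

    // Recover the visual words of an image that is already indexed.
    std::vector<WordId> wordsOf(ImageId id) const {
        std::vector<WordId> words;
        if (keepForward) {
            // Fast path: a single hash lookup.
            auto it = forward.find(id);
            if (it != forward.end()) words = it->second;
            return words;
        }
        // Slow path: scan every posting list of the inverted index.
        for (const auto& entry : inverted)
            for (ImageId other : entry.second)
                if (other == id) words.push_back(entry.first);
        return words;
    }

    // Count, for every other image, how many visual words it shares
    // with the given image.
    std::map<ImageId, int> similarTo(ImageId id) const {
        std::map<ImageId, int> votes;
        for (WordId w : wordsOf(id)) {
            auto it = inverted.find(w);
            if (it == inverted.end()) continue;
            for (ImageId other : it->second)
                if (other != id) ++votes[other];
        }
        return votes;
    }
};
```

Both paths return the same votes. The difference is that without the forward index, wordsOf has to walk the whole inverted index for every query, while with it the index stores each image's word list a second time.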
@magwyz Great call about the name and documentation. I've renamed the option to --forward-index and have renamed a number of the variables and method arguments as well. I've also merged with master to make sure I'm current.
Let me know if I can help with the documentation at all. Would moving a Markdown copy to GitHub be useful?
Thank you, John, for your contribution! Moving a Markdown copy of the documentation to GitHub would definitely make sense. It is on my TODO list.
@magwyz Thank you so much for merging this -- I'm very happy to contribute! I will be sending some more pull requests your way quite soon.
Thank you so much for creating the amazing Pastec project, @magwyz! This pull request adds a new API endpoint:
When you access it, providing the ID of an image that's already in the index, it returns a set of similar images that are also in the index. This means you no longer have to re-upload an image to see which images are similar to it. Depending on network latency and the size of the image, this may improve performance.
Additionally, an optional image -> word cache is added (it can be enabled via the command-line option --cache-words) to dramatically improve performance, at the expense of memory usage. My measurements were taken with an index of 59,041 images at 419 MB.
I'm sure many improvements can be made to this code; this is my first time writing C++ in many years, so feedback is most appreciated! I'm planning on contributing a number of other pull requests as well, namely: being able to configure the maximum number of occurrences for a word, set a string name for an image instead of a number, and set a default index location.
(This branch unfortunately includes @ryanfb's Mac-platform pull request #21, as I needed it to build on my copy of OS X.)