magwyz / pastec

Image recognition open source index and search engine
http://pastec.io
GNU Lesser General Public License v3.0
620 stars 175 forks

Scaling #37

Closed sosedoff closed 8 years ago

sosedoff commented 8 years ago

Hey there, first of all - thanks for the awesome tool. I tried Pastec via the Docker image and it turned out to be great for a very specific task: matching images.

However, I found that it's hard to scale the Pastec server under load. What I mean is that there are always multiple write and read operations in flight, where a write is a PUT request to index a new image and a read is a search for a user-generated image. My current setup is pretty simple: an 8 GB RAM droplet on DigitalOcean with 10 write workers that are constantly adding images and a few search workers that perform similarity searches.

I haven't dug into the codebase, but from what I see right now, the backend is very slow when running search queries during the inserts. In the setup I described above, it takes approximately 12-16 seconds to run a single search query, with the user-provided image being ~50 KB and 300px wide.

So my question is pretty simple - how does one scale out Pastec? I know there's a paid offering on the website, but I was wondering if there's something that could be done before moving to a paid plan.

magwyz commented 8 years ago

There is a global read-write lock that protects the index, so the current version of Pastec does not handle concurrent insertions and searches well. It would be better to dedicate one instance to adding images and another one to search requests, and to synchronize them at some point.
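
A minimal sketch of that split, as a small Python helper that only builds the HTTP requests, might look like the following. The host names `pastec-writer` and `pastec-searcher` are hypothetical. The endpoint paths (`/index/images/<id>`, `/index/searcher`, `/index/io`) and the `WRITE`/`LOAD` payloads are written from memory of the Pastec HTTP API and should be checked against the API documentation. The sketch also assumes both instances can read the same index file path, for example through a shared volume.

```python
import json
from urllib.request import Request

# Hypothetical hosts: one instance takes all writes, the other serves searches.
WRITER = "http://pastec-writer:4212"
SEARCHER = "http://pastec-searcher:4212"


def add_image_request(image_id, image_bytes):
    """Build the request that indexes an image on the writer instance."""
    url = f"{WRITER}/index/images/{image_id}"
    return Request(url, data=image_bytes, method="PUT")


def search_request(image_bytes):
    """Build the request that searches for an image on the searcher instance."""
    return Request(f"{SEARCHER}/index/searcher", data=image_bytes, method="POST")


def io_request(base_url, action, index_path):
    """Build an index I/O request (assumed payload: type + index_path)."""
    body = json.dumps({"type": action, "index_path": index_path}).encode()
    return Request(f"{base_url}/index/io", data=body, method="POST")


def sync_requests(index_path):
    """Periodic sync: the writer saves its index, then the searcher loads it."""
    return [
        io_request(WRITER, "WRITE", index_path),
        io_request(SEARCHER, "LOAD", index_path),
    ]
```

Each request can be sent with `urllib.request.urlopen(req)`. The requests returned by `sync_requests` have to be sent in order, on whatever schedule is acceptable for how stale the search index may be. Searches on the second instance will presumably still pause while it reloads the index, but they no longer compete with every single insert.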

Also, I have previously observed very poor performance on cloud providers' virtual machines, as Pastec uses RAM and CPU intensively. I always advise people to use physical machines.