martinholecekmax / DeepSearch


Recommendations for hosting DeepSearch online #1

Open ewajs opened 6 months ago

ewajs commented 6 months ago

Hey, hope you're doing great! First of all, thank you so much for this great open-source library. I've been playing with it for some time and am finding it very useful!

I was wondering if you have any recommendations for hosting this as a service online, specifically the specs needed to have it run relatively fast (<1 sec per query).

We have a library of ~3K images (stored in AWS S3, though I've already built the index offline) and would like an API-style service that takes a URL (of another S3 image not in the library) and returns the N closest matches. Maybe a Lambda would suffice?

Another thing I was wondering: would I need to keep a copy of the images alongside the index in order for the script not to delete them from the index? I'd like to avoid this. I built the index from resized versions of the images and renamed the files to match database IDs for consistency, and I'd rather not have to keep the files themselves in sync.

Thanks in advance and sorry to bother you! This is definitely not an issue in itself, just a request for advice.

martinholecekmax commented 6 months ago

Hey there,

First off, I’m really glad to hear that you’re finding the DeepSearch library useful! It’s always great to see the community engaging with the project and exploring its potential applications.

Regarding your query about hosting the service online, while DeepSearch offers a robust starting point for search functionalities, deploying it for the specific use case you mentioned might not be its best application. Here are a few suggestions based on your requirements:

Vector Database: For managing and searching through a library of ~3K images efficiently, I'd recommend a vector database like Milvus, Pinecone, or similar. These platforms are purpose-built for nearest-neighbour search over embedding vectors, scale well beyond your current library size, and can significantly speed up lookups.
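
To make that concrete, here's a minimal sketch using the classic pinecone-client API. The index name, environment, and 512-dimension embedding size are all placeholders, and the client has been redesigned across versions, so treat this as illustrative and check the current docs:

```python
# Minimal sketch with the classic pinecone-client API (treat as illustrative;
# the client has since been redesigned).
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")  # placeholders

# One-time setup: the dimension must match your model's embedding size.
pinecone.create_index("deepsearch-images", dimension=512, metric="cosine")
index = pinecone.Index("deepsearch-images")

# Upsert (id, vector, metadata): attaching your database ID and S3 key as
# metadata means the image files never need to live next to the index.
index.upsert(vectors=[
    ("db-id-123", [0.1] * 512, {"s3_key": "resized/123.jpg"}),
])

# Query: returns the N closest matches along with their metadata.
result = index.query(vector=[0.1] * 512, top_k=5, include_metadata=True)
```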

API Backend: To expose your search functionality as a service, consider using a framework like FastAPI or Django. Both are excellent choices for building a fast and scalable API backend. FastAPI, in particular, is known for its high performance and ease of use, which might be beneficial for your use case.
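
As a rough sketch of the FastAPI side: the helper `embed_image_from_url` and the `vector_index` client below are stubs standing in for your model inference and vector-database code, not part of any real API:

```python
# Hypothetical FastAPI wrapper: POST /search takes an image URL and returns
# the N closest matches. Run with: uvicorn main:app --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
vector_index = ...  # your vector-database client (e.g. the Pinecone index above)

class SearchRequest(BaseModel):
    url: str        # S3 URL of the query image
    top_k: int = 5  # number of matches to return

def embed_image_from_url(url: str) -> list[float]:
    """Stub: download the image and run your feature extractor on it."""
    raise NotImplementedError

@app.post("/search")
def search(req: SearchRequest):
    vector = embed_image_from_url(req.url)  # model inference
    result = vector_index.query(vector=vector, top_k=req.top_k,
                                include_metadata=True)  # vector-DB lookup
    return {"matches": result["matches"]}
```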

Task Queues: Given the nature of image search, processing requests might take a bit longer, especially as your library grows. Integrating a task queue like Celery could help manage long-running searches more efficiently. This way, you can handle search tasks asynchronously, ensuring that your API remains responsive.
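
And if searches start taking long enough to need queuing, the Celery side could look roughly like this (assuming a Redis broker; `embed_image_from_url` and `vector_index` are the same placeholders as in the FastAPI sketch):

```python
# Minimal Celery sketch, assuming a Redis broker/backend running locally.
from celery import Celery

celery_app = Celery("search_tasks",
                    broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")

@celery_app.task
def find_similar(url: str, top_k: int = 5):
    vector = embed_image_from_url(url)  # placeholder helper (see above)
    result = vector_index.query(vector=vector, top_k=top_k,
                                include_metadata=True)
    return result["matches"]

# In the API handler: enqueue instead of blocking the request thread.
# task = find_similar.delay("https://bucket.s3.amazonaws.com/img.jpg", 5)
# matches = task.get(timeout=30)
```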

Regarding your concerns about hosting and managing images: it's generally good practice to keep a copy of the images, or their processed versions, alongside the index, since that simplifies retrieval and avoids inconsistencies. If you'd rather not keep them, carefully managing the mapping between index entries and database IDs mitigates most of the risk, and a vector database handles this more elegantly still: you can attach each image's database ID and S3 key to its vector as metadata, so the files never need to live next to the index.

I hope these suggestions help you in setting up your service. Feel free to reach out if you have any more questions or need further assistance. And no bother at all – I'm here to help!

Best, Martin

ewajs commented 5 months ago

Hey, thank you so much for the thorough response!! This is a pretty clear outline of what I was looking for! I have one additional question if I may, and pardon my ignorance of some of the ML:

I've already experimented a lot with different models, metrics, and n_trees until I found a setup that yielded the best matches, so ideally I want to preserve that in whatever hosted solution I finally go for. That is, I want the system to output the same vectors it does now for a given image, but store them in a vector database instead (which I'm guessing replaces Annoy, in a sense). When querying, I would take an input image, run inference with the same model in the same setup, and do a lookup in the database. The database probably supports associating metadata with a vector, which would "solve" my need to keep the images under particular filenames hosted alongside the index file.

So, if I were to do that, the architecture could be something like this:

  1. I would need to extract from DeepSearch.py the code associated with model loading and inference (i.e. methods like extract).
  2. Replace the code that stores the output vectors in Annoy with calls that store them in the vector database (i.e. methods like set_path, start_feature_extraction, etc.).
  3. Do the same for search (i.e. methods like get_similar_images).
  4. Wrap it all in a web server and host it (rough sketch of steps 1 to 3 below).
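
Roughly, I imagine steps 1 to 3 looking something like this, using a Keras VGG16 extractor purely as a stand-in for whatever model/preprocessing I end up keeping, and the pinecone-style `index` from your sketch:

```python
# Rough sketch of steps 1 to 3: model inference feeding a vector database
# instead of Annoy. VGG16 is only a stand-in; in practice I'd reuse the exact
# model and preprocessing that produced my current Annoy vectors.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet", include_top=False, pooling="avg")  # 512-dim

def extract(path: str) -> list[float]:
    """Step 1: load one image and return its embedding vector."""
    img = image.load_img(path, target_size=(224, 224))
    arr = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(arr)[0].tolist()

# Step 2: store in the vector DB instead of building an Annoy index;
# `index` is the vector-database client (e.g. the Pinecone index above).
index.upsert(vectors=[("db-id-123", extract("resized/123.jpg"),
                       {"s3_key": "resized/123.jpg"})])

# Step 3: embed the query image the same way and look it up.
matches = index.query(vector=extract("/tmp/query.jpg"), top_k=5,
                      include_metadata=True)
```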

For that, I'm guessing there might already be a cloud provider that solves steps 1 to 3 behind an API, just by passing in the model params, right? And if I were to self-host, what kind of instance/runtime would you suggest?

Thanks again for your time and dedication!