satishsilveri / Semantic-Search


Still not clear what is so good about Elasticsearch? #1

Open Sandy4321 opened 10 months ago

Sandy4321 commented 10 months ago

It is still not clear what is so good about Elasticsearch? https://medium.com/nerd-for-tech/enhancing-faq-search-engines-harnessing-the-power-of-knn-in-elasticsearch-76076f670580

Sandy4321 commented 10 months ago


https://medium.com/@mutahar789/optimizing-rag-a-guide-to-choosing-the-right-vector-database-480f71a33139

Full-text Search databases (ElasticSearch, OpenSearch)

Full-text search databases, such as ElasticSearch and OpenSearch, excel in facilitating comprehensive text search and enabling advanced analytics. However, they underperform compared to dedicated vector databases when it comes to conducting vector similarity searches and managing high-dimensional data. These databases often require augmentation with other tools for semantic search since they do not make use of vector indexing, but only rely on inverted index. The performance of Elasticsearch lags behind that of Weaviate, Milvus, and Qdrant, as evidenced by the results from the Qdrant benchmarks. Notably, Elasticsearch exhibits significant latency and limited throughput across all three datasets employed for benchmarking purposes.

satishsilveri commented 10 months ago

Hi Sandy,

The purpose of this repository is to explore and work with the different technologies involved in the semantic search space. I don't intend to do a comparative analysis specifically for the article/notebook in question. I just wanted to demonstrate Elasticsearch's kNN hybrid search capabilities, as I have worked with Elastic and Solr in the past. But to answer your question, the following thoughts come to mind:

  1. Selecting a vector database depends on many factors, including the type of organization you are part of. For example, if you work for a startup, it's much easier to introduce a new tech stack in a limited amount of time than in a large organization, where you might need to consider risk factors, cost, and integration challenges with existing technologies.
  2. Based on my understanding and my experience working with different vector databases, the most popular algorithm for performing vector search is HNSW, which is a fundamental part of the architecture of most vector databases. The algorithm is the same, but the features and functionality might differ from database to database.
  3. Databases like Elastic and Solr, which are both built on the Lucene index, support the HNSW algorithm for semantic search. The main difference is that Elastic is commercial while Solr is open source. There is a really nice paper (in a way, a rebuttal) demonstrating that there is no need for a dedicated/specialized vector database to set up a semantic search pipeline, unless your pipeline needs a specific feature that only these specialized databases offer.
  4. When it comes to latency, yes, some databases might outperform others, but the question to ask is whether the latency provided by Solr or Elastic makes sense for you. For example, say the target latency for your search system is 1 second. You then need to check that the vector search itself does not take more than 40% (or at most 50%) of that 1 second; the rest goes to generating query embeddings and some buffer for the network. This can be achieved by scaling the infrastructure vertically and horizontally.
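
As a rough illustration of the budget check in point 4, here is a minimal Python sketch. The function names and the default 40% share are hypothetical and only mirror the example numbers above; the measured latency would have to come from your own benchmarks.

```python
def vector_search_budget_ms(target_ms: float, share: float = 0.4) -> float:
    """Return the slice of the end-to-end latency target reserved for vector search.

    `share` is the assumed fraction of the total budget (0.4 = 40%); the
    remainder covers query embedding generation and network buffer.
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return target_ms * share


def within_budget(measured_ms: float, target_ms: float, share: float = 0.4) -> bool:
    """Check whether a measured vector search latency fits inside its slice."""
    return measured_ms <= vector_search_budget_ms(target_ms, share)


if __name__ == "__main__":
    target_ms = 1000.0  # 1 second end-to-end target, as in the example above
    for measured_ms in (350.0, 450.0):  # hypothetical measurements
        print(measured_ms, within_budget(measured_ms, target_ms))
```

If a measured latency falls outside the slice, that is the point at which scaling the cluster (or relaxing the share to 50%) becomes the question to answer.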

To summarize, you can choose whatever database you want for semantic search based on your budget, requirements, tech stack integration, and other parameters, but the underlying algorithm remains the same. You can implement a low-latency search platform using open-source technologies like Solr without pouring money into a specialized database. The goal should be to implement a solution cost-effectively without losing too many capabilities. I demonstrated Elastic because I work with it on a daily basis, so it was easy to integrate, but you can easily translate the semantic search setup to an open-source technology like Solr.
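
For reference, a hybrid (keyword plus kNN) request in Elasticsearch can be sketched roughly as below. This is only an illustration assuming an Elasticsearch 8.x search API that accepts a top-level `knn` section next to `query`; the field names `question` and `question_vector` are hypothetical and not necessarily the ones used in the notebook. The function just builds the request body and does not talk to a cluster.

```python
from typing import Any, Dict, List


def build_hybrid_search_body(
    query_text: str,
    query_vector: List[float],
    k: int = 10,
    num_candidates: int = 100,
) -> Dict[str, Any]:
    """Build a search body combining a lexical match with an approximate kNN clause.

    Assumes a text field `question` and a dense_vector field `question_vector`
    (both hypothetical names) indexed for kNN search.
    """
    if num_candidates < k:
        raise ValueError("num_candidates must be >= k")
    return {
        # Lexical part: scored against the inverted index
        "query": {"match": {"question": query_text}},
        # Vector part: approximate nearest-neighbour search over the HNSW graph
        "knn": {
            "field": "question_vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "size": k,
    }


if __name__ == "__main__":
    # The vector here is a placeholder; in practice it comes from the embedding model.
    print(build_hybrid_search_body("how do I reset my password", [0.1, 0.2, 0.3], k=5, num_candidates=50))
```

The same idea (one lexical clause, one vector clause over an HNSW-backed field) carries over to Solr, although the query syntax there is different.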