infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Question]: Questions about vector retrieval and keyword retrieval in es #1899

Open muzhi1991 opened 2 months ago

muzhi1991 commented 2 months ago

Describe your problem

I noticed that the project uses Elasticsearch (es) as a hybrid search solution, combining keyword search and vector search. In a simple test case, I found that es often failed to recall any results. Reading the code, I found that the vector search is applied together with a query filter (a prerequisite that 60% of the keywords match), which seems to weaken the effect of the vector search. What was the reasoning behind this design? Or have I misunderstood?

https://github.com/infiniflow/ragflow/blob/fdd5b1b8cf58e3808cb3d47fd0731be40fc32d97/rag/nlp/search.py#L132
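To make the concern concrete, here is a minimal toy sketch of a keyword prefilter applied before vector ranking. It is not the code from `search.py`: the function names, the sample documents, and the way the 60% threshold is computed are assumptions for illustration only.

```python
import math

def keyword_hit_ratio(query_terms, doc_terms):
    """Fraction of distinct query terms that also appear in the document."""
    query_set = set(query_terms)
    if not query_set:
        return 0.0
    return len(query_set & set(doc_terms)) / len(query_set)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_vector_search(query_terms, query_vec, docs, min_ratio=0.6):
    """Rank by cosine similarity, but only among docs that pass the keyword filter.

    min_ratio=0.6 mimics the "60% keyword hits" prerequisite (hypothetical form).
    """
    candidates = [
        d for d in docs
        if keyword_hit_ratio(query_terms, d["terms"]) >= min_ratio
    ]
    return sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)

# Made-up documents: both are close to the query in vector space,
# but share few or no literal terms with it.
docs = [
    {"id": "a", "terms": ["car", "engine", "repair"], "vec": [0.9, 0.1]},
    {"id": "b", "terms": ["automobile", "motor", "fix"], "vec": [0.88, 0.12]},
]
query_terms = ["vehicle", "engine", "maintenance"]
query_vec = [0.9, 0.1]

with_filter = filtered_vector_search(query_terms, query_vec, docs)
without_filter = filtered_vector_search(query_terms, query_vec, docs, min_ratio=0.0)
```

In this toy setup `with_filter` is empty, because neither document matches 60% of the query terms, while `without_filter` returns both documents ranked by similarity. That is the behavior I seem to be seeing in my test case.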

KevinHuSh commented 2 months ago

First, vector search is computationally costly; that's what the filter is for. Second, we noticed that vector search is not precise enough. I guess that's why Google primarily uses keyword search instead of vector search.

muzhi1991 commented 2 months ago

Thanks for your reply. In most current RAG solutions, hybrid search generally combines traditional keyword search with vector retrieval (ANN): the two run simultaneously, and their results are then merged with a fusion algorithm such as Reciprocal Rank Fusion (RRF). Using keyword filtering as a precondition does not seem optimal, especially when the wording of the user's query differs greatly from that of the document.
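For reference, a minimal sketch of RRF over two independently produced rankings might look like the following. The function name, the document IDs, and the two sample rankings are made up for illustration; `k=60` is the constant commonly used with RRF, not a value taken from this project.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from a keyword query and a vector (ANN) query,
# each run on its own, without one filtering the other.
keyword_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d4", "d3", "d1"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

Here `d4` is found only by the vector query but still appears in the fused list, so a document that shares no keywords with the query is not lost.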