Embedding Search can't recall the right answer

cdg1921 commented 3 months ago

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.6.11

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

When using Embedding Search (the vector database is Weaviate), the recalled items never include the exact answer. Economy search and full-text search can easily get the right answer, while embedding search (semantic search) can't. We've tried many different queries and always hit the same problem.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 3 months ago

Hello @cdg1921! 👋 I'm here to assist you with bug solving, answering questions, and even becoming a contributor. Let's tackle this issue together!

The issue with the Embedding Search feature in Dify version 0.6.11, self-hosted using Weaviate, failing to recall items that match the exact answers could be due to several factors:

Segmentation and Cleaning: The document segmentation and cleaning process might not be optimal, leading to poor embedding quality. Proper segmentation and cleaning are crucial for ensuring high-quality embeddings that can accurately match user queries.
Indexing Method: The indexing method chosen might not be suitable for exact answer retrieval. The High-Quality Mode, which uses OpenAI's embedding interface, might not be as effective for exact matches compared to keyword-based indexing used in Economy Mode.
Retrieval Settings: The retrieval settings for Vector Search might not be configured correctly. The TopK value and Score threshold settings need to be fine-tuned to ensure that the most relevant text chunks are retrieved.
Rerank Model: The absence or misconfiguration of the Rerank model can affect the quality of the results. The Rerank model helps in semantically reranking the recalled document results to optimize the ranking.

To improve the recall of exact answers using Embedding Search, consider reviewing and optimizing the segmentation, cleaning, indexing method, and retrieval settings. Additionally, ensure that the reranking model is properly configured and enabled if necessary ^[1]^[2].
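
To make the TopK and score threshold settings mentioned above concrete, here is a minimal sketch of how the two interact in a vector search. It is not Dify's or Weaviate's implementation; the `retrieve` function, the chunk IDs, and the 2-d vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunks, top_k=3, score_threshold=0.0):
    """Return up to top_k (chunk_id, score) pairs scoring at least score_threshold."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Keep the top_k best candidates, then drop any below the threshold.
    return [(cid, score) for cid, score in scored[:top_k] if score >= score_threshold]

# Toy 2-d vectors standing in for real embeddings (hypothetical data).
chunks = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
query = [1.0, 0.1]

hits = retrieve(query, chunks, top_k=2, score_threshold=0.5)
strict_hits = retrieve(query, chunks, top_k=3, score_threshold=0.9)
```

A threshold that is too strict can filter out the correct chunk even when it ranks well, so both values are worth checking together.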

To continue talking to Dosu, mention @dosu.

cdg1921 commented 3 months ago

@dosu Hi, dosu. Thank you for your advice.
We created a new knowledge base in Weaviate, using the same document segments taken from the document_segments table in Dify's backend database, with the same dataset_id. The embedding model is also the same, bge-large-zh-v1.5. We wrote the recall code independently, and it easily recalls the right item in the top 5, but in Dify none of the top 10 items is correct.
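
A comparison like this can be made measurable by computing recall@k for both pipelines over the same set of queries. The sketch below is only an illustration: the segment IDs and ranked lists are made up, and the function names are hypothetical.

```python
def hit_at_k(ranked_ids, expected_id, k):
    """True if the expected segment appears among the first k results."""
    return expected_id in ranked_ids[:k]

def recall_at_k(results, k):
    """results is a list of (ranked_ids, expected_id) pairs, one per query."""
    if not results:
        return 0.0
    hits = sum(hit_at_k(ranked, expected, k) for ranked, expected in results)
    return hits / len(results)

# Hypothetical ranked segment IDs for the same two queries in each pipeline.
standalone_results = [(["s3", "s1", "s9"], "s3"), (["s2", "s7", "s4"], "s7")]
dify_results = [(["s8", "s5", "s6"], "s3"), (["s7", "s1", "s4"], "s7")]

standalone_recall = recall_at_k(standalone_results, k=3)
dify_recall = recall_at_k(dify_results, k=3)
```

If the segments and queries are identical and the scores still differ this much, the embedding vectors themselves are the next thing to compare.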

dosubot[bot] commented 3 months ago

It seems like the issue might be related to how Dify is handling the embedding search and ranking compared to your independent implementation. Here are a few areas to investigate:

Configuration Consistency: Ensure that the configuration settings in Dify for the embedding model, segmentation, and retrieval are identical to those in your independent implementation. This includes the WeaviateConfig, batch size, and any other relevant settings.
Rerank Model: Verify if Dify is using a rerank model and if it is properly configured. The rerank model can significantly impact the ranking of the search results. Ensure that the rerank model is enabled and correctly set up in Dify.
TopK and Score Threshold: Check the TopK value and score threshold settings in Dify. These settings determine how many top results are retrieved and filtered based on their similarity scores. Fine-tuning these parameters might help improve the recall accuracy.
Schema and Indexing: Ensure that the schema and indexing in Weaviate are correctly set up and consistent with your independent implementation. Any discrepancies in the schema or indexing process can lead to different search results.
Cross-Domain Issues: If there are any cross-domain configurations or CORS settings in Dify, ensure they are correctly set up. Cross-domain issues can sometimes affect the retrieval process.
Logs and Debugging: Check the logs and debugging information in Dify to identify any potential issues or errors during the embedding search process. This can provide insights into what might be going wrong.

By carefully reviewing and aligning these aspects, you should be able to identify and resolve the discrepancy between Dify and your independent implementation. If the issue persists, further investigation into the specific error messages or logs would be necessary.


cdg1921 commented 3 months ago

We've found the reason: the bge-large-zh-v1.5 embedding model from Ollama is a quantized version, which performs poorly. Thank you all the same!
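
As a rough analogy for why reduced precision can hurt retrieval, the toy sketch below quantizes a vector to a small number of levels and measures how far it drifts from the original. This is an assumption-laden illustration only: Ollama quantizes model weights with its own schemes, not output vectors like this, and the `quantize` helper and random data are hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def quantize(vec, bits):
    """Symmetric uniform quantization to signed `bits`-bit levels, mapped back to floats."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits, 1 for 2 bits
    scale = max(abs(x) for x in vec) / levels
    if scale == 0:
        return list(vec)
    return [round(x / scale) * scale for x in vec]

random.seed(0)
# Random stand-in for an embedding vector (hypothetical data).
vec = [random.gauss(0.0, 1.0) for _ in range(64)]

fidelity_8bit = cosine(vec, quantize(vec, 8))
fidelity_2bit = cosine(vec, quantize(vec, 2))
```

The coarser the quantization, the lower the similarity to the full-precision vector, which is one way near neighbours can get reordered in a search.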

lingaoan2024 commented 2 months ago

@cdg1921 So you didn't use Ollama in the end? What solution did you use instead?

cdg1921 commented 2 months ago

You can still use Ollama, just deploy an fp16 version of the model.
