langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

tencent_vector has a bug for knowledge retrieval #6205

Open lennydou opened 1 month ago

lennydou commented 1 month ago

Dify version

0.6.13

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

Use Tencent vector DB as the vector database.

  1. Create a Knowledge and upload a text file.
  2. Copy a sentence directly from the file and run knowledge hit-testing with it.
  3. No doc segments are retrieved.

REASON: In the pasted picture (1_WechatIMG74), I found that the score in the results returned from the Tencent DB may be greater than 1. That makes the converted score negative, so the branch below is never entered:

    score = 1 - result.get("score", 0.0)
    if score > score_threshold:  # NOTICE: not entered here

The code file: core/rag/datasource/vdb/tencent/tencent_vector.py
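To make the failure concrete, here is a minimal sketch of the same conversion in isolation. The helper names `to_similarity` and `passes_threshold` and the sample values 0.2, 1.2 and 0.5 are hypothetical, chosen only for illustration; they are not part of Dify or of the Tencent SDK.

```python
def to_similarity(raw_score: float) -> float:
    """Mirror the conversion quoted in the issue: similarity = 1 - raw score."""
    return 1 - raw_score


def passes_threshold(raw_score: float, score_threshold: float) -> bool:
    """Return True when the converted score is strictly above the threshold."""
    return to_similarity(raw_score) > score_threshold


# Hypothetical values: once the raw score exceeds 1, the converted score is
# negative, so the segment is rejected for any non-negative threshold.
print(passes_threshold(0.2, 0.5))  # raw score within [0, 1]
print(passes_threshold(1.2, 0.5))  # raw score above 1
print(passes_threshold(1.2, 0.0))  # still rejected with a zero threshold
```

With these assumed inputs, only the first call passes the check, which matches the "not entered here" behavior described above.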

✔️ Expected Behavior

When a sentence copied directly from the file is used for knowledge hit-testing, the matching doc segments should always be retrieved.

❌ Actual Behavior

Sometimes no doc segments are retrieved.

dosubot[bot] commented 1 month ago

To fix the issue in the tencent_vector module for knowledge retrieval in Dify version 0.6.13, where the score in results from the Tencent database may be greater than 1, you need to adjust the score calculation in the _get_search_res method. Specifically, you should ensure that the score is normalized or capped to a maximum of 1.

Here is the modified _get_search_res method:

def _get_search_res(self, res, score_threshold):
    docs = []
    if res is None or len(res) == 0:
        return docs

    for result in res[0]:
        meta = result.get(self.field_metadata)
        if meta is not None:
            meta = json.loads(meta)
        score = 1 - result.get("score", 0.0)
        score = min(score, 1.0)  # Ensure the score does not exceed 1
        if score > score_threshold:
            meta["score"] = score
            doc = Document(page_content=result.get(self.field_text), metadata=meta)
            docs.append(doc)

    return docs

This change ensures that the score is capped at 1, preventing any issues with document segments not being retrieved during knowledge hit-testing [1].

Additionally, you can refer to the integration tests provided in api/tests/integration_tests/vdb/tcvectordb/test_tencent.py to ensure that your changes are correctly implemented and tested. The tests include methods like search_by_vector and search_by_full_text which can help verify the functionality of the TencentVector module [2].


lennydou commented 1 month ago

    score = 1 - result.get("score", 0.0)
    score = min(score, 1.0)  # Ensure the score does not exceed 1
    if score > score_threshold:

This still does not work. Capping the converted score at 1 has no effect, because the problem is that it drops below 0 when the raw score is greater than 1. The code should be changed as follows:

    score = 1 - min(result.get("score", 0.0), 1.0)  # Ensure the score is not below 0
    if score >= score_threshold:

LeoLiuYan commented 3 weeks ago

Why does the Tencent DB return scores greater than 1? Would it be better to ask the Tencent vector DB team? @lennydou
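For reference, the clamped conversion proposed earlier in this thread can be sketched as a standalone helper. This is only an illustration under stated assumptions: `clamped_similarity` and `filter_hits` are hypothetical names, the raw scores are assumed to be non-negative, and treating any raw score above 1 as exactly 1 is the thread's proposal rather than documented Tencent behavior.

```python
from typing import Iterable, List


def clamped_similarity(raw_score: float) -> float:
    """Convert a raw score to 1 - raw, clamping the raw value at 1.0.

    Assumption: raw scores are non-negative, so the result stays in [0, 1].
    """
    return 1 - min(raw_score, 1.0)


def filter_hits(raw_scores: Iterable[float], score_threshold: float = 0.0) -> List[float]:
    """Keep converted scores that are greater than or equal to the threshold."""
    converted = (clamped_similarity(raw) for raw in raw_scores)
    return [score for score in converted if score >= score_threshold]


# Hypothetical raw scores: one inside [0, 1] and one above 1.
print(filter_hits([0.2, 1.2], score_threshold=0.0))
print(filter_hits([0.2, 1.2], score_threshold=0.5))
```

Note that with clamping, a raw score above 1 maps to a converted score of 0, so such a segment is returned only when the threshold is 0. Whether that is the right interpretation depends on what the Tencent scores actually mean, which this thread leaves open.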