tencent_vector has a bug for knowledge retrieval

lennydou commented 1 month ago

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.6.13

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

Use Tencent vector DB as the vector database.

Create a Knowledge and upload a text file.
Copy a sentence directly from the file and run knowledge hit-testing.
No document segments are retrieved.

REASON: In the pasted screenshot (1_WechatIMG74), I found that the score in the results returned from Tencent vector DB may be greater than 1. That makes the converted score negative, so the threshold check below is never entered:

    score = 1 - result.get("score", 0.0)
    if score > score_threshold:  # NOTICE: not entered here

the code file: core/rag/datasource/vdb/tencent/tencent_vector.py
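The failure described above can be reproduced in isolation. This is only a minimal sketch: the raw score of 1.05 and the threshold of 0.0 are hypothetical values chosen for illustration, and `converted_score` is a hypothetical helper that mirrors the conversion quoted above.

```python
def converted_score(raw_score: float) -> float:
    """Mirror the conversion quoted in the report: 1 - raw score."""
    return 1 - raw_score


# Hypothetical values: a raw score slightly above 1 and a threshold of 0.0.
raw_score = 1.05
score_threshold = 0.0

score = converted_score(raw_score)
passes = score > score_threshold

# The converted score is negative, so the segment is filtered out.
print(f"converted score: {score:.4f}, passes threshold: {passes}")
```

Any raw score above 1 produces a negative converted score, so the segment is dropped even with the lowest non-negative threshold.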

✔️ Expected Behavior

Copying a sentence directly from the file and running knowledge hit-testing should always retrieve the matching document segments.

❌ Actual Behavior

Sometimes no document segments are retrieved.

dosubot[bot] commented 1 month ago

To fix the issue in the tencent_vector module for knowledge retrieval in Dify version 0.6.13, where the score in results from the Tencent database may be greater than 1, you need to adjust the score calculation in the _get_search_res method. Specifically, you should ensure that the score is normalized or capped to a maximum of 1.

Here is the modified _get_search_res method:

def _get_search_res(self, res, score_threshold):
    docs = []
    if res is None or len(res) == 0:
        return docs

    for result in res[0]:
        meta = result.get(self.field_metadata)
        if meta is not None:
            meta = json.loads(meta)
        score = 1 - result.get("score", 0.0)
        score = min(score, 1.0)  # Ensure the score does not exceed 1
        if score > score_threshold:
            meta["score"] = score
            doc = Document(page_content=result.get(self.field_text), metadata=meta)
            docs.append(doc)

    return docs

This change ensures that the score is capped at 1, preventing any issues with document segments not being retrieved during knowledge hit-testing ^[1].

Additionally, you can refer to the integration tests provided in api/tests/integration_tests/vdb/tcvectordb/test_tencent.py to ensure that your changes are correctly implemented and tested. The tests include methods like search_by_vector and search_by_full_text which can help verify the functionality of the TencentVector module ^[2].


lennydou commented 1 month ago

    score = 1 - result.get("score", 0.0)
    score = min(score, 1.0)  # Ensure the score does not exceed 1
    if score > score_threshold:

    This still does not work. The code should be changed as follows:
    score = 1 - min(result.get("score", 0.0), 1.0)  # Clamp the raw score to 1 so the converted score is never < 0
    if score >= score_threshold:
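Put together, the change proposed in this comment might look like the standalone sketch below. It is not the actual Dify implementation: the `Document` dataclass is a hypothetical stand-in for Dify's own class, the function is written as a free function rather than a method, and the default field names `"metadata"` and `"text"` are assumptions. It also falls back to an empty dict when a result carries no metadata, which the snippets in this thread do not handle.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for Dify's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def get_search_res(res, score_threshold, field_metadata="metadata", field_text="text"):
    """Convert raw search results into documents, clamping raw scores to 1."""
    docs = []
    if not res:
        return docs

    for result in res[0]:
        raw_meta = result.get(field_metadata)
        # Assumption: treat missing metadata as an empty dict instead of failing.
        meta = json.loads(raw_meta) if raw_meta is not None else {}
        # Clamp the raw score before subtracting so the result is never negative.
        score = 1 - min(result.get("score", 0.0), 1.0)
        if score >= score_threshold:
            meta["score"] = score
            docs.append(Document(page_content=result.get(field_text), metadata=meta))

    return docs


# Hypothetical results: one raw score above 1, one without metadata.
sample = [[
    {"metadata": '{"doc_id": "a"}', "text": "first segment", "score": 1.05},
    {"text": "second segment", "score": 0.2},
]]
hits = get_search_res(sample, score_threshold=0.0)
print([(d.page_content, d.metadata["score"]) for d in hits])
```

With the clamp applied before the subtraction and `>=` in the comparison, a raw score above 1 maps to 0.0 and still passes a threshold of 0.0.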

LeoLiuYan commented 3 weeks ago

Why does Tencent vector DB return scores greater than 1? Would it be better to ask the Tencent vector DB team? @lennydou

langgenius / dify