vector/hybrid search failed for MyScale when score_threshold is not set

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.7.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Choose MyScale as the vector database.
Create a knowledge base and import any documents.
Run hit testing with vector search.

✔️ Expected Behavior

get the search results back.

❌ Actual Behavior

hit testing failed with error like this:

api-1         | 2024-08-26 05:15:17,414.414 ERROR [Dummy-5] [hit_testing.py:82] - Hit testing failed.
api-1         | Traceback (most recent call last):
api-1         |   File "/app/api/controllers/console/datasets/hit_testing.py", line 56, in post
api-1         |     response = HitTestingService.retrieve(
api-1         |   File "/app/api/services/hit_testing_service.py", line 41, in retrieve
api-1         |     all_documents = RetrievalService.retrieve(retrival_method=retrieval_model.get('search_method', 'semantic_search'),
api-1         |   File "/app/api/core/rag/datasource/retrieval_service.py", line 90, in retrieve
api-1         |     raise Exception(exception_message)
api-1         | Exception: '>' not supported between instances of 'NoneType' and 'float'
api-1         | 2024-08-26 05:15:17,415.415 ERROR [Dummy-5] [app.py:838] - Exception on /console/api/datasets/7b9c2777-1505-440b-a4ae-ffb52bc8004e/hit-testing [POST]
api-1         | Traceback (most recent call last):
api-1         |   File "/app/api/controllers/console/datasets/hit_testing.py", line 56, in post
api-1         |     response = HitTestingService.retrieve(
api-1         |   File "/app/api/services/hit_testing_service.py", line 41, in retrieve
api-1         |     all_documents = RetrievalService.retrieve(retrival_method=retrieval_model.get('search_method', 'semantic_search'),
api-1         |   File "/app/api/core/rag/datasource/retrieval_service.py", line 90, in retrieve
api-1         |     raise Exception(exception_message)
api-1         | Exception: '>' not supported between instances of 'NoneType' and 'float'

@lqhl Hello! I'm here to assist you with any bugs, questions, or contributions. Let's work together to resolve the issue you're facing.

I found a similar issue, "dataset multiple retrieval, if score_threshold none, will return all documents, but now empty", which was closed by the pull request "fix score_threshold is none, return all top K documents" on August 23, 2024 [1].

The failure you're seeing with vector/hybrid search on MyScale in Dify 0.7.1 when score_threshold is not set originates in how the _search method builds where_str. The where_str clause is only set when score_threshold is greater than 0.0; otherwise it is an empty string and no similarity filter is applied [2].

Here is the relevant code snippet:

def _search(self, dist: str, order: SortOrder, **kwargs: Any) -> list[Document]:
    top_k = kwargs.get("top_k", 5)
    score_threshold = kwargs.get("score_threshold", 0.0)
    where_str = f"WHERE dist < {1 - score_threshold}" if \
        self._metric.upper() == "COSINE" and order == SortOrder.ASC and score_threshold > 0.0 else ""
    sql = f"""
        SELECT text, vector, metadata, {dist} as dist FROM {self._config.database}.{self._collection_name}
        {where_str} ORDER BY dist {order.value} LIMIT {top_k}
    """
    try:
        return [
            Document(
                page_content=r["text"],
                vector=r['vector'],
                metadata=r["metadata"],
            )
            for r in self._client.query(sql).named_results()
        ]
    except Exception as e:
        logging.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
        return []

The where_str clause is only added when score_threshold is greater than 0.0. Note that kwargs.get("score_threshold", 0.0) falls back to 0.0 only when the key is missing; if score_threshold is passed as None, the comparison score_threshold > 0.0 raises the TypeError shown in the traceback instead of leaving where_str empty.
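
To make the failure concrete, here is a minimal sketch of how the TypeError from the traceback can arise in the first lines of the snippet. It assumes the caller passes score_threshold=None explicitly when the threshold is turned off; that is an assumption based on the error message, not something confirmed from the Dify source. The kwargs dictionary is a stand-in for the keyword arguments that _search receives.

```python
# Sketch only: `kwargs` stands in for the keyword arguments received by _search.
# Assumption: the caller passes score_threshold=None when the threshold is off.
kwargs = {"top_k": 5, "score_threshold": None}

# dict.get uses its default only when the key is missing,
# not when the key is present with a value of None.
score_threshold = kwargs.get("score_threshold", 0.0)

try:
    score_threshold > 0.0  # same comparison as in the where_str condition
    error_message = None
except TypeError as e:
    error_message = str(e)

print(error_message)
# prints: '>' not supported between instances of 'NoneType' and 'float'
```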

Additionally, according to the Dify documentation, the Score Threshold is used to set the similarity threshold for filtering text fragments. By default, this setting is turned off, meaning it does not filter the similarity values of recalled text fragments. When enabled, the default value is 0.5. This default behavior of having the Score Threshold turned off can lead to the issues observed when it is not explicitly set [3].

To resolve this issue, you can either set the score_threshold explicitly or update to a version where this bug is fixed.
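
As a rough illustration of a defensive approach, the sketch below normalizes a missing or None threshold to 0.0 before building the filter clause. The helper names normalize_score_threshold and build_where_clause are hypothetical and exist only for this example; this is not necessarily how the actual fix is implemented.

```python
from typing import Any, Optional


def normalize_score_threshold(kwargs: dict[str, Any]) -> float:
    """Return a float threshold, treating a missing or None value as 0.0.

    Hypothetical helper, not part of the Dify codebase.
    """
    value: Optional[float] = kwargs.get("score_threshold")
    return 0.0 if value is None else float(value)


def build_where_clause(metric: str, ascending: bool, score_threshold: float) -> str:
    """Mirror the condition from the snippet above: filter only for the
    cosine metric, ascending order, and a positive threshold.

    Hypothetical helper, not part of the Dify codebase.
    """
    if metric.upper() == "COSINE" and ascending and score_threshold > 0.0:
        return f"WHERE dist < {1 - score_threshold}"
    return ""


# A None threshold no longer raises; it simply produces no filter.
no_filter = build_where_clause("cosine", True, normalize_score_threshold({"score_threshold": None}))
with_filter = build_where_clause("cosine", True, normalize_score_threshold({"score_threshold": 0.5}))
```

With these helpers, a None threshold yields an empty clause, and a threshold of 0.5 yields WHERE dist < 0.5.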

To continue talking to Dosu, mention @dosu.

The bot is right, and I've submitted a fix: #7640

langgenius / dify