langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

vector/hybrid search failed for MyScale when score_threshold is not set #7638

Closed lqhl closed 3 weeks ago

lqhl commented 3 weeks ago

Dify version

0.7.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Choose MyScale as the vector database (vdb).
  2. Create a knowledge base and import any documents.
  3. Try hit testing with vector search.

✔️ Expected Behavior

get the search results back.

❌ Actual Behavior

Hit testing fails with an error like this:

api-1         | 2024-08-26 05:15:17,414.414 ERROR [Dummy-5] [hit_testing.py:82] - Hit testing failed.
api-1         | Traceback (most recent call last):
api-1         |   File "/app/api/controllers/console/datasets/hit_testing.py", line 56, in post
api-1         |     response = HitTestingService.retrieve(
api-1         |   File "/app/api/services/hit_testing_service.py", line 41, in retrieve
api-1         |     all_documents = RetrievalService.retrieve(retrival_method=retrieval_model.get('search_method', 'semantic_search'),
api-1         |   File "/app/api/core/rag/datasource/retrieval_service.py", line 90, in retrieve
api-1         |     raise Exception(exception_message)
api-1         | Exception: '>' not supported between instances of 'NoneType' and 'float'
api-1         | 2024-08-26 05:15:17,415.415 ERROR [Dummy-5] [app.py:838] - Exception on /console/api/datasets/7b9c2777-1505-440b-a4ae-ffb52bc8004e/hit-testing [POST]
api-1         | Traceback (most recent call last):
api-1         |   File "/app/api/controllers/console/datasets/hit_testing.py", line 56, in post
api-1         |     response = HitTestingService.retrieve(
api-1         |   File "/app/api/services/hit_testing_service.py", line 41, in retrieve
api-1         |     all_documents = RetrievalService.retrieve(retrival_method=retrieval_model.get('search_method', 'semantic_search'),
api-1         |   File "/app/api/core/rag/datasource/retrieval_service.py", line 90, in retrieve
api-1         |     raise Exception(exception_message)
api-1         | Exception: '>' not supported between instances of 'NoneType' and 'float'
dosubot[bot] commented 3 weeks ago

@lqhl Hello! I'm here to assist you with any bugs, questions, or contributions. Let's work together to resolve the issue you're facing.

I found a similar issue: "dataset multiple retrieval, if score_threshold none, will return all documents, but now empty", which was closed by the pull request "fix score_threshold is none, return all top K documents" on August 23, 2024 [1].

The vector/hybrid search failure for MyScale in Dify version 0.7.1 originates in the _search method, in how the where_str condition handles a score_threshold that is not provided. The where_str is only set if score_threshold is greater than 0.0, and it is an empty string otherwise. The error in the traceback, '>' not supported between instances of 'NoneType' and 'float', matches that comparison: it is what Python raises when score_threshold is None at the point it is compared with 0.0 [2].

Here is the relevant code snippet:

def _search(self, dist: str, order: SortOrder, **kwargs: Any) -> list[Document]:
    top_k = kwargs.get("top_k", 5)
    score_threshold = kwargs.get("score_threshold", 0.0)
    where_str = f"WHERE dist < {1 - score_threshold}" if \
        self._metric.upper() == "COSINE" and order == SortOrder.ASC and score_threshold > 0.0 else ""
    sql = f"""
        SELECT text, vector, metadata, {dist} as dist FROM {self._config.database}.{self._collection_name}
        {where_str} ORDER BY dist {order.value} LIMIT {top_k}
    """
    try:
        return [
            Document(
                page_content=r["text"],
                vector=r['vector'],
                metadata=r["metadata"],
            )
            for r in self._client.query(sql).named_results()
        ]
    except Exception as e:
        logging.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
        return []

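The where_str is only added if score_threshold is greater than 0.0. Note that kwargs.get("score_threshold", 0.0) falls back to 0.0 only when the key is absent. If the caller passes score_threshold=None explicitly, which appears to be the case here, the value stays None and the comparison score_threshold > 0.0 raises the TypeError shown in the traceback before any SQL is built.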
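One way the comparison in the snippet above can fail is sketched below. It assumes the caller passes score_threshold=None explicitly rather than omitting the key; read_threshold is a hypothetical helper used only for illustration and is not part of Dify.

```python
def read_threshold(**kwargs):
    # Same lookup pattern as in the quoted _search method.
    return kwargs.get("score_threshold", 0.0)


# Key omitted: the default of 0.0 applies.
print(read_threshold())

# Key present but set to None: the default is NOT applied.
print(read_threshold(score_threshold=None))

# Comparing None with a float raises TypeError, as in the traceback.
try:
    read_threshold(score_threshold=None) > 0.0
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The default argument of dict.get only covers a missing key, so a None value supplied by the caller reaches the comparison unchanged.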

Additionally, according to the Dify documentation, the Score Threshold is used to set the similarity threshold for filtering text fragments. By default, this setting is turned off, meaning it does not filter the similarity values of recalled text fragments. When enabled, the default value is 0.5. This default behavior of having the Score Threshold turned off can lead to the issues observed when it is not explicitly set [3].

To resolve this issue, you can either set the score_threshold explicitly or update to a version where this bug is fixed.
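As a rough illustration of a None-safe variant, the sketch below normalizes the threshold before comparing it. The helper name build_where_clause and its parameters are assumptions made for this example; this is not necessarily how the actual fix is implemented.

```python
from typing import Optional


def build_where_clause(
    score_threshold: Optional[float],
    metric: str = "COSINE",
    ascending: bool = True,
) -> str:
    """Build the optional WHERE clause, treating None as 'no threshold'."""
    # Normalize None (threshold not set) to 0.0 before any comparison.
    threshold = score_threshold or 0.0
    if metric.upper() == "COSINE" and ascending and threshold > 0.0:
        return f"WHERE dist < {1 - threshold}"
    return ""


print(repr(build_where_clause(None)))  # no filter, no TypeError
print(repr(build_where_clause(0.5)))   # filter on cosine distance
```

With this normalization, an unset threshold simply produces no filter, so the query returns the top_k results ordered by distance.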


lqhl commented 3 weeks ago

The bot is right, and I've submitted a fix: #7640