Closed lqhl closed 3 weeks ago
@lqhl Hello! I'm here to assist you with any bugs, questions, or contributions. Let's work together to resolve the issue you're facing.
I found a similar issue, "dataset multiple retrieval, if score_threshold none, will return all documents, but now empty", which was closed by the pull request "fix score_threshold is none, return all top K documents" on August 23, 2024 [1].
The issue you're encountering with vector/hybrid search failing for MyScale when `score_threshold` is not set in Dify version 0.7.1 is due to the `where_str` condition in the `_search` method not being properly constructed when `score_threshold` is not provided. The `where_str` is only set if `score_threshold` is greater than 0.0, which means it defaults to an empty string otherwise. This can lead to incorrect or incomplete SQL queries [2].
Here is the relevant code snippet:

    def _search(self, dist: str, order: SortOrder, **kwargs: Any) -> list[Document]:
        top_k = kwargs.get("top_k", 5)
        score_threshold = kwargs.get("score_threshold", 0.0)
        where_str = f"WHERE dist < {1 - score_threshold}" if \
            self._metric.upper() == "COSINE" and order == SortOrder.ASC and score_threshold > 0.0 else ""
        sql = f"""
            SELECT text, vector, metadata, {dist} as dist FROM {self._config.database}.{self._collection_name}
            {where_str} ORDER BY dist {order.value} LIMIT {top_k}
        """
        try:
            return [
                Document(
                    page_content=r["text"],
                    vector=r['vector'],
                    metadata=r["metadata"],
                )
                for r in self._client.query(sql).named_results()
            ]
        except Exception as e:
            logging.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
            return []

The `where_str` is only added if `score_threshold` is greater than 0.0, which means that if `score_threshold` is not set (defaults to 0.0), the `where_str` remains empty. This can cause the search query to fail or not filter results as expected.
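
The condition above can be tried in isolation. The following is a minimal sketch, not Dify's code: `build_where_clause` is a hypothetical stand-in for the `where_str` expression, and whether Dify actually passes `score_threshold=None` to `_search` is an assumption not confirmed in this thread. It shows that a missing key falls back to 0.0, while a key that is present with the value `None` bypasses the default and makes the `> 0.0` comparison raise.

```python
from typing import Any


def build_where_clause(metric: str, ascending: bool, **kwargs: Any) -> str:
    """Hypothetical helper that mirrors the quoted where_str logic."""
    # dict.get only uses the default when the key is absent,
    # so an explicit None is returned unchanged.
    score_threshold = kwargs.get("score_threshold", 0.0)
    if metric.upper() == "COSINE" and ascending and score_threshold > 0.0:
        return f"WHERE dist < {1 - score_threshold}"
    return ""


# Key absent: the 0.0 default applies and no filter is added.
print(repr(build_where_clause("cosine", True)))

# Threshold set: a distance filter is emitted.
print(repr(build_where_clause("cosine", True, score_threshold=0.5)))

# Key present but None (assumed scenario): the comparison raises TypeError.
try:
    build_where_clause("cosine", True, score_threshold=None)
except TypeError as exc:
    print("TypeError:", exc)
```

Note that an empty `where_str` on its own still yields a valid query without a filter; under the assumption above, it is the `None` comparison that would make the call fail.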
Additionally, according to the Dify documentation, the Score Threshold is used to set the similarity threshold for filtering text fragments. By default, this setting is turned off, meaning it does not filter the similarity values of recalled text fragments. When enabled, the default value is 0.5. This default behavior of having the Score Threshold turned off can lead to the issues observed when it is not explicitly set [3].
To resolve this issue, you can either set the `score_threshold` explicitly or update to a version where this bug is fixed.
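
For anyone patching locally, one defensive option is to normalize the threshold before it is compared. This is only a sketch under the assumption that the unset threshold can arrive as `None`; `normalized_threshold` is a hypothetical helper and is not necessarily how the upstream fix is written.

```python
from typing import Any


def normalized_threshold(**kwargs: Any) -> float:
    """Hypothetical helper: treat a missing key and an explicit None alike."""
    value = kwargs.get("score_threshold")
    return 0.0 if value is None else float(value)


print(normalized_threshold())                      # key absent
print(normalized_threshold(score_threshold=None))  # key present, value None
print(normalized_threshold(score_threshold=0.5))   # threshold enabled
```

With the value normalized to a float, a `score_threshold > 0.0` check can no longer hit a `None` operand, and a threshold of 0.0 keeps meaning "no filter".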
To continue talking to Dosu, mention @dosu.
The bot is right, and I've submitted a fix: #7640
Self Checks
Dify version
0.7.1
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
✔️ Expected Behavior
get the search results back.
❌ Actual Behavior
hit testing failed with error like this: