langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Inquiry: Full-Text Search Support in Dify Supported Vector DBs #6880

Closed Q-Ot closed 1 month ago

Q-Ot commented 1 month ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

Hello, I am currently using Weaviate as a vector database and have run into a problem: the full-text portion consistently returns an empty list for both full-text and hybrid searches. In issue #6508, which also concerns full-text search, Dify contributors said that Weaviate does not support full-text search and recommended switching to Qdrant.

In the Dify Chinese documentation, I found that Dify supports vector databases such as Weaviate, Qdrant, Milvus, Zilliz, MyScale, and TiDB_Vector (btw, TiDB_Vector is not mentioned in the English documentation). I would like to know which of these vector databases support full-text search. Additionally, is the lack of full-text search support in Weaviate because the database itself does not provide this functionality?

2. Additional context or comments

No response

3. Can you help us with this feature?

dosubot[bot] commented 1 month ago

Hello @Q-Ot! I'm here to help with any bugs, questions, or contributions you have for the Dify repository. Let me know how I can assist you!

Milvus and Zilliz do not support full-text search. This is indicated by the comments in the search_by_full_text method:

def search_by_full_text(self, query: str, **kwargs: Any) -> list[Document]:
    # milvus/zilliz doesn't support bm25 search
    return []

Additionally, Milvus does not support full-text searching in versions below 2.3.x:

def search_by_full_text(self):
    # milvus does not support full text searching yet in < 2.3.x
    hits_by_full_text = self.vector.search_by_full_text(query=get_example_text())
    assert len(hits_by_full_text) == 0

Weaviate does support full-text search using BM25F, as shown in the search_by_full_text method:

def search_by_full_text(self, query: str, **kwargs: Any) -> list[Document]:
    """Return docs using BM25F.

    Args:
        query: Text to look up documents similar to.
        k: Number of Documents to return. Defaults to 4.

    Returns:
        List of Documents most similar to the query.
    """
    collection_name = self._collection_name
    content: dict[str, Any] = {"concepts": [query]}
    properties = self._attributes
    properties.append(Field.TEXT_KEY.value)
    if kwargs.get("search_distance"):
        content["certainty"] = kwargs.get("search_distance")
    query_obj = self._client.query.get(collection_name, properties)
    if kwargs.get("where_filter"):
        query_obj = query_obj.with_where(kwargs.get("where_filter"))
    if kwargs.get("additional"):
        query_obj = query_obj.with_additional(kwargs.get("additional"))
    properties = ['text']
    result = query_obj.with_bm25(query=query, properties=properties).with_limit(kwargs.get('top_k', 2)).do()
    if "errors" in result:
        raise ValueError(f"Error during query: {result['errors']}")
    docs = []
    for res in result["data"]["Get"][collection_name]:
        text = res.pop(Field.TEXT_KEY.value)
        docs.append(Document(page_content=text, metadata=res))
    return docs
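
A side note on the snippet above, offered as an observation rather than a diagnosis of the empty results: `properties = self._attributes` binds a second name to the same list, so `properties.append(...)` also grows `self._attributes` on every call. The sketch below reproduces that pattern with a hypothetical `Store` class (not Dify code) and shows a copy-based alternative.

```python
class Store:
    """Hypothetical stand-in for a vector store wrapper; not Dify code."""

    def __init__(self, attributes):
        self._attributes = attributes

    def props_aliasing(self):
        # Same pattern as the quoted method: no copy is made here,
        # so append() mutates the list stored on the instance.
        properties = self._attributes
        properties.append("text")
        return properties

    def props_copying(self):
        # Build a new list instead, leaving self._attributes untouched.
        return [*self._attributes, "text"]


aliasing = Store(["doc_id"])
aliasing.props_aliasing()
aliasing.props_aliasing()
print(aliasing._attributes)  # ['doc_id', 'text', 'text']

copying = Store(["doc_id"])
copying.props_copying()
copying.props_copying()
print(copying._attributes)  # ['doc_id']
```

Whether this has any bearing on the empty list reported in this thread is not established here; it only shows that the requested property list changes between calls.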

Qdrant supports full-text search, implemented in Dify using the search_by_full_text method. This method utilizes the Qdrant client's filtering capabilities to match text within the page_content field of documents:

def search_by_full_text(self, query: str, **kwargs: Any) -> list[Document]:
    """Return docs most similar by bm25.
    Returns:
        List of documents most similar to the query text and distance for each.
    """
    from qdrant_client.http import models
    scroll_filter = models.Filter(
        must=[
            models.FieldCondition(
                key="group_id",
                match=models.MatchValue(value=self._group_id),
            ),
            models.FieldCondition(
                key="page_content",
                match=models.MatchText(text=query),
            )
        ]
    )
    response = self._client.scroll(
        collection_name=self._collection_name,
        scroll_filter=scroll_filter,
        limit=kwargs.get('top_k', 2),
        with_payload=True,
        with_vectors=True
    )
    results = response[0]
    documents = []
    for result in results:
        if result:
            documents.append(self._document_from_scored_point(
                result, Field.CONTENT_KEY.value, Field.METADATA_KEY.value
            ))

    return documents
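
To make the filter logic above concrete, here is a plain-Python sketch of what the two `must` conditions express: a point is kept only if its `group_id` equals the dataset's and its `page_content` matches the query text. It does not call Qdrant. The `Point` class, the `matches_text` helper, and its whitespace tokenization are simplifying assumptions; real matching depends on how the `page_content` payload field is indexed on the Qdrant side. Note also that `scroll` returns points that pass the filter rather than a relevance ranking, so this is filtering, not BM25 scoring.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """Hypothetical stand-in for a stored point's payload."""
    group_id: str
    page_content: str


def matches_text(text: str, query: str) -> bool:
    # Assumption: every whitespace-separated query word must occur in the
    # text (case-insensitive). Real tokenization is configured in Qdrant.
    words = set(text.lower().split())
    return all(token in words for token in query.lower().split())


def scroll_like(points: list[Point], group_id: str, query: str, top_k: int = 2) -> list[Point]:
    # "must" means every condition has to hold (logical AND).
    hits = [
        p for p in points
        if p.group_id == group_id and matches_text(p.page_content, query)
    ]
    # Like the limit argument above: cap the number of returned points.
    return hits[:top_k]


points = [
    Point("g1", "Dify supports several vector databases"),
    Point("g1", "Weaviate offers BM25 keyword search"),
    Point("g2", "Dify supports hybrid search"),
]
print(scroll_like(points, "g1", "dify vector"))
```

Under these assumptions only the first point is returned: the second lacks the query words and the third belongs to a different group.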

MyScale supports full-text search, implemented as one of the retrieval options in high-quality indexing mode. The full-text search indexes all terms in the document, allowing users to query any terms and return text fragments containing those terms [1].

TiDB_Vector supports full-text search, implemented in Dify through the search_by_full_text method in the TencentVector class:

def search_by_full_text(self):
    hits_by_full_text = self.vector.search_by_full_text(query=get_example_text())
    assert len(hits_by_full_text) == 0

In summary:


crazywoola commented 1 month ago

Qdrant does support full-text search.

Try searching for search_by_full_text in the code repo.
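
One way to make that repo search more systematic is to check, for each vector store class, whether `search_by_full_text` does any real work or is just a stub that returns an empty list, like the Milvus one quoted earlier. The sketch below does this with the standard `ast` module on two hypothetical classes embedded as source text; the class names and the `is_stub` and `classify` helpers are illustrative and not part of Dify.

```python
import ast

# Hypothetical sample source; in practice this would be read from repo files.
SOURCE = '''
class StubVector:
    def search_by_full_text(self, query, **kwargs):
        # backend doesn't support bm25 search
        return []

class RealVector:
    def search_by_full_text(self, query, **kwargs):
        hits = self._client.search(query)
        return [h for h in hits]
'''


def is_stub(func: ast.FunctionDef) -> bool:
    # Skip a leading docstring, then look for a lone `return []`.
    body = func.body
    if (
        body
        and isinstance(body[0], ast.Expr)
        and isinstance(body[0].value, ast.Constant)
        and isinstance(body[0].value.value, str)
    ):
        body = body[1:]
    if len(body) != 1 or not isinstance(body[0], ast.Return):
        return False
    value = body[0].value
    return isinstance(value, ast.List) and not value.elts


def classify(source: str) -> dict[str, bool]:
    """Map class name -> True if its search_by_full_text is a stub."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == "search_by_full_text":
                    result[node.name] = is_stub(item)
    return result


print(classify(SOURCE))  # {'StubVector': True, 'RealVector': False}
```

A non-stub method only shows that something is implemented; as the Weaviate reports in this thread suggest, it can still return no hits at runtime.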

Q-Ot commented 1 month ago

@dosu @crazywoola Thank you for your response. I've searched for search_by_full_text in the code repo. May I assume that only Qdrant, MyScale, OpenSearch, Analyticdb_vector, oraclevector, and Weaviate support full-text search? Additionally, in my tests with Weaviate, the search_by_full_text method always returns an empty list.

mzh1996 commented 1 week ago

@dosu @crazywoola Thank you for your response. I've searched for search_by_full_text in the code repo. May I assume that only Qdrant, MyScale, OpenSearch, Analyticdb_vector, oraclevector, and Weaviate support full-text search? Additionally, in my tests with Weaviate, the search_by_full_text method always returns an empty list.

+1. The returned list is empty even when the query is a sentence copied directly from a document in the corresponding dataset.