langflow-ai / langflow

Langflow is a low-code app builder for RAG and multi-agent AI applications. It’s Python-based and agnostic to any model, API, or database.
http://www.langflow.org
MIT License

Qdrant Vector Store doesn't use the advanced fields when ingesting documents #3749

Closed: morrizon closed this issue 1 month ago

morrizon commented 2 months ago

Bug Description

The Qdrant Vector Store component ignores its configuration fields when ingesting documents, so it only works in the default case (a Qdrant service running on localhost, port 6333). Setting a different host, or a URL and API key, does not change the behaviour.

Reproduction

The test was performed in docker and we tried 2 different Qdrant configurations:

  1. host field with value "qdrant" when running qdrant via docker compose[1]
  2. url like https://xxx.europe-west3-0.gcp.cloud.qdrant.io:6333 with API key (without local qdrant service in docker)

In both cases the component failed with error #99 [2].

After debugging, we saw that the error was triggered on line 92 of the Qdrant implementation: https://github.com/langflow-ai/langflow/blob/96ca71dab855639f82492c225f044d1a212bcdaa/src/backend/base/langflow/components/vectorstores/Qdrant.py#L92

The configuration fields are not among the arguments passed on that line. They are collected in server_kwargs, which is only used when there are no documents to ingest [3].

To fix the issue, we changed that line to:

qdrant = Qdrant.from_documents(documents, embedding=self.embedding, **qdrant_kwargs, **server_kwargs)

After the change, it worked like a charm in both cases (a different host, and a URL with an API key).
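
The effect of the change can be illustrated without a running Qdrant instance. The following is only a minimal sketch: FakeClient, FakeStore, build_buggy and build_fixed are hypothetical stand-ins, not Langflow or LangChain classes. They mimic only the shape of the calls, on the assumption that a from_documents-style factory builds its own client from whatever connection arguments it receives and falls back to defaults otherwise.

```python
# Hypothetical stand-ins; they mimic only the shape of the real calls.
DEFAULT_HOST = "localhost"
DEFAULT_PORT = 6333


class FakeClient:
    """Records the connection settings it was created with."""

    def __init__(self, host=DEFAULT_HOST, port=DEFAULT_PORT, url=None, api_key=None):
        self.host, self.port, self.url, self.api_key = host, port, url, api_key


class FakeStore:
    """Hypothetical vector store, not the LangChain Qdrant class."""

    def __init__(self, client, collection_name):
        self.client = client
        self.collection_name = collection_name

    @classmethod
    def from_documents(cls, documents, collection_name, **client_kwargs):
        # The factory creates its own client from the connection kwargs
        # it is given; with none, the client falls back to its defaults.
        return cls(FakeClient(**client_kwargs), collection_name)


def build_buggy(documents, store_kwargs, server_kwargs):
    if documents:
        # server_kwargs is silently dropped on the ingestion path.
        return FakeStore.from_documents(documents, **store_kwargs)
    return FakeStore(FakeClient(**server_kwargs), **store_kwargs)


def build_fixed(documents, store_kwargs, server_kwargs):
    if documents:
        # Both sets of kwargs are forwarded, as in the proposed one-line fix.
        return FakeStore.from_documents(documents, **store_kwargs, **server_kwargs)
    return FakeStore(FakeClient(**server_kwargs), **store_kwargs)


store_kwargs = {"collection_name": "docs"}
server_kwargs = {"host": "qdrant", "port": 6333}

buggy = build_buggy(["doc"], store_kwargs, server_kwargs)
fixed = build_fixed(["doc"], store_kwargs, server_kwargs)
print(buggy.client.host)  # localhost
print(fixed.client.host)  # qdrant
```

In this sketch the buggy path connects to the default host whenever documents are present, which matches the symptom described above: the configured host is only honoured when there is nothing to ingest.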

[1] Simplified docker compose file. Note that we omitted the environment variables, volumes and network. In our test we used Traefik as the load balancer with Let's Encrypt certificates.

services:
    langflow:
        image: langflowai/langflow:1.0.17
    qdrant:
        image: qdrant/qdrant

[2] Qdrant error (screenshot)

[3] The fields are collected in the variable server_kwargs, which is only used when there are no documents (screenshot)

Expected behavior

Qdrant should also use the advanced fields when ingesting documents. A possible fix is described in the Reproduction section.

Who can help?

I'm mentioning @nicoloboschi because they were the last person to work specifically on this file. If there is no activity, I will look into making the fix myself.

Operating System

Debian 12.5 (bookworm)

Langflow Version

1.0.17

Python Version

3.12

Screenshot

No response

Flow File

No response

dosubot[bot] commented 2 months ago

Hey @morrizon, I'm here to help you with any bugs, questions, or contributions you have. Let's tackle this issue together!

The Qdrant Vector Store ignores the advanced configuration fields when ingesting documents because the component calls Qdrant.from_documents without the server parameters, so they never reach the QdrantClient. To fix this, modify the build_vector_store method in the QdrantVectorStoreComponent class so that all relevant parameters are passed.

Here's the corrected code:

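# Import paths as in the Langflow 1.0.x source; adjust them if your version differs.
from typing import List

from langchain.embeddings.base import Embeddings
from langchain_community.vectorstores import Qdrant

from langflow.base.vectorstores.model import LCVectorStoreComponent, check_cached_vector_store
from langflow.helpers.data import docs_to_data
from langflow.io import (
    DataInput,
    DropdownInput,
    HandleInput,
    IntInput,
    MultilineInput,
    SecretStrInput,
    StrInput,
)
from langflow.schema import Data


class QdrantVectorStoreComponent(LCVectorStoreComponent):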
    display_name = "Qdrant"
    description = "Qdrant Vector Store with search capabilities"
    documentation = "https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/qdrant"
    icon = "Qdrant"

    inputs = [
        StrInput(name="collection_name", display_name="Collection Name", required=True),
        StrInput(name="host", display_name="Host", value="localhost", advanced=True),
        IntInput(name="port", display_name="Port", value=6333, advanced=True),
        IntInput(name="grpc_port", display_name="gRPC Port", value=6334, advanced=True),
        SecretStrInput(name="api_key", display_name="API Key", advanced=True),
        StrInput(name="prefix", display_name="Prefix", advanced=True),
        IntInput(name="timeout", display_name="Timeout", advanced=True),
        StrInput(name="path", display_name="Path", advanced=True),
        StrInput(name="url", display_name="URL", advanced=True),
        DropdownInput(
            name="distance_func",
            display_name="Distance Function",
            options=["Cosine", "Euclidean", "Dot Product"],
            value="Cosine",
            advanced=True,
        ),
        StrInput(name="content_payload_key", display_name="Content Payload Key", value="page_content", advanced=True),
        StrInput(name="metadata_payload_key", display_name="Metadata Payload Key", value="metadata", advanced=True),
        MultilineInput(name="search_query", display_name="Search Query"),
        DataInput(
            name="ingest_data",
            display_name="Ingest Data",
            is_list=True,
        ),
        HandleInput(name="embedding", display_name="Embedding", input_types=["Embeddings"]),
        IntInput(
            name="number_of_results",
            display_name="Number of Results",
            info="Number of results to return.",
            value=4,
            advanced=True,
        ),
    ]

    @check_cached_vector_store
    def build_vector_store(self) -> Qdrant:
        qdrant_kwargs = {
            "collection_name": self.collection_name,
            "content_payload_key": self.content_payload_key,
            "metadata_payload_key": self.metadata_payload_key,
        }

        server_kwargs = {
            "host": self.host if self.host else None,
            "port": int(self.port),  # Ensure port is an integer
            "grpc_port": int(self.grpc_port),  # Ensure grpc_port is an integer
            "api_key": self.api_key,
            "prefix": self.prefix,
            "timeout": int(self.timeout) if self.timeout else None,  # Ensure timeout is an integer
            "path": self.path if self.path else None,
            "url": self.url if self.url else None,
        }

        server_kwargs = {k: v for k, v in server_kwargs.items() if v is not None}
        documents = []

        for _input in self.ingest_data or []:
            if isinstance(_input, Data):
                documents.append(_input.to_lc_document())
            else:
                documents.append(_input)

        if not isinstance(self.embedding, Embeddings):
            raise ValueError("Invalid embedding object")

        if documents:
            qdrant = Qdrant.from_documents(documents, embedding=self.embedding, **qdrant_kwargs, **server_kwargs)
        else:
            from qdrant_client import QdrantClient

            client = QdrantClient(**server_kwargs)
            qdrant = Qdrant(embeddings=self.embedding, client=client, **qdrant_kwargs)

        return qdrant

    def search_documents(self) -> List[Data]:
        vector_store = self.build_vector_store()

        if self.search_query and isinstance(self.search_query, str) and self.search_query.strip():
            docs = vector_store.similarity_search(
                query=self.search_query,
                k=self.number_of_results,
            )

            data = docs_to_data(docs)
            self.status = data
            return data
        else:
            return []

This modification ensures that the url, host, and api_key fields, among others, are correctly passed to the QdrantClient, allowing the advanced configuration fields to be used when ingesting documents [1][2].
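
The way server_kwargs is assembled and filtered can also be tried in isolation. This is a minimal sketch, not the component's code: build_server_kwargs is a hypothetical helper, and unlike the snippet above it also drops empty strings, on the assumption that an untouched text field in the UI may arrive as "" rather than None.

```python
def build_server_kwargs(host=None, port=6333, grpc_port=6334, api_key=None,
                        prefix=None, timeout=None, path=None, url=None):
    """Collect connection settings and drop the ones that were left unset.

    Hypothetical helper for illustration; the parameter names mirror the
    component's advanced fields.
    """
    raw = {
        "host": host,
        "port": int(port),
        "grpc_port": int(grpc_port),
        "api_key": api_key,
        "prefix": prefix,
        "timeout": int(timeout) if timeout else None,
        "path": path,
        "url": url,
    }
    # Assumption: None and "" both mean "not configured".
    return {key: value for key, value in raw.items() if value not in (None, "")}


local = build_server_kwargs(host="qdrant")
cloud = build_server_kwargs(url="https://example.invalid:6333", api_key="secret", prefix="")
print(local)
print(sorted(cloud))
```

Only the settings that were actually provided remain in the result, so the same dictionary can be forwarded on both the ingestion path and the search-only path.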
