qdrant / qdrant-haystack

An integration of Qdrant ANN vector database backend with Haystack
Apache License 2.0

DocumentStore improvements to support it in YAML pipelines #19

Closed TuanaCelik closed 1 year ago

TuanaCelik commented 1 year ago

Hey @kacperlukawski, a community member brought up on our Discord that they were not able to use QdrantDocumentStore in YAML pipelines. I did some investigating and found 2 issues, one of which is an easy fix. Here is a colab to reproduce the issue(s), along with some suggestions to resolve them:

  1. You'll notice that there's an error that says: Nodes cannot use variadic parameters like *args or **kwargs in their __init__ function. Hopefully, this will no longer be an issue once we move to Haystack v2. In the meantime, you can resolve this error by changing the __init__ slightly so that it takes an explicit client_kwargs dict instead of **kwargs, as shown below. It might also be a good idea to include this YAML pipeline (which is a bare-bones pipeline) as a test in your test suite.
def __init__(
        self,
        location: Optional[str] = None,
        url: Optional[str] = None,
        port: int = 6333,
        grpc_port: int = 6334,
        prefer_grpc: bool = False,
        https: Optional[bool] = None,
        api_key: Optional[str] = None,
        prefix: Optional[str] = None,
        timeout: Optional[float] = None,
        host: Optional[str] = None,
        path: Optional[str] = None,
        index: str = "Document",
        embedding_dim: int = 768,
        content_field: str = "content",
        name_field: str = "name",
        embedding_field: str = "vector",
        similarity: str = "cosine",
        return_embedding: bool = False,
        progress_bar: bool = True,
        duplicate_documents: str = "overwrite",
        recreate_index: bool = False,
        shard_number: Optional[int] = None,
        replication_factor: Optional[int] = None,
        write_consistency_factor: Optional[int] = None,
        on_disk_payload: Optional[bool] = None,
        hnsw_config: Optional[Union[types.HnswConfigDiff, dict]] = None,
        optimizers_config: Optional[types.OptimizersConfigDiff] = None,
        wal_config: Optional[types.WalConfigDiff] = None,
        quantization_config: Optional[types.QuantizationConfig] = None,
        init_from: Optional[types.InitFrom] = None,
        client_kwargs: Optional[dict] = None,
    ):

        self.client = qdrant_client.QdrantClient(
            location=location,
            url=url,
            port=port,
            grpc_port=grpc_port,
            prefer_grpc=prefer_grpc,
            https=https,
            api_key=api_key,
            prefix=prefix,
            timeout=timeout,
            host=host,
            path=path,
            **(client_kwargs or {}),  # client_kwargs defaults to None, and ** cannot unpack None
        )
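
To make the first error concrete, here is a minimal sketch of how a framework could detect variadic parameters in an __init__ signature. It is not Haystack's actual validation code; the helper has_variadic_params and the two toy classes are hypothetical names used only for illustration.

```python
import inspect


def has_variadic_params(cls) -> bool:
    """Return True if cls.__init__ declares *args or **kwargs."""
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    params = inspect.signature(cls.__init__).parameters.values()
    return any(p.kind in variadic for p in params)


# Hypothetical stand-in for a store that still accepts **kwargs
class StoreWithKwargs:
    def __init__(self, index: str = "Document", **kwargs):
        self.index = index
        self.client_kwargs = kwargs


# Hypothetical stand-in using an explicit dict, as in the suggestion above
class StoreWithExplicitDict:
    def __init__(self, index: str = "Document", client_kwargs=None):
        self.index = index
        self.client_kwargs = client_kwargs or {}
```

With an explicit client_kwargs dict, every accepted parameter is visible in the signature, which is what a schema generator needs in order to describe the node.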
  2. And now comes the less nice issue. You will notice another log that says: ValidationError: {'name': 'DocumentStore', 'type': 'QdrantDocumentStore', 'params': {'host': ':memory:', 'index': 'Document', 'embedding_dim': 512, 'recreate_index': True}} is not valid under any of the given schemas. Haystack YAML pipelines can only work with serializable objects, and some of the QdrantDocumentStore parameters, such as hnsw_config, optimizers_config, wal_config, quantization_config and init_from, are not. One suggestion we had for this is providing a string-to-type dict. This way, for the Haystack DocumentStore init you only pass the key of the type that you want to use as a string, and the actual type is then passed to the QdrantClient. The other option, which is less ideal, would be to remove these parameters until Haystack v2.
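
The string-to-type dict suggestion could look roughly like the sketch below. Everything here is an assumption made for illustration: the dataclasses are stand-ins for the real qdrant_client config types so that the snippet runs without the library, and CONFIG_TYPES and resolve_config are hypothetical names, not part of qdrant-haystack.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


# Stand-ins for the real config models (assumed shapes, illustration only)
@dataclass
class HnswConfig:
    m: int = 16
    ef_construct: int = 100


@dataclass
class ScalarQuantization:
    quantile: float = 0.99


# Hypothetical registry: YAML-friendly string key -> config class
CONFIG_TYPES = {
    "hnsw": HnswConfig,
    "scalar_quantization": ScalarQuantization,
}


def resolve_config(key: Optional[str], params: Optional[Dict[str, Any]] = None):
    """Turn a string key plus plain params into the typed config object."""
    if key is None:
        return None
    if key not in CONFIG_TYPES:
        raise ValueError(
            f"Unknown config type {key!r}; expected one of {sorted(CONFIG_TYPES)}"
        )
    return CONFIG_TYPES[key](**(params or {}))
```

The document store's __init__ would then accept only strings and plain dicts, which a YAML pipeline can express, and build the typed object just before handing it to the client.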
kacperlukawski commented 1 year ago

@TuanaCelik I guess all the parameters should already be serializable, as we also use Pydantic. I'll try to make things work, even with the current interface, except for the **kwargs.