langchain-ai / langchain

πŸ¦œπŸ”— Build context-aware reasoning applications
https://python.langchain.com
MIT License

Making a generalized method to ingest documents in any vector database. #15944

Closed arthurbrenno closed 3 months ago

arthurbrenno commented 8 months ago

Discussed in https://github.com/langchain-ai/langchain/discussions/15877

Originally posted by **arthurbrenno** January 11, 2024

I'm trying to make a **generalized method** that **works with any vector database**, so I can **change the implementation** later. The main focus is simply ingesting documents into a vector store. My main issue has always been **_data persistence_** and the _flexibility_ to change the vector database implementation, because I like to explore different alternatives and there are many of them. That's why I'm (trying) to make a method like this. Here's a snippet of my implementation:

```py
from typing import Any, Dict, List, Type

from langchain.text_splitter import TextSplitter
from langchain.vectorstores import VectorStore
from langchain_community.document_loaders.base import BaseLoader
from langchain_core.documents.base import Document
from langchain_core.embeddings import Embeddings

from conversai.core.services import BaseService


class IngestionService(BaseService):
    """Class that handles the ingestion of documents inside a vector store."""

    async def execute(self,
                      url: str,
                      collection: str,
                      loader_class: Type[BaseLoader],
                      splitter: TextSplitter,
                      embeddings: Embeddings,
                      vector_store_cls: Type[VectorStore],
                      **vector_store_kwargs) -> Dict[str, Any]:
        """Ingests a document into the desired vector store.

        Args:
            url (str): URL of the document to ingest.
            collection (str): Collection of the vector store in which to ingest the document.
            loader_class (Type[BaseLoader]): The loader class for the document.
            splitter (TextSplitter): The splitter to apply to the document.
            embeddings (Embeddings): The embedding model to apply to the split documents.
            vector_store_cls (Type[VectorStore]): The vector store class to store the document in.

        Returns:
            Dict[str, Any]: A success flag and the ids of the ingested texts.
        """
        # Instantiate the loader
        loader: BaseLoader = loader_class(url)

        # Load and split the document
        documents: List[Document] = loader.load_and_split(text_splitter=splitter)

        # Instantiate the vector store (assuming every implementation
        # supports collection_name and embedding_func)
        vector_store: VectorStore = vector_store_cls(collection_name=collection,
                                                     embedding_func=embeddings,
                                                     **vector_store_kwargs)

        # ids of the created texts
        texts_ids: List[str] = await vector_store.aadd_documents(documents)
        return {"success": True, "ingested_texts_ids": texts_ids}
```

The problem is that every vector database seems to have a different way of persisting itself (as I understand it). For example:

Chroma:

```py
db = Chroma(persist_directory=...)
```

(or using their separate client within Docker)

```py
db = Chroma(client=client)
```

FAISS:

```py
# FAISS has
db.save_local("faiss_index")
```

I need to ingest documents **at runtime.** This is important. I also need to group them inside `collections` for better searching later and to group similar documents. Creating collections is one of my future implementations.

My current vector store implementation is `Milvus`. And I think I'm generalizing too much when I assume every vector db will have a "collection_name" and "embedding_func", as written in:

```py
vector_store: VectorStore = vector_store_cls(collection_name=collection,
                                             embedding_func=embeddings,
                                             **vector_store_kwargs)
```

Any support on making my method better is appreciated. I'm also reading the documentation and trying to solve this myself, but I think community help would also be a great starting point. One of my suggestions (I've seen some people talking about this) is making the documentation clearer about data persistence. The examples only show short-term usage of the vector stores, which is not the case in many production environments.
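
For reference, here is roughly what the two persistence models look like side by side (a minimal sketch; the paths are placeholders, `embeddings` and `documents` stand in for whatever embedding model and loaded documents are in use, and on newer LangChain versions `FAISS.load_local` additionally requires `allow_dangerous_deserialization=True`):

```py
from langchain_community.vectorstores import FAISS, Chroma

# Chroma: persistence is configured at construction time; with a
# persist_directory set, writes land on disk under that path
db = Chroma(collection_name="docs", embedding_function=embeddings,
            persist_directory="./chroma_db")
db.add_documents(documents)

# FAISS: the index lives in memory and must be saved/loaded explicitly
db = FAISS.from_documents(documents, embeddings)
db.save_local("faiss_index")
db = FAISS.load_local("faiss_index", embeddings)
```
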
dosubot[bot] commented 8 months ago

πŸ€–

Hello @arthurbrenno!

I'm Dosu, a friendly bot here to lend a helping hand with your LangChain-related questions, bugs, and contributions. While we're waiting for a human maintainer to step in, I'm more than happy to assist you. Let's work together on this!

Your approach to creating a generalized method for ingesting documents into any vector database is a good start. However, as you've noticed, different vector databases have different ways of persisting data. Here are some suggestions to improve your method:

  1. Abstract Persistence Layer: You could create an abstract persistence layer that defines common methods for persisting and retrieving data. Each vector database class would then implement these methods in a way that's appropriate for that specific database. This would allow you to switch between different databases without changing the rest of your code.

  2. Factory Pattern: You could use a factory pattern to create instances of your vector database classes. The factory would take in parameters like the type of database, the collection name, and the embedding function, and it would return an instance of the appropriate database class.

  3. Adapter Pattern: You could use an adapter pattern to create a consistent interface for different databases. Each adapter would wrap a specific database and translate calls to the adapter into calls to the database's API.

Here's an example of how you might implement these ideas:

```py
from abc import ABC, abstractmethod
from typing import List

from langchain_community.vectorstores import FAISS, Chroma
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings


class AbstractVectorStore(ABC):
    @abstractmethod
    def persist(self, documents: List[Document]):
        pass

    @abstractmethod
    def retrieve(self, ids: List[str]) -> List[Document]:
        pass


class VectorStoreFactory:
    def create_vector_store(self, store_type: str, collection: str,
                            embeddings: Embeddings, **kwargs) -> AbstractVectorStore:
        if store_type == 'Chroma':
            return ChromaVectorStore(collection, embeddings, **kwargs)
        elif store_type == 'FAISS':
            return FaissVectorStore(collection, embeddings, **kwargs)
        # Add more elif clauses for other types of vector stores
        else:
            raise ValueError(f"Unknown vector store type: {store_type}")


class ChromaVectorStore(AbstractVectorStore):
    def __init__(self, collection: str, embeddings: Embeddings, **kwargs):
        self.db = Chroma(collection_name=collection,
                         embedding_function=embeddings,
                         persist_directory=kwargs.get('persist_directory'))

    def persist(self, documents: List[Document]):
        # Implement persistence for Chroma
        pass

    def retrieve(self, ids: List[str]) -> List[Document]:
        # Implement retrieval for Chroma
        pass


class FaissVectorStore(AbstractVectorStore):
    def __init__(self, collection: str, embeddings: Embeddings, **kwargs):
        # FAISS has no bare constructor; the index is built from the first
        # batch of documents (e.g. via FAISS.from_documents at persist time)
        self.embeddings = embeddings
        self.db = None

    def persist(self, documents: List[Document]):
        # Implement persistence for FAISS
        pass

    def retrieve(self, ids: List[str]) -> List[Document]:
        # Implement retrieval for FAISS
        pass
```
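
As a sketch of what the FAISS stubs might look like once filled in — the `index_dir` kwarg is a hypothetical name, and `allow_dangerous_deserialization=True` is only required on newer LangChain versions:

```py
import os


class PersistedFaissStore(AbstractVectorStore):
    """Hypothetical FAISS adapter that saves/loads its index from disk."""

    def __init__(self, collection: str, embeddings: Embeddings, **kwargs):
        self.path = os.path.join(kwargs.get("index_dir", "indexes"), collection)
        self.embeddings = embeddings
        # Reload a previously saved index if one exists on disk
        self.db = (FAISS.load_local(self.path, embeddings,
                                    allow_dangerous_deserialization=True)
                   if os.path.exists(self.path) else None)

    def persist(self, documents: List[Document]) -> List[str]:
        if self.db is None:
            # First batch: FAISS builds its index from the documents
            self.db = FAISS.from_documents(documents, self.embeddings)
            ids = list(self.db.index_to_docstore_id.values())
        else:
            ids = self.db.add_documents(documents)
        self.db.save_local(self.path)  # explicit save, unlike Chroma
        return ids

    def retrieve(self, ids: List[str]) -> List[Document]:
        # FAISS keeps Documents in an in-memory docstore keyed by id
        return [self.db.docstore.search(i) for i in ids]
```
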

In your IngestionService, you would then use the VectorStoreFactory to create an instance of the appropriate VectorStore:

```py
class IngestionService(BaseService):
    async def execute(self, url: str, collection: str, loader_class: Type[BaseLoader],
                      splitter: TextSplitter, embeddings: Embeddings,
                      vector_store_type: str, **vector_store_kwargs) -> Dict[str, Any]:
        loader: BaseLoader = loader_class(url)
        documents: List[Document] = loader.load_and_split(text_splitter=splitter)
        vector_store: AbstractVectorStore = VectorStoreFactory().create_vector_store(
            vector_store_type, collection, embeddings, **vector_store_kwargs)
        vector_store.persist(documents)
        return {"success": True}
```
This approach should give you the flexibility to switch between different vector databases while keeping your ingestion code consistent. It also provides a clear separation of concerns, with each class responsible for a specific part of the process.
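
It's also worth noting that most LangChain vector store integrations already share one uniform construction hook: the `from_documents` classmethod defined on the `VectorStore` base class. If the adapter layer feels heavy, a thinner generalization is to pass the class itself and forward store-specific options (like `persist_directory` or `collection_name`) through `**kwargs`:

```py
from typing import Type

from langchain_core.vectorstores import VectorStore


def ingest(documents, embeddings, vector_store_cls: Type[VectorStore],
           **kwargs) -> VectorStore:
    # from_documents embeds and stores the documents, returning a ready
    # vector store; Chroma, FAISS, Milvus, etc. all implement it
    return vector_store_cls.from_documents(documents, embeddings, **kwargs)
```
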

As for the documentation, I agree that it could be clearer about data persistence. I'll make a note of this and see if we can improve it in future versions of LangChain.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a πŸ‘ if this is helpful and πŸ‘Ž if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.