langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Add Dynamic Schema support for the Milvus vector store #15690

Open ohbeep opened 5 months ago

ohbeep commented 5 months ago

Feature request

I suggest supporting the Milvus vector database's new Dynamic Schema feature.

Motivation

According to Milvus:

Dynamic schema enables users to insert entities with new fields into a Milvus collection without modifying the existing schema. This means that users can insert data without knowing the full schema of a collection and can include fields that are not yet defined.

I think LangChain should support this feature so that documents with different types or schemas can be added to the same collection.
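To make the quoted behavior concrete, here is a small conceptual sketch in plain Python. It does not call pymilvus and is not a description of how Milvus stores data internally; `split_entity` and the example field names are hypothetical and only illustrate the idea that fields outside the declared schema can still be accepted on insert.

```python
from typing import Any, Dict, Set, Tuple


def split_entity(
    entity: Dict[str, Any], schema_fields: Set[str]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split an entity into declared-schema fields and extra (dynamic) fields.

    Hypothetical helper for illustration only.
    """
    fixed = {k: v for k, v in entity.items() if k in schema_fields}
    dynamic = {k: v for k, v in entity.items() if k not in schema_fields}
    return fixed, dynamic


# Fields declared when the collection was created (assumed names).
schema_fields = {"pk", "text", "vector"}

# A document carrying a field ("page") that the schema never declared.
entity = {"pk": 1, "text": "hello", "vector": [0.1, 0.2], "page": 7}

fixed, dynamic = split_entity(entity, schema_fields)
```

With a fixed schema, the `page` field would have to be declared up front or dropped; with a dynamic schema, the insert can go through and the extra field is kept alongside the declared ones.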

Your contribution

I propose adding a `dynamic_schema` flag to the `__init__` and `from_texts` methods of the Milvus class:

__init__ method:

https://github.com/langchain-ai/langchain/blob/4c47f39fcb539fdeff6dd6d9b1f483cd9a1af69b/libs/community/langchain_community/vectorstores/milvus.py#L107-L125

Change to:

    def __init__(
        self,
        embedding_function: Embeddings,
        collection_name: str = "LangChainCollection",
        collection_description: str = "",
        connection_args: Optional[dict[str, Any]] = None,
        consistency_level: str = "Session",
        index_params: Optional[dict] = None,
        search_params: Optional[dict] = None,
        drop_old: Optional[bool] = False,
        *,
        primary_field: str = "pk",
        text_field: str = "text",
        vector_field: str = "vector",
        metadata_field: Optional[str] = None,
        partition_names: Optional[list] = None,
        replica_number: int = 1,
        timeout: Optional[float] = None,
        dynamic_schema: bool = False,
    ):

from_texts method:

https://github.com/langchain-ai/langchain/blob/4c47f39fcb539fdeff6dd6d9b1f483cd9a1af69b/libs/community/langchain_community/vectorstores/milvus.py#L839-L887

Change to:

    def from_texts(
        cls,
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        collection_name: str = "LangChainCollection",
        connection_args: dict[str, Any] = DEFAULT_MILVUS_CONNECTION,
        consistency_level: str = "Session",
        index_params: Optional[dict] = None,
        search_params: Optional[dict] = None,
        drop_old: bool = False,
        dynamic_schema: bool = False,
        **kwargs: Any,
    ) -> Milvus:

I may later submit a PR for this suggestion.

dosubot[bot] commented 5 months ago

🤖

Hey @ohbeep! I'm Dosu, a friendly bot here to assist while we wait for a human maintainer. I can help you solve bugs, answer questions, and guide you through contributing. Let's make LangChain even better together!

Thank you for your suggestion, ohbeep. Adding support for the Dynamic Schema feature of the Milvus vector database to the LangChain framework could indeed provide more flexibility and efficiency in handling different types of data.

Your proposed changes to the __init__ and from_texts methods of the Milvus class seem to be a good starting point. However, to fully implement this feature, we would also need to modify the way the schema is determined and the collection is created in these methods.

Currently, the schema for the Milvus collection is determined from the first record inserted into the collection. If the dynamic_schema flag is set to True, the schema would instead need to be determined from all the records being inserted, not just the first one. This could be done in the from_texts method, where the data is inserted into the collection.

Here is a rough example of how you could modify the from_texts method to support dynamic schemas:

    @classmethod
    def from_texts(
        cls,
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        collection_name: str = "LangChainCollection",
        connection_args: dict[str, Any] = DEFAULT_MILVUS_CONNECTION,
        consistency_level: str = "Session",
        index_params: Optional[dict] = None,
        search_params: Optional[dict] = None,
        drop_old: bool = False,
        dynamic_schema: bool = False,  # new flag for dynamic schema
        **kwargs: Any,
    ) -> Milvus:
        """Create a Milvus collection, index it with HNSW, and insert data.

        If dynamic_schema is True, the schema of the collection will be determined
        based on the data being inserted.
        """
        # existing code...

        if dynamic_schema:
            # determine schema based on data and create collection
            # (determine_schema is a placeholder that does not exist yet)
            schema = determine_schema(texts, metadatas)
            collection = pymilvus.Collection(name=collection_name, schema=schema)
        else:
            # existing code to create collection...
            ...

In this example, determine_schema would be a function that determines the schema based on the data. This is a rough example: you would need to implement the function, and adjust the rest of the code, based on the specific requirements of your application and the details of the Dynamic Schema feature of Milvus.
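As one possible reading of the placeholder above, here is a minimal sketch of a `determine_schema` helper that looks at every metadata dict instead of only the first one. It is hypothetical: it returns a plain mapping of field names to Python types rather than a pymilvus schema object, so the result would still need to be translated into whatever the collection constructor expects.

```python
from typing import Any, Dict, List, Optional


def determine_schema(
    texts: List[str], metadatas: Optional[List[Dict[str, Any]]] = None
) -> Dict[str, type]:
    """Collect every metadata key seen across all records (hypothetical helper).

    Returns a mapping of field name to the Python type of the first
    non-None value seen for that field. `texts` is unused here and is kept
    only to match the call in the example above.
    """
    fields: Dict[str, type] = {}
    for metadata in metadatas or []:
        for key, value in metadata.items():
            if value is None:
                continue  # a missing value tells us nothing about the type
            seen = fields.setdefault(key, type(value))
            if seen is not type(value):
                raise TypeError(
                    f"Field {key!r} has conflicting types: "
                    f"{seen.__name__} and {type(value).__name__}"
                )
    return fields


schema = determine_schema(
    ["first text", "second text"],
    [{"source": "a.pdf"}, {"source": "b.pdf", "page": 3}],
)
```

Here the second record contributes a `page` field that inspecting only the first record would miss. Raising on conflicting types is a design choice for the sketch; a real implementation might coerce values or fall back to a generic type instead.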

I look forward to your PR. If you have any questions or need further clarification, feel free to ask.

Best regards, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


slafeir commented 3 months ago

@ohbeep What is the status on this? I need this for my Milvus collections and would rather not fork.

lauradang commented 2 months ago

I have opened a PR here: https://github.com/langchain-ai/langchain/pull/21172. Hopefully, it gets merged soon.