run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

Indexes cannot be created correctly using the MilvusVectorStore. #14779

Closed. hingkan closed this issue 1 month ago.

hingkan commented 1 month ago

Question

When I used MilvusVectorStore to create collections and indexes in Milvus, only the collections were successfully created and the index creation failed.

CODE:

vector_store = MilvusVectorStore(
    uri=self.uri,
    collection_name=self._collection,
    similarity_metric=self.metric_type,
    dim=self.dim,
    overwrite=_overwrite,
    enable_sparse=True,
    sparse_embedding_function=MyEmbeddingFunction(),
    hybrid_ranker=self.hybrid_ranker,
    hybrid_ranker_params=self.hybrid_ranker_params,
)

ERROR: pymilvus.exceptions.MilvusException: <MilvusException: (code=1100, message=create index on 104 field is not supported: invalid parameter[expected=supported field][actual=create index on 104 field])>
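
The `104` in this message looks like a numeric field type code rather than a field name. As a rough sketch, assuming the code follows the pymilvus `DataType` enum values (where 104 is `SPARSE_FLOAT_VECTOR`), the message can be decoded as follows. The lookup table and the `decode_field_type` helper are illustrative only and are not part of pymilvus or LlamaIndex:

```python
import re

# Subset of pymilvus DataType enum values for vector fields (assumed mapping).
VECTOR_TYPE_CODES = {
    100: "BINARY_VECTOR",
    101: "FLOAT_VECTOR",
    102: "FLOAT16_VECTOR",
    103: "BFLOAT16_VECTOR",
    104: "SPARSE_FLOAT_VECTOR",
}


def decode_field_type(message: str) -> str:
    """Extract the numeric type code from a 'create index on N field' error."""
    match = re.search(r"create index on (\d+) field", message)
    if match is None:
        return "unknown"
    return VECTOR_TYPE_CODES.get(int(match.group(1)), "unknown")


msg = "create index on 104 field is not supported: invalid parameter"
print(decode_field_type(msg))  # SPARSE_FLOAT_VECTOR
```

If that reading is right, the failing index is the one on the sparse vector field that enable_sparse=True adds, which may mean the Milvus server in use does not support indexing sparse vectors.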

dosubot[bot] commented 1 month ago

Hey @hingkan! I'm here to help you with any issues you're facing. I'm a bot that can assist with bug fixes, questions, and becoming a contributor. Let's tackle this together!

To resolve the error "pymilvus.exceptions.MilvusException: <MilvusException: (code=1100, message=create index on 104 field is not supported: invalid parameter[expected=supported field][actual=create index on 104 field])>" and successfully create indexes using MilvusVectorStore, ensure that the field you are trying to create an index on is supported by Milvus.

Here's an example of how to correctly set up and create an index using MilvusVectorStore:

from llama_index.core import StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Create an index over the documents
vector_store = MilvusVectorStore(
    dim=1536,
    overwrite=True,
    enable_sparse=True,
    hybrid_ranker="RRFRanker",
    hybrid_ranker_params={"k": 60},
    index_config={
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 128}
    }
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Ensure that the dim parameter matches the dimension of your embeddings and that the embedding_field is correctly specified if you are using a custom field. The overwrite=True parameter ensures that any existing collection with the same name is overwritten, which can help avoid conflicts.

If you are using a custom sparse embedding function, make sure it is correctly implemented and passed to the MilvusVectorStore:

from FlagEmbedding import BGEM3FlagModel
from typing import List
from llama_index.vector_stores.milvus.utils import BaseSparseEmbeddingFunction

class ExampleEmbeddingFunction(BaseSparseEmbeddingFunction):
    def __init__(self):
        self.model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)

    def encode_queries(self, queries: List[str]):
        outputs = self.model.encode(
            queries,
            return_dense=False,
            return_sparse=True,
            return_colbert_vecs=False,
        )["lexical_weights"]
        return [self._to_standard_dict(output) for output in outputs]

    def encode_documents(self, documents: List[str]):
        outputs = self.model.encode(
            documents,
            return_dense=False,
            return_sparse=True,
            return_colbert_vecs=False,
        )["lexical_weights"]
        return [self._to_standard_dict(output) for output in outputs]

    def _to_standard_dict(self, raw_output):
        result = {}
        for k in raw_output:
            result[int(k)] = raw_output[k]
        return result

# Use the custom sparse embedding function
vector_store = MilvusVectorStore(
    dim=1536,
    overwrite=True,
    enable_sparse=True,
    sparse_embedding_function=ExampleEmbeddingFunction(),
    hybrid_ranker="RRFRanker",
    hybrid_ranker_params={"k": 60},
    index_config={
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 128}
    }
)

Ensure that the index_config parameter is correctly set when creating the MilvusVectorStore. This parameter is crucial for specifying the configuration used for building the Milvus index [1][2][3].
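
For reference, the `_to_standard_dict` helper in the class above only converts the string token ids into integer keys, and each `encode_*` method returns one such dict per input text. A minimal standalone sketch of that conversion, using made-up sample weights (the values are illustrative, not real model output):

```python
from typing import Dict


def to_standard_dict(raw_output: Dict[str, float]) -> Dict[int, float]:
    """Convert string token ids to integer keys, keeping the weights unchanged."""
    return {int(token_id): weight for token_id, weight in raw_output.items()}


# Hypothetical lexical weights keyed by token id strings.
sample = {"101": 0.25, "2054": 0.5}
print(to_standard_dict(sample))  # {101: 0.25, 2054: 0.5}
```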

To continue talking to Dosu, mention @dosu.

hingkan commented 1 month ago

After adding index_config, the error is still the same. Only the collection is created, no index is created.

hingkan commented 1 month ago

Does MilvusVectorStore only support Milvus Lite?

dosubot[bot] commented 1 month ago

The error message you are encountering suggests that the field you are trying to create an index on is not supported. This could be due to an incorrect field type or an unsupported operation for that field.

To resolve the index creation issue, ensure that the field you are indexing is of a supported type for the specified index. Here is a revised version of your code snippet with a focus on ensuring the correct field type:

vector_store = MilvusVectorStore(
    uri=self.uri,
    collection_name=self._collection,
    similarity_metric=self.metric_type,
    dim=self.dim,
    overwrite=_overwrite,
    enable_sparse=True,
    hybrid_ranker=self.hybrid_ranker,
    hybrid_ranker_params=self.hybrid_ranker_params,
    index_config={
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 128}
    }
)

Ensure that the dim parameter matches the dimensionality of the vectors you are storing and that the field you are indexing is a vector field.

Additionally, Milvus Lite is available in the MilvusVectorStore. You can use it by setting the uri to a local file path, which automatically utilizes Milvus Lite:

vector_store = MilvusVectorStore(
    uri="./milvus_demo.db", dim=1536, overwrite=True
)

If the issue persists, double-check the field types in your collection schema and ensure they are compatible with the index type you are trying to create. The _create_index_if_required method in the MilvusVectorStore class ensures that the index is created if required. Make sure that the embedding_field is correctly specified and that the index_config parameters are correctly set [1][2][3].


hingkan commented 1 month ago

The same error still occurs. How do I correctly create collections and indexes on Milvus?

dosubot[bot] commented 1 month ago

To correctly create collections and indexes on Milvus using MilvusVectorStore without encountering the error related to unsupported field indexing, ensure that the index_config and search_config are properly set and compatible with the index type specified. Here is an example setup:

from llama_index.vector_stores.milvus import MilvusVectorStore

# Setup MilvusVectorStore
vector_store = MilvusVectorStore(
    dim=1536,
    collection_name="your_collection_name",
    uri="http://milvus_address:port",
    token="your_milvus_token_here",
    overwrite=True,
    index_config={
        "index_type": "IVF_FLAT",  # Example index type
        "nlist": 128  # Example parameter for the index
    },
    search_config={
        "metric_type": "L2",  # Ensure this matches the similarity metric
        "params": {"nprobe": 10}  # Example search parameter
    }
)

In this setup, replace "IVF_FLAT" and the other parameters with values appropriate for your use case. Also make sure that the similarity_metric parameter passed to MilvusVectorStore matches the metric_type in search_config.

For a more detailed example, you can refer to the Milvus Hybrid Index Demo which demonstrates how to build a Retrieval-Augmented Generation (RAG) system using LlamaIndex and Milvus [1][2][3][4].


hingkan commented 1 month ago

I simplified some parameters, and then a new error appeared.

CODE:

vector_store = MilvusVectorStore(
    uri="http://localhost:19530/",
    dim=1024,
    overwrite=True,
    index_config={
        'metric_type': 'IP',
        'index_type': "IVF_FLAT",
        'params': {'nlist': 1024}
    }
)

ERROR: pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=nlist out of range: [1, 65536])>

dosubot[bot] commented 1 month ago

The "nlist out of range" error indicates that the nlist parameter value is outside the acceptable range for the specified index type. For IVF_FLAT, the nlist parameter must be within the range [1, 65536].

To resolve this, ensure that the nlist parameter is set within this range. Here is the corrected code snippet:

vector_store = MilvusVectorStore(
    uri="http://localhost:19530/",
    dim=1024,
    overwrite=True,
    index_config={
        'metric_type': 'IP',
        'index_type': "IVF_FLAT",
        'params': {'nlist': 128}  # Set nlist within the valid range
    }
)

In this example, nlist is set to 128, which is within the valid range for IVF_FLAT [1].
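
A small guard like the following can catch a missing or out-of-range nlist before the request reaches Milvus. This is only a sketch: `check_nlist` is a hypothetical helper, and the bounds are taken from the error message quoted in this thread:

```python
from typing import Any, Mapping

NLIST_MIN, NLIST_MAX = 1, 65536  # bounds quoted in the Milvus error message


def check_nlist(params: Mapping[str, Any]) -> int:
    """Return nlist if present and in range, otherwise raise ValueError."""
    if "nlist" not in params:
        raise ValueError("nlist is missing from the index parameters")
    nlist = int(params["nlist"])
    if not NLIST_MIN <= nlist <= NLIST_MAX:
        raise ValueError(f"nlist {nlist} is outside [{NLIST_MIN}, {NLIST_MAX}]")
    return nlist


print(check_nlist({"nlist": 1024}))  # 1024
```

Note that 1024 passes this check, so an "out of range" error for that value suggests nlist is not reaching Milvus in the place where it is expected.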


hingkan commented 1 month ago

With nlist=1024, the error 'nlist out of range: [1, 65536]' still appears.

dosubot[bot] commented 1 month ago

To avoid the "nlist out of range" error and successfully create the index on Milvus using MilvusVectorStore, ensure that the nlist parameter is set correctly within the index_config. The nlist parameter should be within the range [1, 65536] for the IVF_FLAT index type.

Here is an example of how to set it correctly:

vector_store = MilvusVectorStore(
    uri="http://localhost:19530/",
    dim=1024,
    overwrite=True,
    index_config={
        'metric_type': 'IP',
        'index_type': "IVF_FLAT",
        'nlist': 1024  # Ensure nlist is within the valid range
    }
)

Ensure that the value of nlist is within the acceptable range for the index type you are using. If you encounter the "nlist out of range" error, you may need to adjust the value of nlist based on the size of your dataset and the specific requirements of the index type [1].
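
In the snippet above, nlist sits directly in index_config rather than under a nested 'params' key. If you have configs written in the nested style, a hypothetical helper such as `flatten_index_config` (not part of LlamaIndex) could convert them. This sketch assumes MilvusVectorStore expects index parameters at the top level of index_config, which should be verified against your installed version:

```python
from typing import Any, Dict


def flatten_index_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Lift entries of a nested 'params' dict to the top level of the config."""
    flat = {k: v for k, v in config.items() if k != "params"}
    nested = config.get("params") or {}
    for key, value in nested.items():
        # Keep an explicit top-level value if both forms are present.
        flat.setdefault(key, value)
    return flat


nested_config = {
    "metric_type": "IP",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024},
}
print(flatten_index_config(nested_config))
# {'metric_type': 'IP', 'index_type': 'IVF_FLAT', 'nlist': 1024}
```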


hingkan commented 1 month ago

The 'nlist' problem is resolved, but a new error appears. ERROR: pymilvus.exceptions.MilvusException: <MilvusException: (code=1100, message=the num_rows (1) of field (embedding) is not equal to passed num_rows (3): invalid parameter[expected=3][actual=3])>

dosubot[bot] commented 1 month ago

The error pymilvus.exceptions.MilvusException: <MilvusException: (code=1100, message=the num_rows (1) of field (embedding) is not equal to passed num_rows (3): invalid parameter[expected=3][actual=3])> indicates a mismatch between the number of rows in the embedding field and the number of rows expected by Milvus.

To resolve this error, ensure that the number of embeddings you are inserting matches the number of rows expected by Milvus. Specifically, when adding nodes to the Milvus collection, each node should have an embedding that matches the expected dimension and count.

Here is the relevant part of the MilvusVectorStore class that handles the insertion of nodes:

def add(self, nodes: List[BaseNode], **add_kwargs: Any) -> List[str]:
    """Add the embeddings and their nodes into Milvus.

    Args:
        nodes (List[BaseNode]): List of nodes with embeddings
            to insert.

    Raises:
        MilvusException: Failed to insert data.

    Returns:
        List[str]: List of ids inserted.
    """
    insert_list = []
    insert_ids = []

    # Process that data we are going to insert
    for node in nodes:
        entry = node_to_metadata_dict(node)
        entry[MILVUS_ID_FIELD] = node.node_id
        entry[self.embedding_field] = node.embedding

        insert_ids.append(node.node_id)
        insert_list.append(entry)

    # Insert the data into milvus
    self.collection.insert(insert_list)
    self.collection.flush()
    self._create_index_if_required()
    logger.debug(
        f"Successfully inserted embeddings into: {self.collection_name} "
        f"Num Inserted: {len(insert_list)}"
    )
    return insert_ids

Ensure that each node in the nodes list has a properly formatted embedding. The embedding should be a list or array with the correct number of dimensions as specified when the collection was created.

For example, if your collection expects embeddings of dimension 128, each node's embedding should be a list or array of length 128.

If you are still encountering issues, double-check the following:

  1. The dimension of the embeddings matches the dimension specified during collection creation.
  2. The number of embeddings being inserted matches the number of rows expected by Milvus.
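
Both checks can be run before calling add. The following sketch assumes the embeddings are available as plain Python lists; `validate_embeddings` is a hypothetical helper, not a LlamaIndex or pymilvus function:

```python
from typing import List, Optional, Sequence


def validate_embeddings(
    embeddings: Sequence[Optional[List[float]]], expected_rows: int, dim: int
) -> None:
    """Raise ValueError if the row count or any embedding dimension is wrong."""
    if len(embeddings) != expected_rows:
        raise ValueError(
            f"expected {expected_rows} embeddings, got {len(embeddings)}"
        )
    for i, emb in enumerate(embeddings):
        if emb is None:
            raise ValueError(f"row {i} has no embedding")
        if len(emb) != dim:
            raise ValueError(f"row {i} has dimension {len(emb)}, expected {dim}")


# Three rows of dimension 4 pass; a single row for three nodes would not.
validate_embeddings([[0.0] * 4 for _ in range(3)], expected_rows=3, dim=4)
```

A mismatch such as one embedding row for three nodes would fail the first check, which mirrors the num_rows error reported above.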

Additionally, ensure that the collection is properly configured and created. Here is an example of setting up the MilvusVectorStore:

from llama_index.vector_stores.milvus import MilvusVectorStore

# Setup MilvusVectorStore
vector_store = MilvusVectorStore(
    dim=1536,
    collection_name="your_collection_name",
    uri="http://milvus_address:port",
    token="your_milvus_token_here",
    overwrite=True
)

If the issue persists, you may need to review the data being passed to ensure it is correctly formatted and consistent with the collection's schema [1][2].


hingkan commented 1 month ago

With enable_sparse=False, MilvusVectorStore creates both the collection and the index in Milvus.

hingkan commented 1 month ago

When creating a Milvus index with MilvusVectorStore, setting enable_sparse to True means no index is created, but I want to use Milvus's hybrid search.