run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Persisting vectors in MongoDB (Cosmos DB) #13889

Closed aganiezgoda closed 2 months ago

aganiezgoda commented 2 months ago

Bug Description

I'm trying to do RAG based on property graphs, using MongoDB (the Cosmos DB API) for vector persistence:

import os

import pymongo

from llama_index.core import PropertyGraphIndex, StorageContext
from llama_index.core.graph_stores import SimplePropertyGraphStore
from llama_index.core.indices.property_graph import (
    ImplicitPathExtractor,
    SimpleLLMPathExtractor,
)
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.vector_stores.azurecosmosmongo import AzureCosmosDBMongoDBVectorSearch

llm = AzureOpenAI(
    engine="xxx",
    model="gpt-35-turbo-16k",
    temperature=0.0,
    azure_endpoint="https://xxx.openai.azure.com/",
    api_key="xxxx",
    api_version="2023-07-01-preview",
)

embeddings = AzureOpenAIEmbedding(
    engine="xxx",
    model="text-embedding-ada-002",
    azure_endpoint="https://xxx.openai.azure.com/",
    api_key="xxx",
    api_version="2023-12-01-preview",
)

########### MONGO DB

connection_string = os.environ.get("AZURE_COSMOSDB_MONGODB_URI")
mongodb_client = pymongo.MongoClient(connection_string)

vector_store = AzureCosmosDBMongoDBVectorSearch(
    mongodb_client=mongodb_client,
    db_name="demo_vectordb",
    collection_name="aaa"
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

graph_store = SimplePropertyGraphStore.from_persist_dir("./storage")

index = PropertyGraphIndex.from_documents(
    documents,
    llm=llm,
    embed_model=embeddings,
    property_graph_store=graph_store,
    vector_store=vector_store,
    kg_extractors=[
        ImplicitPathExtractor(llm=llm),
        SimpleLLMPathExtractor(
            llm=llm,
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    show_progress=True,
)

index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    # optional, neo4j also supports vectors directly
    vector_store=vector_store,
    # storage_context=storage_context,
    embed_kg_nodes=True,
    llm=llm,
    embed_model=embeddings,
)

The last part fails. Error message, shortened:

    index = PropertyGraphIndex.from_existing(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "C:\Users\xxx\.venv\Lib\site-packages\pymongo\_csot.py", line 108, in csot_wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\.venv\Lib\site-packages\pymongo\collection.py", line 726, in insert_many       
    raise TypeError("documents must be a non-empty list")
TypeError: documents must be a non-empty list
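For context, the error itself comes from pymongo rather than from llama_index: Collection.insert_many rejects an empty list before any write is attempted. A minimal, standalone illustration (the connection string, database, and collection names below are placeholders):

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI; connection is lazy
collection = client["demo_vectordb"]["aaa"]

collection.insert_many([])  # raises TypeError: documents must be a non-empty list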

Version

0.10.40

Steps to Reproduce

see above

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 2 months ago

To resolve the "TypeError: documents must be a non-empty list" error when using PropertyGraphIndex.from_existing, ensure that the nodes parameter is provided as a non-empty list. The from_existing method initializes the nodes parameter as an empty list, which might be causing the error.

Here's the relevant part of the from_existing method:

@classmethod
def from_existing(
    cls: "PropertyGraphIndex",
    property_graph_store: PropertyGraphStore,
    vector_store: Optional[BasePydanticVectorStore] = None,
    # general params
    llm: Optional[BaseLLM] = None,
    kg_extractors: Optional[List[TransformComponent]] = None,
    # vector related params
    use_async: bool = True,
    embed_model: Optional[EmbedType] = None,
    embed_kg_nodes: bool = True,
    # parent class params
    callback_manager: Optional[CallbackManager] = None,
    transformations: Optional[List[TransformComponent]] = None,
    storage_context: Optional[StorageContext] = None,
    show_progress: bool = False,
    **kwargs: Any,
) -> "PropertyGraphIndex":
    """Create an index from an existing property graph store (and optional vector store)."""
    return cls(
        nodes=[],  # no nodes to insert
        property_graph_store=property_graph_store,
        vector_store=vector_store,
        llm=llm,
        kg_extractors=kg_extractors,
        use_async=use_async,
        embed_model=embed_model,
        embed_kg_nodes=embed_kg_nodes,
        callback_manager=callback_manager,
        transformations=transformations,
        storage_context=storage_context,
        show_progress=show_progress,
        **kwargs,
    )

To fix this, you need to ensure that the nodes parameter is a non-empty list when calling the from_existing method. If you don't have any nodes to insert initially, you might need to handle this case separately in your implementation to avoid passing an empty list.

For example:

# Ensure nodes is a non-empty list
nodes = [your_nodes] if your_nodes else [default_node]

# Create the PropertyGraphIndex
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    vector_store=vector_store,
    nodes=nodes,
    llm=llm,
    kg_extractors=[
        ImplicitPathExtractor(llm=llm),
        SimpleLLMPathExtractor(
            llm=llm,
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    embed_kg_nodes=True,
    embed_model=embeddings,
    show_progress=True,
)

Additionally, ensure that your MongoDB client and vector store are correctly set up. Here is an example of how to initialize the AzureCosmosDBMongoDBVectorSearch:

import pymongo
from llama_index.vector_stores.azurecosmosmongo import AzureCosmosDBMongoDBVectorSearch

# Set up the connection string with your Azure CosmosDB MongoDB URI
connection_string = "YOUR_AZURE_COSMOSDB_MONGODB_URI"
mongodb_client = pymongo.MongoClient(connection_string)

# Create an instance of AzureCosmosDBMongoDBVectorSearch
vector_store = AzureCosmosDBMongoDBVectorSearch(
    mongodb_client=mongodb_client,
    db_name="demo_vectordb",
    collection_name="aaa",
)

Ensure that the nodes you are passing to the from_existing method are correctly formatted and contain the necessary embeddings and metadata [1][2][3].

To continue talking to Dosu, mention @dosu.

aganiezgoda commented 2 months ago

@dosu AzureCosmosDBMongoDBVectorSearch seems to be initialized correctly - documents are stored there without issues.

When I add an additional nodes definition (nodes=nodes, referencing MongoDB's collection), so that it looks like this:

# Create the PropertyGraphIndex
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    vector_store=vector_store,
    nodes=nodes,
    llm=llm,
    kg_extractors=[
        ImplicitPathExtractor(llm=llm),
        SimpleLLMPathExtractor(
            llm=llm,
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    embed_kg_nodes=True,
    embed_model=embeddings,
    show_progress=True,
)

...I get:

llama_index.core.indices.property_graph.base.PropertyGraphIndex() got multiple values for keyword argument 'nodes'

dosubot[bot] commented 2 months ago

The error "PropertyGraphIndex() got multiple values for keyword argument 'nodes'" occurs because the nodes argument is being passed both directly and through **kwargs in the from_existing method. The from_existing method does not accept a nodes parameter directly; it initializes nodes internally as an empty list.

Here's how you should call the from_existing method correctly:

# Create the PropertyGraphIndex
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    vector_store=vector_store,
    llm=llm,
    kg_extractors=[
        ImplicitPathExtractor(llm=llm),
        SimpleLLMPathExtractor(
            llm=llm,
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    embed_kg_nodes=True,
    embed_model=embeddings,
    show_progress=True,
)

Ensure that you do not pass the nodes parameter to the from_existing method, as it is not expected and causes the conflict [1].
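The underlying Python mechanism can be shown with a hypothetical stand-in (not llama_index code): when a classmethod hard-codes nodes=[] in its internal constructor call and also forwards **kwargs, supplying nodes yourself means the constructor receives that keyword twice.

class Index:
    def __init__(self, nodes=None, **kwargs):
        self.nodes = nodes

    @classmethod
    def from_existing(cls, **kwargs):
        # 'nodes' is fixed here, so it must not also appear in kwargs
        return cls(nodes=[], **kwargs)

Index.from_existing(nodes=["n1"])
# TypeError: __init__() got multiple values for keyword argument 'nodes'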

To continue talking to Dosu, mention @dosu.

logan-markewich commented 2 months ago

Hmm, I probably need to update the _insert_nodes function to skip if there are no new nodes/documents
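A rough sketch of the kind of guard being described, with a hypothetical method body (the real signature and surrounding logic in PropertyGraphIndex may differ):

def _insert_nodes(self, nodes):
    # Skip the vector-store write entirely when there is nothing new to insert,
    # so an empty list never reaches pymongo's insert_many().
    if not nodes:
        return nodes
    # ... existing embedding + vector store insertion logic ...
    return nodes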

logan-markewich commented 2 months ago

I'll make a release later today with this fix (quite a few other fixes in the pipeline as well). Sorry for the issues, appreciate the patience 🙏🏻