run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: What is the difference between vector store and storage context in terms of storing embeddings that are generated while creating a vector index? #12571

Open GildeshAbhay opened 3 months ago

GildeshAbhay commented 3 months ago

Question

So I have this code here

    node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_size)
    nodes = node_parser.get_nodes_from_documents([doc])
    service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model,node_parser=node_parser)
    storage_context = StorageContext.from_defaults()
    storage_context.docstore.add_documents(nodes)
    index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)
    index.storage_context.persist(persist_dir=pathname)
    reranker = SentenceTransformerRerank(top_n = rr_metadata)
    retriever = index.as_retriever(similarity_top_k=retrieval_metadata_similarity)
    retriever = AutoMergingRetriever(retriever,index.storage_context,verbose=True)
    response_mode = 'tree_summarize'
    response_synthesizer = get_response_synthesizer(response_mode=response_mode)
    node_postprocessors = [postproc, reranker]
    node_postprocessors = [processor for processor in node_postprocessors if processor is not None]
    query_engine = RetrieverQueryEngine(retriever, node_postprocessors=node_postprocessors)
    summary_whole = query_engine.query(rag_prompt_original)

As you can see, I am using the storage context to persist the embeddings to pathname (a local directory). Now, what should I do if I want to replace this local path with MongoDB? I found the code below in a tutorial; can you confirm whether it is correct?

vector_store = MongoDBAtlasVectorSearch(client, db_name=db, collection_name=collection, index_name="vector_index")
vector_store.add(nodes)
index = VectorStoreIndex.from_vector_store(vector_store)

Also, I can't find any replacement for the "persist" call from the first snippet in the code above. It looks to me like the MongoDB code is just reading indexes that are already saved in MongoDB.

Can someone tell me how to save my embeddings to MongoDB when creating them for the first time, and then read them back from MongoDB the next time?
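For reference, the way I reload the index locally on later runs is roughly this (assuming `pathname` is the same directory that was passed to `persist`):

```python
from llama_index.core import StorageContext, load_index_from_storage

# First run: build the index and persist it locally, as in the snippet above:
#   index.storage_context.persist(persist_dir=pathname)

# Later runs: rebuild the index from the persisted directory instead of
# re-embedding everything.
storage_context = StorageContext.from_defaults(persist_dir=pathname)
index = load_index_from_storage(storage_context)
```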

dosubot[bot] commented 3 months ago

Hey @GildeshAbhay! Great to see you diving into another interesting challenge. How's everything going on your end?

Based on your current setup and the need to replace local storage with MongoDB, you can follow the steps below to implement MongoDB storage for your embeddings:

  1. Install Required Packages: Make sure you have installed all necessary packages, including pymongo for MongoDB interaction, and llama-index-vector-stores-mongodb for integrating MongoDB with Llama Index.

    !pip install -q pymongo llama-index-vector-stores-mongodb
  2. Set Up MongoDB Atlas Cluster: Set up a MongoDB cluster and get your connection string (URI). Create a database and collection within your MongoDB Atlas cluster.

  3. Create a Vector Search Index: Within your MongoDB Atlas collection, create a vector search index to enable efficient retrieval of documents based on vector similarity.

  4. Connect to MongoDB: Use the MongoDB URI to establish a connection to your MongoDB cluster and specify the database and collection you'll be working with.

    import pymongo
    
    mongo_uri = "<Your_MongoDB_URI>"
    client = pymongo.MongoClient(mongo_uri)
    db = client["your_database"]
    collection = db["your_collection"]
  5. Store Embeddings in MongoDB: Use the MongoDBAtlasVectorSearch class from llama-index-vector-stores-mongodb to interact with your MongoDB collection. This class allows you to add documents (with embeddings) to your MongoDB collection.

    from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
    
    vector_store = MongoDBAtlasVectorSearch(
        client,
        db_name="your_database",
        collection_name="your_collection",
        index_name="vector_index",  # Ensure this matches the name of your vector search index
    )
    vector_store.add(nodes)  # Assuming `nodes` contains your documents with embeddings
  6. Retrieve Embeddings: To retrieve embeddings or perform vector similarity searches, you can use the query_engine from the VectorStoreIndex class. This allows you to perform queries against your stored embeddings.

    from llama_index.core import VectorStoreIndex
    
    index = VectorStoreIndex.from_vector_store(vector_store)
    query_engine = index.as_query_engine()
    
    query = "Your search query"
    response = query_engine.query(query)

This approach replaces local storage with MongoDB for storing and retrieving embeddings, leveraging MongoDB's capabilities for handling large-scale vector data efficiently.

Regarding the "persist" functionality, the StorageContext class in LlamaIndex handles the persisting of data through its persist method. This method is designed to save the state of the document store, index store, graph store, and all vector stores to a specified directory. However, in the context of MongoDB, the data is persisted in the database itself, and you don't need to explicitly call a "persist" method. When you add documents to the MongoDB collection using the add method of MongoDBAtlasVectorSearch, the data is immediately saved in the MongoDB database.

As for your question about a built-in method or class for integrating MongoDB as a storage context: LlamaIndex does not provide a MongoDB-specific equivalent of StorageContext.from_defaults(), but the generic StorageContext.from_defaults() accepts a vector_store argument, so you can pass a MongoDBAtlasVectorSearch instance into it. Alternatively, you can set up the MongoDB connection manually and use the MongoDBAtlasVectorSearch class directly, as shown above.

I hope this helps! If you have any further questions, feel free to ask.

GildeshAbhay commented 3 months ago

thanks

also, in my previous code

storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)

why exactly is the storage context needed here if the nodes are already passed in?

Is the storage_context argument in the second line above basically saving the nodes to storage? That is again confusing, because we already have a separate line for that: index.storage_context.persist(persist_dir=pathname)

My reason for asking this follow-up is that in the MongoDB code you just gave, the index is created first and the MongoDB vector_store.add(nodes) comes second, while in my code the storage context is defined first and used while creating the index.

Lastly, if I keep both the storage context code and the MongoDB code, would that be a problem? I would not keep the storage_context.persist line, so I don't think I would be using any unnecessary storage. Isn't that right?

GildeshAbhay commented 3 months ago

Quick follow-up question:

index_name="vector_index", # Ensure this matches the name of your vector search index

So, my index code is index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context). Is that why the index name here would be "index"?

Thus the mongo code would be this

vector_store = MongoDBAtlasVectorSearch(
    client,
    db_name="your_database",
    collection_name="your_collection",
    index_name="index",  # Ensure this matches the name of your vector search index
)

Right?
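A note on `index_name`: it refers to the name of the vector search index defined on the collection in MongoDB Atlas, not to the Python variable name. A sketch of what that Atlas-side index definition typically looks like (the field names and dimensions here are assumptions; `numDimensions` must match your embedding model, e.g. 1536 for OpenAI's text-embedding-ada-002):

```python
# Hypothetical Atlas vector search index definition, created in the Atlas UI
# or via the Atlas API under a name such as "vector_index". It is this name,
# not the Python variable, that index_name must match.
vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",    # field where the embeddings are stored
            "numDimensions": 1536,  # must match your embedding model
            "similarity": "cosine",
        }
    ]
}
```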

logan-markewich commented 3 months ago

@GildeshAbhay Just to clarify: if you are using MongoDB Atlas (or any vector db integration), all the data is stored in the DB. You don't need a docstore or anything else unless you want key-value access to the nodes.

This also means there's no persist or loading step; this all happens automatically.

For example, here's a full walkthrough

# Creating the index
vector_store = MongoDBAtlasVectorSearch(
    client,
    db_name="your_database",
    collection_name="your_collection",
    index_name="index",  # Ensure this matches the name of your vector search index
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# this will chunk and embed your nodes, and put them in MongoDB
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, ...)

# you can also pre-chunk the nodes, and pass those in. This will not apply chunking, it will just embed
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context, ...)

# then to "load" an existing index, just use from_vector_store
index = VectorStoreIndex.from_vector_store(vector_store)
GildeshAbhay commented 3 months ago

Which field exactly are the embeddings stored in, in the collection?
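With the default MongoDBAtlasVectorSearch settings, the embedding is stored under the `embedding` key, alongside the node text and metadata. One way to check (a sketch; the connection details are placeholders and the field names assume the defaults):

```python
import pymongo

# Hypothetical connection details; reuse the client/db/collection from above.
client = pymongo.MongoClient("<Your_MongoDB_URI>")
collection = client["your_database"]["your_collection"]

# Fetch one stored node and inspect its fields. With the default
# MongoDBAtlasVectorSearch settings, the vector lives under "embedding".
doc = collection.find_one()
print(list(doc.keys()))       # e.g. includes '_id', 'embedding', 'text', ...
print(len(doc["embedding"]))  # dimensionality of the stored embedding
```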

GildeshAbhay commented 3 months ago

If I want to "see" the embeddings in the MongoDB collection, would I have to perform the step below?

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

and then

vector_store.add(nodes)
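For what it's worth, both paths lead to embeddings in the collection: vector_store.add(nodes) expects nodes that already carry embeddings, as in the loop above, whereas going through a StorageContext lets VectorStoreIndex compute them for you. A sketch of the two options, reusing the names from earlier in the thread (embed_model, nodes, and vector_store are assumed to be set up as above):

```python
from llama_index.core import StorageContext, VectorStoreIndex

# Option 1: manual. Compute each node's embedding yourself, then add the
# nodes (now carrying .embedding) directly to the vector store.
for node in nodes:
    node.embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
vector_store.add(nodes)

# Option 2: automatic. Attach the vector store via the storage context and
# let VectorStoreIndex embed the nodes as it builds the index.
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)
```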