GildeshAbhay opened this issue 3 months ago
Hey @GildeshAbhay! Great to see you diving into another interesting challenge. How's everything going on your end?
Based on your current setup and the need to replace local storage with MongoDB, you can follow the steps below to implement MongoDB storage for your embeddings:
Install Required Packages: Make sure you have installed all necessary packages, including `pymongo` for MongoDB interaction and `llama-index-vector-stores-mongodb` for integrating MongoDB with LlamaIndex.

```shell
!pip install -q pymongo llama-index-vector-stores-mongodb
```
Set Up a MongoDB Atlas Cluster: Create a cluster and get your connection string (URI), then create a database and collection within it.
Create a Vector Search Index: Within your MongoDB Atlas collection, create a vector search index to enable efficient retrieval of documents based on vector similarity.
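As a concrete illustration of the index step, here is a sketch of an Atlas Vector Search index definition written as a Python dict (the JSON equivalent is what you paste into the Atlas UI). The field name `embedding` matches the `MongoDBAtlasVectorSearch` default, and the 1536 dimension count assumes an OpenAI-style embedding model; adjust both to your setup:

```python
# Sketch of an Atlas Vector Search index definition (assumptions noted below).
# "embedding" is the default field MongoDBAtlasVectorSearch writes vectors to;
# numDimensions=1536 assumes an OpenAI-style embedding model -- change it to
# match the output size of whatever embedding model you actually use.
vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",     # field holding the vector in each document
            "numDimensions": 1536,   # must equal your embedding model's output size
            "similarity": "cosine",  # or "euclidean" / "dotProduct"
        }
    ]
}
```

On recent pymongo versions you can also create the index programmatically via `collection.create_search_index`; otherwise, define it once through the Atlas UI.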
Connect to MongoDB: Use the MongoDB URI to establish a connection to your MongoDB cluster and specify the database and collection you'll be working with.
```python
import pymongo

mongo_uri = "<Your_MongoDB_URI>"
client = pymongo.MongoClient(mongo_uri)
db = client["your_database"]
collection = db["your_collection"]
```
Store Embeddings in MongoDB: Use the `MongoDBAtlasVectorSearch` class from `llama-index-vector-stores-mongodb` to interact with your MongoDB collection. This class allows you to add documents (with embeddings) to your MongoDB collection.
```python
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

vector_store = MongoDBAtlasVectorSearch(
    client,
    db_name="your_database",
    collection_name="your_collection",
    index_name="vector_index",  # Ensure this matches the name of your vector search index
)
vector_store.add(nodes)  # Assuming `nodes` contains your documents with embeddings
```
Retrieve Embeddings: To retrieve embeddings or perform vector similarity searches, use a query engine built from the `VectorStoreIndex` class. This allows you to run queries against your stored embeddings.
```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()

query = "Your search query"
response = query_engine.query(query)
```
This approach replaces local storage with MongoDB for storing and retrieving embeddings, leveraging MongoDB's capabilities for handling large-scale vector data efficiently. You can refer to this example for more details.
Regarding the "persist" functionality, the `StorageContext` class in LlamaIndex handles persisting data through its `persist` method, which saves the state of the document store, index store, graph store, and all vector stores to a specified directory. With MongoDB, however, the data is persisted in the database itself, so you don't need to call a "persist" method explicitly: when you add documents to the collection using the `add` method of `MongoDBAtlasVectorSearch`, the data is immediately saved in the MongoDB database.
As for your question about a built-in method or class for integrating MongoDB as a storage context, similar to the `StorageContext.from_defaults()` method used for local storage: based on the provided context, LlamaIndex does not provide such a method or class specifically for MongoDB. You would need to manually set up the MongoDB connection and use the `MongoDBAtlasVectorSearch` class to interact with your MongoDB collection.
I hope this helps! If you have any further questions, feel free to ask.
thanks
also, in my previous code
```python
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)
```
why exactly is the storage context used here if nodes are already passed in?
is this argument, storage_context in the 2nd line above, basically saving the nodes to storage? Which is again confusing, as we have the below line for that:

```python
index.storage_context.persist(persist_dir=pathname)
```
my point for asking this follow-up question is this: in the mongodb code you just gave, the index is created first and the mongodb `vector_store.add(nodes)` call occurs second, while in my code, the storage context is defined first and used while creating the indexes.
Lastly, if I keep the storage context code as well as the mongodb code, would it be a problem? I would not keep the storage.persist line, so I don't think I would be using any unnecessary storage. Isn't it?
quick follow up question,
```python
index_name="vector_index",  # Ensure this matches the name of your vector search index
```
So, my index code is this
```python
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)
```
that's why the index name here would be "index"?
Thus the mongo code would be this
```python
vector_store = MongoDBAtlasVectorSearch(
    client,
    db_name="your_database",
    collection_name="your_collection",
    index_name="index",  # Ensure this matches the name of your vector search index
)
```
Right?
@GildeshAbhay Just to clarify: if you are using MongoDB Atlas (or any vector store integration), all the data is stored in the DB; you don't need a docstore or anything else unless you want key-value access to the nodes.
This also means there's no persist or loading, this all happens automatically.
For example, here's a full walkthrough:

```python
from llama_index.core import StorageContext, VectorStoreIndex

# Creating the index
vector_store = MongoDBAtlasVectorSearch(
    client,
    db_name="your_database",
    collection_name="your_collection",
    index_name="index",  # Ensure this matches the name of your vector search index
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# this will chunk and embed your documents, and put them in MongoDB
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, ...)

# you can also pre-chunk the nodes and pass those in. This will not apply chunking, it will just embed
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context, ...)

# then to "load" an existing index, just use from_vector_store
index = VectorStoreIndex.from_vector_store(vector_store)
```
which field exactly are the embeddings stored in within the collection?
If I want to "see" the embeddings in the mongodb collection, would I have to perform the below step?
```python
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
```

and then

```python
vector_store.add(nodes)
```
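For reference, `MongoDBAtlasVectorSearch` stores the vector in a field named `embedding` by default (configurable via its `embedding_key` argument). A stored document looks roughly like this sketch; the exact metadata layout varies by library version, and the values here are made up:

```python
# Rough shape of one document as written to the collection (illustrative only).
# "embedding" is the library's default embedding_key; "text" holds the chunk content.
stored_doc = {
    "id": "node-uuid",                      # hypothetical node id
    "text": "the chunk text",
    "embedding": [0.013, -0.024, 0.007],    # truncated; real vectors have e.g. 1536 floats
    "metadata": {"source": "example.pdf"},  # hypothetical metadata entry
}
```

To eyeball a real document, `collection.find_one()` works. Note the manual embedding loop above is only needed if you call `vector_store.add(nodes)` directly; building a `VectorStoreIndex` with a storage context embeds the nodes for you.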
Question Validation
Question
So I have this code here
As can be seen, I am using the storage context to store the embeddings in pathname (which is on my local machine). Now, what should I do if I want to replace this local path with mongodb? I found the below code via a tutorial; can you confirm if it is correct?
Also, I can't find any replacement for the "persist" call from the topmost code in the code above. Somehow, I feel the mongodb code is just reading the indexes that are already saved in mongodb.
Can someone tell me how to save my embeddings to mongodb while creating them for the first time, and then read them from mongodb the next time?