run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: #14971

Closed oomarish closed 1 week ago

oomarish commented 3 months ago

Question

Hi! I am a new user of llama-index and I appreciate the work done here, but I have a question about the VectorStoreIndex object from llama_index.core.indices.vector_store.base.py. The from_documents function is so abstract that I cannot see the content of the resulting object, by which I mean the vectors created after tokenization and embedding. How do I make it persist in the vector database? I checked for newly inserted data, but there is nothing. Could anyone also explain how the query engine works and how it relates to the index? I found it a bit complicated. Thanks in advance.

dosubot[bot] commented 3 months ago

Hey @OmarD-ctrl! I'm here to assist you with llama-index. I can help you with bug fixes, answer questions, and even guide you on how to become a contributor. Let's tackle this together!

To make the vectors created by the from_documents function in the VectorStoreIndex object persist in the vector database, you need to specify a StorageContext with a persistent vector store. Here is an example using Epsilla:

import logging
import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import SimpleDirectoryReader, Document, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.epsilla import EpsillaVectorStore
from pyepsilla import vectordb

# Initialize Epsilla client
client = vectordb.Client()

# Create vector store and customize storage context
vector_store = EpsillaVectorStore(client=client, db_path="/tmp/llamastore")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Load documents and build index
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

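# Persist the docstore and index metadata to local disk; with an external vector
# store, the embeddings themselves are already written to that store by from_documents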
index.storage_context.persist(persist_dir="<persist_dir>")

You can then avoid re-loading and re-indexing your data by loading the persisted index like this:

from llama_index.core import StorageContext, load_index_from_storage

# Rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="<persist_dir>")

# Load index
index = load_index_from_storage(storage_context)

Alternatively, you can use Chroma as a persistent vector store:

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

def get_vector_store(docset_id) -> ChromaVectorStore:
    db = chromadb.PersistentClient(path="/path/to/chroma_directory")
    chroma_collection = db.get_or_create_collection(docset_id)
    # The embedding model is passed to the index below, not to the vector store
    return ChromaVectorStore(chroma_collection=chroma_collection)

def get_vector_store_index(docset_id, embedding) -> VectorStoreIndex:
    vector_store = get_vector_store(docset_id)
    return VectorStoreIndex.from_vector_store(
        vector_store,
        embed_model=embedding,
    )

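Regarding the query engine and its relation to the index: a query engine is built on top of an index (for example via index.as_query_engine()). It uses the index to retrieve the nodes most relevant to a query and passes them to the LLM to produce a response. In addition, a VectorStoreIndex can retrieve other objects, including query engines. When such an object is retrieved, it is automatically executed with the provided query. Here is an example: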

from llama_index.core import VectorStoreIndex
from llama_index.core.schema import IndexNode

# `other_index` and `nodes` are assumed to exist already
query_engine = other_index.as_query_engine()
obj = IndexNode(
    text="A query engine describing X, Y, and Z.",
    obj=query_engine,
    index_id="my_query_engine",
)

index = VectorStoreIndex(nodes=nodes, objects=[obj])
retriever = index.as_retriever(verbose=True)

In this example, if the index node containing the query engine is retrieved, the query engine will be executed, and the resulting response will be returned as a node [1][2][3][4].
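
To make the relation between the index and the query engine more concrete, here is a toy sketch in plain Python. It is not LlamaIndex code: ToyVectorIndex, ToyQueryEngine, and the bag-of-words embed function are made-up stand-ins. A real setup would use an embedding model, and a real query engine would send the retrieved text to an LLM instead of returning a prompt string.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy embedding: word counts over a fixed vocabulary (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Holds (text chunk, vector) pairs; this is the part a vector database would store."""

    def __init__(self, chunks, vocab):
        self.vocab = vocab
        self.store = [(chunk, embed(chunk, vocab)) for chunk in chunks]

    def retrieve(self, query, top_k=1):
        # Embed the query the same way as the chunks, then rank chunks by similarity
        query_vector = embed(query, self.vocab)
        ranked = sorted(
            self.store, key=lambda item: cosine(query_vector, item[1]), reverse=True
        )
        return [chunk for chunk, _ in ranked[:top_k]]


class ToyQueryEngine:
    """Wraps an index: retrieve relevant chunks, then build the prompt an LLM would see."""

    def __init__(self, index, top_k=1):
        self.index = index
        self.top_k = top_k

    def query(self, question):
        context = self.index.retrieve(question, self.top_k)
        # A real engine would call an LLM here; this sketch just returns the prompt
        return "Context: " + " | ".join(context) + "\nQuestion: " + question


chunks = ["cats purr loudly", "dogs bark loudly"]
vocab = ["cats", "purr", "dogs", "bark", "loudly"]
index = ToyVectorIndex(chunks, vocab)
engine = ToyQueryEngine(index, top_k=1)

print(index.store[0])  # the stored chunk together with its vector
print(engine.query("why do dogs bark"))
```

The point of the sketch is the division of labour: the index owns the vectors and answers "which chunks are closest to this query", while the query engine sits on top of it and turns the retrieved chunks into an answer.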

To continue talking to Dosu, mention @dosu.


oomarish commented 3 months ago

The persist function actually saves the data to the local machine, even though I have already set up pgvector connected to a PostgreSQL database in a Docker container.
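
For the pgvector setup described here, a minimal sketch of pointing the index at PostgreSQL might look like the following, assuming the llama-index-vector-stores-postgres package is installed. The helper name build_pgvector_index, the connection values, and the table name are placeholders, and embed_dim has to match the embedding model in use (1536 is only an example). The imports sit inside the function so the snippet can be loaded without the package present.

```python
def build_pgvector_index(
    documents,
    *,
    host="localhost",        # placeholder connection values, adjust to the container
    port=5432,
    database="vector_db",
    user="postgres",
    password="password",
    table_name="llama_docs",  # hypothetical table name
    embed_dim=1536,           # must equal the embedding model's output dimension
):
    """Embed `documents` and store the vectors in a pgvector table (hypothetical helper)."""
    # Imported lazily so that defining this function does not require the packages
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.postgres import PGVectorStore

    vector_store = PGVectorStore.from_params(
        host=host,
        port=port,
        database=database,
        user=user,
        password=password,
        table_name=table_name,
        embed_dim=embed_dim,
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # The embeddings are inserted into PostgreSQL during this call
    return VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

With a store like this, the vectors should land in PostgreSQL when from_documents runs, so calling storage_context.persist() is not what writes them; that call only saves the remaining local stores to disk. When checking the database for rows, note that, as far as I know, the integration prefixes the table name with data_.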