run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: What information is stored in VectorStoreIndex, where can I view this information, and why does print(index.index_struct) only have index_id #15009

Open haozhuoyuan opened 1 month ago

haozhuoyuan commented 1 month ago

Question

from weaviate.classes.config import Configure, Property, DataType
import time
import weaviate
import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.readers.file import PyMuPDFReader
from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.chat_engine import CondenseQuestionChatEngine
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from dotenv import load_dotenv, find_dotenv

# Load environment variables
_ = load_dotenv(find_dotenv())
import nest_asyncio
nest_asyncio.apply()  # Only needed in a Jupyter notebook environment; otherwise it raises an error

# Connect to local Weaviate (requires Docker to be running)
weaviate_client = weaviate.connect_to_local(host="localhost", port=8080, grpc_port=50051, skip_init_checks=True)
# weaviate_client = weaviate.connect_to_local(host="localhost", port=8080, grpc_port=50051)
# Set the global LLM and embedding model
Settings.llm = OpenAI(temperature=0, model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=512)
# Settings.embed_model = HuggingFaceEmbedding(model_name=r"C:\Users\79033\.cache\huggingface\hub\models--BAAI--bge-m3\snapshots\5617a9f61b028005a4858fdac845db406aefb181", )

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=100)

# Load PDF documents
documents = SimpleDirectoryReader("./data1", file_extractor={".pdf": PyMuPDFReader()}).load_data()

# Split into nodes
nodes = splitter.get_nodes_from_documents(documents)

# Create the vector store
# vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
vector_store = WeaviateVectorStore(weaviate_client=weaviate_client, index_name="LlamaIndex1")
# Use the vector store for the index
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes, storage_context=storage_context
)
print(index.index_struct)

query_engine = index.as_query_engine()
while True:
    question = input("User:")
    if question.strip() == "":
        break
    start_time = time.time()
    response = query_engine.query(question)
    for node in response.source_nodes:
        print(f"Text: {node.node.text}")
        print(f"Score: {node.score}")
        print(f"Metadata: {node.node.metadata}")
        print("---")
    end_time = time.time()
    print(end_time-start_time)
    print(f"AI: {response}")
dosubot[bot] commented 1 month ago

The VectorStoreIndex stores embeddings for the input text chunks, along with metadata such as document IDs and node IDs. This information can be viewed by querying the index or by inspecting the underlying vector store directly.

The command print(index.index_struct) only outputs index_id because index_struct is a minimal representation of the index structure, primarily used for identification purposes. The detailed embeddings and other stored information are managed by the vector store backend and are not directly exposed through index_struct.

To view the detailed information, you would typically interact with the vector store directly or use the querying capabilities provided by VectorStoreIndex.

Example of querying the index:

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
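
For inspecting the Weaviate store directly, here is a hedged sketch (it assumes the "LlamaIndex1" collection name from the question and the weaviate-client v4 collection iterator; the helper name is illustrative):

```python
# Sketch: list what LlamaIndex stored in the Weaviate collection directly.
def summarize_objects(objects):
    """Collect (uuid, text) pairs from Weaviate-style objects for inspection."""
    return [(str(obj.uuid), obj.properties.get("text", "")) for obj in objects]

# With the live client from the question (weaviate-client v4):
# collection = weaviate_client.collections.get("LlamaIndex1")
# for uuid, text in summarize_objects(collection.iterator()):
#     print(uuid, text[:80])
```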

Additionally, the index_struct can contain more than just the index_id. It includes a nodes_dict mapping node IDs to their entries in the vector store, a doc_id_dict mapping document IDs to their associated node IDs, and an embeddings_dict (empty when the embeddings themselves live in an external vector store, as in your setup). However, these details are not printed by default with print(index.index_struct).

Here is an example of the structure:

{
  "index_struct": {
    "__type__": "simple_dict",
    "__data__": {
      "index_id": "9343c0c6-8313-4549-a423-1348960b6958",
      "summary": null,
      "nodes_dict": {
        "ddb890e7-82d5-4dc5-a950-b498c3ef2494": "ddb890e7-82d5-4dc5-a950-b498c3ef2494",
        "5a03268d-b654-48a4-b3ea-b9620a5ec614": "5a03268d-b654-48a4-b3ea-b9620a5ec614",
        "8d29350a-a932-4665-b436-a810e66def48": "8d29350a-a932-4665-b436-a810e66def48",
        "6b151549-3bcc-4e11-a181-c1ac4dd34c8a": "6b151549-3bcc-4e11-a181-c1ac4dd34c8a",
        "e2ac2af1-b59f-4c83-b365-f0205ba1f732": "e2ac2af1-b59f-4c83-b365-f0205ba1f732",
        "307b2feb-5b0a-4b86-aec7-994074c94d29": "307b2feb-5b0a-4b86-aec7-994074c94d29"
      },
      "doc_id_dict": {
        "707af1fd-62af-4693-93e1-01433b900aca": [
          "ddb890e7-82d5-4dc5-a950-b498c3ef2494",
          "5a03268d-b654-48a4-b3ea-b9620a5ec614",
          "8d29350a-a932-4665-b436-a810e66def48",
          "6b151549-3bcc-4e11-a181-c1ac4dd34c8a",
          "e2ac2af1-b59f-4c83-b365-f0205ba1f732",
          "307b2feb-5b0a-4b86-aec7-994074c94d29"
        ]
      },
      "embeddings_dict": {}
    }
  }
}
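
To illustrate how these mappings relate, here is a small helper operating on the __data__ portion of a dump shaped like the example above (the function name is illustrative, not a LlamaIndex API):

```python
# Illustrative helper: given the __data__ portion of an index_struct dump,
# list the node IDs registered for one document.
def node_ids_for_doc(struct_data: dict, doc_id: str) -> list:
    return struct_data.get("doc_id_dict", {}).get(doc_id, [])

# Trimmed version of the example structure above.
example_data = {
    "index_id": "9343c0c6-8313-4549-a423-1348960b6958",
    "summary": None,
    "nodes_dict": {
        "ddb890e7-82d5-4dc5-a950-b498c3ef2494": "ddb890e7-82d5-4dc5-a950-b498c3ef2494",
    },
    "doc_id_dict": {
        "707af1fd-62af-4693-93e1-01433b900aca": [
            "ddb890e7-82d5-4dc5-a950-b498c3ef2494",
        ],
    },
    "embeddings_dict": {},
}

print(node_ids_for_doc(example_data, "707af1fd-62af-4693-93e1-01433b900aca"))
```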

The VectorStoreIndex class in the LlamaIndex framework is designed to work with an existing vector store, and it includes methods for embedding nodes, adding nodes to the index, and querying the index. The detailed information about the nodes and their embeddings is managed by the vector store and can be accessed through the vector store's API or by using the provided querying methods [1][2][3].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 month ago

Storage is split between an index struct, docstore, and vector store. This is all contained in the storage context.

If you use a vector db integration, only the vector store is used (everything is stored there) to simplify storage.
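
A hedged illustration of that split, encoded as data (component descriptions are paraphrased from the explanation above, not an official schema):

```python
# Which storage component holds what, for the default in-memory setup
# versus a vector db integration such as WeaviateVectorStore.
STORAGE_SPLIT = {
    "default": {
        "docstore": "node_id -> node text and metadata",
        "index_store": "index_struct (index_id, nodes_dict, doc_id_dict)",
        "vector_store": "node_id -> embedding",
    },
    "vector_db_integration": {
        "docstore": "unused (text and metadata live in the vector db)",
        "index_store": "minimal index_struct (little more than index_id)",
        "vector_store": "node_id -> embedding + text + metadata",
    },
}

for setup, parts in STORAGE_SPLIT.items():
    print(setup)
    for component, holds in parts.items():
        print(f"  {component}: {holds}")
```

This is why print(index.index_struct) shows so little in your setup: with Weaviate attached, the index struct is intentionally minimal.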

haozhuoyuan commented 1 month ago

@logan-markewich Through the vector store, if I want to delete some specific nodes using the index.delete_nodes() method, how can I get the corresponding node IDs? I hope you can help me answer this as well.
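
One hedged sketch of an approach (not an authoritative answer: it assumes the nodes to delete can be located by retrieval, and uses the standard node_id attribute on retrieved source nodes; the helper name is illustrative):

```python
# Sketch: locate nodes via retrieval, collect their IDs, then delete them.
def collect_node_ids(source_nodes):
    """Pull node IDs out of NodeWithScore-style retrieval results."""
    return [item.node.node_id for item in source_nodes]

# With the live `index` from the original snippet:
# retriever = index.as_retriever(similarity_top_k=5)
# results = retriever.retrieve("text that identifies the nodes to remove")
# index.delete_nodes(collect_node_ids(results))
```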