run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Documentation]: Misleading Index Guide (VectorStoreIndex) #16292

Open jbtelice opened 3 weeks ago

jbtelice commented 3 weeks ago

Documentation Issue Description

There are misleading diagrams in the docs, in the VectorStoreIndex section:

(diagram: vector_store)

In this diagram, embeddings are part of the node. But what happens under the hood when you create a VectorStoreIndex from nodes that each already carry a pre-computed embedding?

You would expect each node to keep its embedding, right? That's not the case. Here is some proof:

import torch
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.schema import TextNode

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Disable the LLM; only the embedding model is needed for this repro
Settings.llm = None

embed_model = HuggingFaceEmbedding(
    model_name="hiiamsid/sentence_similarity_spanish_es",
    device=device
)

Settings.embed_model = embed_model

data = ["Y volver, volver, volver. A tus brazos otra vez",
    "Mejico lindo y querido, si muero lejos de tí"
    ]

embed1 = embed_model.get_text_embedding(data[0])
embed2 = embed_model.get_text_embedding(data[1])

node1 = TextNode(text=data[0], metadata={"type":"Ranchera"}, embedding=embed1)
node2 = TextNode(text=data[1], metadata={"type":"Ranchera"}, embedding=embed2)

nodes = [node1, node2]

# Build the index directly from the pre-embedded nodes
index = VectorStoreIndex(nodes=nodes)

print(nodes)
print("---- [vector_store_data] ----")
print(index.vector_store.data.embedding_dict)
print("---- [docstore_data] ----")
print(list(index.docstore.docs.items()))

This is what really happens under the hood:
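Roughly, the prints show the following (a sketch assuming the default in-memory SimpleVectorStore and SimpleDocumentStore; the actual vectors are omitted):

# The vector store holds the embeddings, keyed by node id...
assert node1.node_id in index.vector_store.data.embedding_dict
assert node2.node_id in index.vector_store.data.embedding_dict
# ...while the node copies placed in the docstore have their embedding
# stripped, i.e. the printed docstore nodes show embedding=None
assert index.docstore.docs[node1.node_id].embedding is None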

WHY IS THIS MISLEADING?

Look at the next diagram:

(diagram: vector_store_query)

When you query the index, the diagram implies the vector_store contains all the node information. As the example above shows, that's not the case.
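The same mismatch is visible on the query path. A minimal sketch (the query string and similarity_top_k are illustrative): the retriever looks up the top ids in the vector store and then fetches the nodes back from the docstore, so the node it returns carries no embedding.

retriever = index.as_retriever(similarity_top_k=1)
results = retriever.retrieve("volver a tus brazos")
# The returned node is the docstore copy, whose embedding was stripped
print(results[0].node.embedding)  # None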

Related info:

Fix proposal (in terms of consistency with the docs):

Patch

EDIT: The patch is provisional, but to handle this scenario properly it should perhaps avoid the duplication in some other way.
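Purely as an illustration of what "avoiding the duplication" could look like (my own sketch, not the patch above): if the index is backed by a vector store that declares stores_text=True, the nodes are kept only in the vector store and no docstore copy is made at all. Using Chroma as an example of such a store:

import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

client = chromadb.EphemeralClient()
collection = client.create_collection("rancheras")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# ChromaVectorStore has stores_text=True, so text, metadata, and embeddings
# all live in Chroma; index.docstore.docs stays empty
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)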

Documentation Link

https://docs.llamaindex.ai/en/stable/module_guides/indexing/index_guide/#vector-store-index

dosubot[bot] commented 3 weeks ago

I found a similar closed issue that might be relevant to your concern:

The VectorStoreIndex in the LlamaIndex documentation does not preserve embeddings within each node. Instead, it creates vector embeddings of the text of every node, which are stored separately in the vector store. This discrepancy affects querying because the embeddings are used to facilitate semantic search, not stored within the nodes themselves [2].

To address the misleading documentation, the diagrams should be updated to reflect that embeddings are stored in the vector store and not within each node. This will help clarify the actual data management and querying process. Your proposed patch seems like a step in the right direction to ensure consistency with the documentation.

To continue talking to Dosu, mention @dosu.

jbtelice commented 3 weeks ago

Hi @dosu, this is the issue I've just mentioned :-(

dosubot[bot] commented 3 weeks ago

This is a tough nut to crack; could really use your help on this, @logan-markewich.

To continue talking to Dosu, mention @dosu.

logan-markewich commented 3 weeks ago

@jbtelice story time

jbtelice commented 3 weeks ago

@logan-markewich , yep that's what I thought.