run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Documentation]: Misleading Index Guide (VectorStoreIndex) #16292

Open jbtelice opened 3 weeks ago

jbtelice commented 3 weeks ago

Documentation Issue Description

There are misleading diagrams in the docs, in the VectorStoreIndex section:

(diagram: vector_store)

In this diagram, embeddings are part of the node. But what happens under the hood when you create a VectorStoreIndex from nodes that each already carry a pre-computed embedding?

You would expect each node to keep its embedding, right? That's not the case. Here is some proof:

import torch
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.schema import TextNode

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Disable the LLM; only the embedding model is needed for this repro
Settings.llm = None

embed_model = HuggingFaceEmbedding(
    model_name="hiiamsid/sentence_similarity_spanish_es",
    device=device
)

Settings.embed_model = embed_model

data = ["Y volver, volver, volver. A tus brazos otra vez",
    "Mejico lindo y querido, si muero lejos de tí"
    ]

embed1 = embed_model.get_text_embedding(data[0])
embed2 = embed_model.get_text_embedding(data[1])

node1 = TextNode(text=data[0], metadata={"type":"Ranchera"}, embedding=embed1)
node2 = TextNode(text=data[1], metadata={"type":"Ranchera"}, embedding=embed2)

nodes = [node1, node2]

# Build the index directly from the pre-embedded nodes
index = VectorStoreIndex(nodes=nodes)

print(nodes)
print("---- [vector_store_data] ----")
print(index.vector_store.data.embedding_dict)
print("---- [docstore_data] ----")
print(list(index.docstore.docs.items()))

This is what really happens under the hood:
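Roughly, the prints show the following (a sketch assuming the default in-memory SimpleVectorStore and SimpleDocumentStore; the actual vectors are omitted):

# The vector store holds the embeddings, keyed by node id...
assert node1.node_id in index.vector_store.data.embedding_dict
assert node2.node_id in index.vector_store.data.embedding_dict
# ...while the node copies placed in the docstore have their embedding
# stripped, i.e. the printed docstore nodes show embedding=None
assert index.docstore.docs[node1.node_id].embedding is None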

WHY IS THIS MISLEADING?

Look at the next diagram:

(diagram: vector_store_query)

When you query the index, the diagram implies the vector_store contains all the node information. As the example above shows, that's not the case.
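The same mismatch is visible on the query path. A minimal sketch (the query string and similarity_top_k are illustrative): the retriever looks up the top ids in the vector store and then fetches the nodes back from the docstore, so the node it returns carries no embedding.

retriever = index.as_retriever(similarity_top_k=1)
results = retriever.retrieve("volver a tus brazos")
# The returned node is the docstore copy, whose embedding was stripped
print(results[0].node.embedding)  # None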

Related info:

Fix proposal (in terms of consistency with the docs):

Patch

EDIT: The patch is provisional, but to handle this scenario properly it should perhaps avoid the duplication in some other way.
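Purely as an illustration of what "avoiding the duplication" could look like (my own sketch, not the patch above): if the index is backed by a vector store that declares stores_text=True, the nodes are kept only in the vector store and no docstore copy is made at all. Using Chroma as an example of such a store:

import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

client = chromadb.EphemeralClient()
collection = client.create_collection("rancheras")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# ChromaVectorStore has stores_text=True, so text, metadata, and embeddings
# all live in Chroma; index.docstore.docs stays empty
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)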

Documentation Link

https://docs.llamaindex.ai/en/stable/module_guides/indexing/index_guide/#vector-store-index

dosubot[bot] commented 3 weeks ago

I found a similar closed issue that might be relevant to your concern:

The VectorStoreIndex in the LlamaIndex documentation does not preserve embeddings within each node. Instead, it creates vector embeddings of the text of every node, which are stored separately in the vector store. This discrepancy affects querying because the embeddings are used to facilitate semantic search, not stored within the nodes themselves [2].

To address the misleading documentation, the diagrams should be updated to reflect that embeddings are stored in the vector store and not within each node. This will help clarify the actual data management and querying process. Your proposed patch seems like a step in the right direction to ensure consistency with the documentation.

To continue talking to Dosu, mention @dosu.

jbtelice commented 3 weeks ago

Hi @dosu, this is the issue I've just mentioned :-(

dosubot[bot] commented 3 weeks ago

This is a tough nut to crack; could really use your help on this, @logan-markewich.

To continue talking to Dosu, mention @dosu.

logan-markewich commented 3 weeks ago

@jbtelice story time

jbtelice commented 3 weeks ago

@logan-markewich , yep that's what I thought.