run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: Store-agnostic interface for retrieving document embeddings from vector index #15230

Open namedgraph opened 1 month ago

namedgraph commented 1 month ago

Feature Description

I'm thinking about a simple VectorStore method like:

def get(self, doc_id: str) -> BaseNode: ...  # assuming the returned node contains its embedding?

or something like that.

Reason

I was using this code with VectorStoreIndex and the default vector store:

for doc_id, doc in vector_index.docstore.docs.items():
    embedding = vector_index._vector_store._data.embedding_dict[doc_id]

Using the private _vector_store field was already a code smell, and the approach broke when I switched to FaissVectorStore, which has no _data attribute.

Value of Feature

There are use cases, such as vector index cross-linking, that require iterating over the documents in one index, retrieving their embeddings, and then querying the other index with those embeddings.
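To make the cross-linking use case concrete, here is a minimal sketch in plain Python. It does not use any LlamaIndex API: the two "indexes" are just dicts mapping document ids to embeddings, and `cross_link` and `cosine_similarity` are hypothetical helper names invented for this illustration.

```python
import math
from typing import Dict, List, Tuple

Embedding = List[float]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cross_link(
    source: Dict[str, Embedding],
    target: Dict[str, Embedding],
    top_k: int = 1,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each source document, find the top_k most similar target documents.

    Both arguments are hypothetical stand-ins for "an index whose embeddings
    can be read back by document id", which is the capability requested here.
    """
    links: Dict[str, List[Tuple[str, float]]] = {}
    for doc_id, embedding in source.items():
        # Score every target document against this source embedding.
        scored = [
            (target_id, cosine_similarity(embedding, target_embedding))
            for target_id, target_embedding in target.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        links[doc_id] = scored[:top_k]
    return links


# Toy data: two tiny "indexes" with 2-dimensional embeddings.
index_a = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
index_b = {"b1": [0.9, 0.1], "b2": [0.1, 0.9]}
links = cross_link(index_a, index_b)
```

With a real vector store the inner loop would be replaced by that store's own similarity query; the point of the sketch is only that the outer loop needs a supported way to read each document's embedding.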

logan-markewich commented 1 month ago

There is already a get_nodes method on the base class. That could probably be updated to ensure that it actually returns embeddings (note that this method doesn't work on the default vector store, because the default vector store isn't storing embeddings)

namedgraph commented 1 month ago

@logan-markewich isn't it? What was I getting from the default vector store this way then? It sure looked like embeddings :)

vector_index._vector_store._data.embedding_dict[doc_id]

ensure that it actually returns embeddings

I think this is the crucial part.

SimpleVectorStore.get(text_id) looks like the method I need. Getting the embedding of a single document would be fine. The problem is that this method is not present in the VectorStore superclass, so third-party vector store implementations (like FAISS) do not implement it :/ Isn't that so?
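A rough sketch of what I mean by having the method on the superclass, in plain Python. `EmbeddingLookup`, `InMemoryLookup` and `IncompleteStore` are hypothetical names made up for this example and are not LlamaIndex classes; only the `get(text_id)` shape is borrowed from SimpleVectorStore.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class EmbeddingLookup(ABC):
    """Hypothetical store-agnostic interface (not part of LlamaIndex)."""

    @abstractmethod
    def get(self, text_id: str) -> List[float]:
        """Return the stored embedding for text_id, or raise KeyError."""


class InMemoryLookup(EmbeddingLookup):
    """Toy implementation that keeps embeddings in a dict."""

    def __init__(self) -> None:
        self._embeddings: Dict[str, List[float]] = {}

    def add(self, text_id: str, embedding: List[float]) -> None:
        self._embeddings[text_id] = embedding

    def get(self, text_id: str) -> List[float]:
        return self._embeddings[text_id]


class IncompleteStore(EmbeddingLookup):
    """A store that forgets to implement get(); it cannot be instantiated."""


store = InMemoryLookup()
store.add("doc-1", [0.1, 0.2, 0.3])
embedding = store.get("doc-1")

try:
    IncompleteStore()
    incomplete_rejected = False
except TypeError:
    # ABC refuses to instantiate a subclass with unimplemented abstract methods.
    incomplete_rejected = True
```

Declaring the method abstract means every implementation has to provide it, so calling code can rely on `get` without knowing which store sits behind the index.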

logan-markewich commented 1 month ago

That just returns the embeddings, not the node (in the issue above, it looked like you were describing a method to get nodes + embeddings)

logan-markewich commented 1 month ago

In the latest versions the base class is BasePydanticVectorStore; VectorStore is deprecated (and maybe even removed? I forget)

namedgraph commented 1 month ago

Yes I see BasePydanticVectorStore.get_nodes() now, thanks. But it's not implemented :)

NotImplementedError: get_nodes not implemented
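Until every store implements it, calling code probably has to guard the call. Below is a minimal sketch of that pattern using stand-in classes rather than real LlamaIndex stores: `StoreWithoutGetNodes`, `StoreWithGetNodes` and `fetch_nodes` are hypothetical, and the `get_nodes(node_ids)` signature is simplified for the example.

```python
from typing import Dict, List, Optional


class StoreWithoutGetNodes:
    """Stand-in for a store that inherits an unimplemented get_nodes."""

    def get_nodes(self, node_ids: List[str]) -> List[dict]:
        raise NotImplementedError("get_nodes not implemented")


class StoreWithGetNodes:
    """Stand-in for a store that can return nodes by id."""

    def __init__(self, nodes: Dict[str, dict]) -> None:
        self._nodes = nodes

    def get_nodes(self, node_ids: List[str]) -> List[dict]:
        return [self._nodes[node_id] for node_id in node_ids]


def fetch_nodes(store, node_ids: List[str]) -> Optional[List[dict]]:
    """Return the nodes, or None when the store does not support get_nodes."""
    try:
        return store.get_nodes(node_ids)
    except NotImplementedError:
        # The caller must fall back to some store-specific approach here.
        return None


supported = fetch_nodes(
    StoreWithGetNodes({"n1": {"text": "hello", "embedding": [0.5, 0.5]}}),
    ["n1"],
)
unsupported = fetch_nodes(StoreWithoutGetNodes(), ["n1"])
```

The `None` branch is exactly where store-specific workarounds creep back in, which is why an implemented base-class method would be preferable.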