run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How can I get all nodes from the PGVectorDB? #13363

Open doanaktar opened 4 months ago

doanaktar commented 4 months ago


Question

Hi, I have stored a collection of nodes in a PGVector DB, and I'd like to retrieve all of those nodes so I can use them with the BM25 Retriever. I don't want to re-read the documents every time I want to retrieve with BM25.

I've already tried to get it with:

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
nodes = index.docstore.docs.values()
print("nodes: ", nodes)

nodes: dict_values([])

It returns an empty dict even though the table contains values.

dosubot[bot] commented 4 months ago

To retrieve all nodes from your PGVectorDB for use with a BM25 Retriever, ensure your document store (docstore) is correctly initialized with documents from PGVectorDB. The BM25Retriever can be initialized with this document store using the from_defaults method, which will automatically handle the retrieval of nodes if the docstore parameter is provided.

Here's how you can initialize the BM25Retriever with a document store from PGVectorDB:

from llama_index.retrievers.bm25.base import BM25Retriever
from llama_index.vector_stores.postgres import PGVectorStore

# Assuming you have a PGVectorStore instance
vector_store = PGVectorStore.from_params(
    database="vector_db",
    host="localhost",
    password="password",
    port=5432,
    user="postgres",
    table_name="your_table_name",
    embed_dim=1536  # Dimension of your embeddings
)

# Initialize BM25Retriever with the document store
bm25_retriever = BM25Retriever.from_defaults(docstore=vector_store, tokenizer=None, verbose=True)

Make sure the object you pass as docstore correctly implements the BaseDocumentStore interface, in particular the .docs attribute that returns documents as BaseNode-compatible objects. This allows the BM25Retriever to access and use the documents stored in PGVectorDB for retrieval.
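For what it's worth, PGVectorStore itself does not expose a .docs attribute, so a common workaround is to keep a separate docstore next to the vector store and pass that to BM25Retriever. A minimal sketch, assuming the original nodes are still available in a variable called nodes (a placeholder name):

from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.retrievers.bm25 import BM25Retriever

# At ingestion time: persist the nodes into a docstore alongside PGVectorStore,
# so BM25 does not need to re-read the source documents later.
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
docstore.persist("./docstore.json")

# In later sessions: reload the docstore and build the BM25 retriever from it.
docstore = SimpleDocumentStore.from_persist_path("./docstore.json")
bm25_retriever = BM25Retriever.from_defaults(docstore=docstore, similarity_top_k=5)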

#### Details

**Sources**
- [How to recover the list of nodes I used to create a PGVector vector store](https://github.com/run-llama/llama_index/discussions/11554)
- [How to get nodes once stored in ChromaVectorStore or OpenSerachVectorStore.](https://github.com/run-llama/llama_index/discussions/11058)
- [llama-index-legacy/llama_index/legacy/retrievers/bm25_retriever.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/retrievers/bm25_retriever.py)
- [llama-index-integrations/retrievers/llama-index-retrievers-bm25/llama_index/retrievers/bm25/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/retrievers/llama-index-retrievers-bm25/llama_index/retrievers/bm25/base.py)
- [llama-index-integrations/vector_stores/llama-index-vector-stores-postgres/llama_index/vector_stores/postgres/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/vector_stores/llama-index-vector-stores-postgres/llama_index/vector_stores/postgres/base.py)
- [llama-index-legacy/llama_index/legacy/storage/docstore/keyval_docstore.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/storage/docstore/keyval_docstore.py)
- [llama-index-legacy/llama_index/legacy/indices/vector_store/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/indices/vector_store/base.py)
- [llama-index-core/llama_index/core/storage/docstore/types.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/storage/docstore/types.py)
- [llama-index-core/llama_index/core/vector_stores/types.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/vector_stores/types.py)
- [llama-index-core/llama_index/core/schema.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/schema.py)
- [llama-index-core/llama_index/core/objects/base_node_mapping.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/objects/base_node_mapping.py)
- [llama-index-legacy/llama_index/legacy/storage/docstore/types.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/storage/docstore/types.py)
- [llama-index-core/llama_index/core/storage/storage_context.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/storage/storage_context.py)
- [llama-index-legacy/llama_index/legacy/storage/storage_context.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/storage/storage_context.py)

yzgrfsy commented 4 months ago

@doganaktarr Have you solved this problem? I ran into the same issue when using an ES database: nodes: dict_values([])

alebondarenko commented 3 months ago

My workaround for now is:

from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.vector_stores.postgres import PGVectorStore

# instantiate your PGVectorStore
vector_store = PGVectorStore.from_params(<your params>)

# Fetch e.g. 100 nodes from your DB
# (depending on the query mode you may also need to supply a query_embedding)
query_interim = VectorStoreQuery(query_str="your query", similarity_top_k=100)
nodes = vector_store.query(query=query_interim).nodes

# Pass the retrieved nodes to BM25Retriever, which uses them as its corpus
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=2)
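For completeness, a quick usage sketch on top of that workaround (the query string is just an example):

results = bm25_retriever.retrieve("your query")
for node_with_score in results:
    # NodeWithScore wraps the matched node and its BM25 score
    print(node_with_score.score, node_with_score.node.get_content()[:100])
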
niderhoff commented 1 month ago

hey @alebondarenko, @doganaktarr,

getting all the nodes from PGVectorDB kind of defeats the purpose of using the DB in the first place. If the node list is so small that it can fit into memory, there is really no need to use a Postgres backend.

On the other hand, if the number of nodes is large enough to warrant a Postgres backend, I would not suggest using the BM25 retriever. That approach requires extracting all the nodes from the Postgres DB into a Python object, which takes a long time and needs a lot of memory. I would rather use Postgres's built-in full-text search.

You can achieve that by passing hybrid_search=True to the vector store and using the hybrid query mode on the retriever.

Here is some pseudocode; note the marked arguments:

from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core import VectorStoreIndex

vector_store = PGVectorStore.from_params(
    database="vector_db",
    host="localhost",
    password="password",
    port=5432,
    user="postgres",
    table_name="paul_graham_essay",
    embed_dim=1536,
    hybrid_search=True,  # <---------- enables the full-text-search column alongside the embeddings
)
vector_store_index = VectorStoreIndex.from_vector_store(vector_store)
retriever = vector_store_index.as_retriever(
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,  # <---------- query both vector and full-text indices
)
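For what it's worth, querying the hybrid retriever afterwards is just the standard retrieve call (the query string below is only an example):

# Hybrid retrieval: Postgres runs both the vector search and the full-text
# search, so nothing has to be pulled into memory up front.
results = retriever.retrieve("What did the author do growing up?")
for node_with_score in results:
    print(node_with_score.score, node_with_score.node.node_id)
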
alebondarenko commented 1 month ago

thanks @niderhoff it's worth trying

niderhoff commented 1 month ago

> thanks @niderhoff it's worth trying

One caveat though is that (afaik) Postgres's full-text search does not use any term/document frequency statistics weighting algorithm (like BM25), which might affect results negatively.

alebondarenko commented 1 month ago

Right, that's why the idea was to add some lexical retrieval: basically, to refine the search results using exact term matching.

niderhoff commented 1 month ago

@alebondarenko I have researched this topic a little more and it looks like I (maybe?) need to correct myself:

> It does if you use ts_vector to store the TF, GIN to store the IDF and ts_query to query the data.

via https://stackoverflow.com/a/70455901

The article linked by the Stack Overflow answer's author also states this:

> Postgres has three different concepts for interacting with TF-IDF data. The first is the tsvector type which is used to store the TF values for each document that we have stored. The second is the tsquery type which is used to query for results in the vectors that we have created. The last piece of the puzzle is a GIN index that is used to store the IDF values for the entire table.

via https://codebots.com/crud/How-to-efficiently-search-text-using-Postgres-text-search
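For reference, here is a minimal sketch of that three-piece setup in raw SQL, issued via SQLAlchemy; the table name documents, the column text, and the connection string are hypothetical placeholders:

from sqlalchemy import create_engine, text

# Placeholder connection string; adjust to your own database.
engine = create_engine("postgresql+psycopg2://postgres:password@localhost:5432/vector_db")

with engine.begin() as conn:
    # tsvector column stores the per-document term frequencies (TF); generated columns need PG 12+
    conn.execute(text(
        "ALTER TABLE documents ADD COLUMN IF NOT EXISTS text_tsv tsvector "
        "GENERATED ALWAYS AS (to_tsvector('english', text)) STORED"
    ))
    # GIN index over the tsvector column (the 'IDF' piece in the article's framing)
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS documents_text_tsv_idx ON documents USING GIN (text_tsv)"
    ))
    # tsquery is used to match and rank against the stored vectors
    rows = conn.execute(text(
        "SELECT id, ts_rank(text_tsv, to_tsquery('english', 'search & terms')) AS rank "
        "FROM documents WHERE text_tsv @@ to_tsquery('english', 'search & terms') "
        "ORDER BY rank DESC LIMIT 10"
    )).fetchall()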

And according to the current version of llamaindex all 3 are used in the built-in hybrid search of the PGVectorStore class.


However, it is unclear to me if that is really the case, as the answers on the Stack Overflow post linked above are contradictory. I have also found other sources stating that Postgres full-text search indeed does no TF-IDF weighting or anything similar.

One more thing worth mentioning: I have read that building the BM25 index is very time-consuming for large corpora, so it is not really advisable if we want to do full-text search over a large number of documents.