
[Question]: PGVectorStore: how to implement text search on metadata? #13743

Closed · coffepowered closed this issue 3 days ago

coffepowered commented 3 months ago

Question

Hi, I know that PGVectorStore supports both embedding-based and full-text search, and that the two can be combined using a fusion retriever to achieve whatever flavour of hybrid search one is interested in.
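
For reference, that fusion pattern can look something like the sketch below, assuming an existing VectorStoreIndex named index backed by a PGVectorStore created with hybrid_search=True (variable names are illustrative):

    from llama_index.core.retrievers import QueryFusionRetriever

    # One retriever per search mode over the same hybrid-enabled index.
    vector_retriever = index.as_retriever(
        vector_store_query_mode="default", similarity_top_k=5
    )
    text_retriever = index.as_retriever(
        vector_store_query_mode="sparse", similarity_top_k=5
    )

    # Fuse the two ranked lists with reciprocal rank fusion.
    retriever = QueryFusionRetriever(
        [vector_retriever, text_retriever],
        similarity_top_k=5,
        num_queries=1,  # disable LLM query generation; fuse the raw query only
        mode="reciprocal_rerank",
        use_async=False,
    )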

I see, however, that Apache Lucene and other engines support combining embedding retrieval with full-text search on any field (metadata) one is willing to index.

For instance: suppose one wants to do embedding search on the node plus full-text search on its "title" metadata. Do you think this is achievable at the moment? If not, what would be the best pattern to implement it?

Perhaps subclassing PGVectorStore + creating a custom retriever?

Thanks in advance

P.S. tagging @juleskuehn, as I feel he may be an expert on this :)

dosubot[bot] commented 3 months ago

To implement text search on metadata in PGVectorStore, you can indeed subclass PGVectorStore and create a custom retriever. Here are the steps to achieve this:

  1. Subclass PGVectorStore:

    from llama_index.vector_stores.postgres import PGVectorStore

    class CustomPGVectorStore(PGVectorStore):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
  2. Override the _hybrid_query method to include full-text search on metadata fields (the sparse-query override that actually targets a metadata field is sketched after these steps):

    import logging
    from typing import Any, List

    from llama_index.core.vector_stores.types import VectorStoreQuery
    from llama_index.vector_stores.postgres import PGVectorStore
    from llama_index.vector_stores.postgres.base import DBEmbeddingRow

    logger = logging.getLogger(__name__)

    class CustomPGVectorStore(PGVectorStore):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

        def _hybrid_query(
            self, query: VectorStoreQuery, **kwargs: Any
        ) -> List[DBEmbeddingRow]:
            if query.alpha is not None:
                logger.warning(
                    "postgres hybrid search does not support alpha parameter."
                )

            sparse_top_k = query.sparse_top_k or query.similarity_top_k

            # Dense (embedding-based) search over the node text.
            dense_results = self._query_with_score(
                query.query_embedding,
                query.similarity_top_k,
                query.filters,
                **kwargs,
            )

            # Sparse (full-text) search; override _sparse_query_with_rank
            # to target a metadata field instead of the node text.
            sparse_results = self._sparse_query_with_rank(
                query.query_str, sparse_top_k, query.filters
            )

            # Combine and deduplicate results.
            all_results = dense_results + sparse_results
            return self._dedup_results(all_results)
  3. Override the _async_hybrid_query method for asynchronous support:

    import asyncio
    import logging
    from typing import Any, List

    from llama_index.core.vector_stores.types import VectorStoreQuery
    from llama_index.vector_stores.postgres import PGVectorStore
    from llama_index.vector_stores.postgres.base import DBEmbeddingRow

    logger = logging.getLogger(__name__)

    class CustomPGVectorStore(PGVectorStore):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

        async def _async_hybrid_query(
            self, query: VectorStoreQuery, **kwargs: Any
        ) -> List[DBEmbeddingRow]:
            if query.alpha is not None:
                logger.warning(
                    "postgres hybrid search does not support alpha parameter."
                )

            sparse_top_k = query.sparse_top_k or query.similarity_top_k

            # Run the dense and sparse searches concurrently.
            results = await asyncio.gather(
                self._aquery_with_score(
                    query.query_embedding,
                    query.similarity_top_k,
                    query.filters,
                    **kwargs,
                ),
                self._async_sparse_query_with_rank(
                    query.query_str, sparse_top_k, query.filters
                ),
            )

            dense_results, sparse_results = results
            all_results = dense_results + sparse_results
            return self._dedup_results(all_results)
  4. Use the custom subclass. Note that hybrid_search=True is required; otherwise the hybrid query path (and the overrides above) will never run:

    custom_vector_store = CustomPGVectorStore.from_params(
        database="vector_db",
        host="localhost",
        password="password",
        port=5432,
        user="postgres",
        table_name="paul_graham_essay",
        embed_dim=1536,  # openai embedding dimension
        hybrid_search=True,
        text_search_config="english",
    )

By following these steps, you can implement text search on metadata fields in PGVectorStore and combine it with embedding search for a hybrid search solution [1][2][3].
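
Note that the overrides above essentially mirror the store's default hybrid path; the piece that actually has to change for metadata search is the sparse (full-text) query. Below is a minimal sketch of that override, assuming the default PGVectorStore schema (a JSONB metadata_ column on the generated data_<table_name> table) and a hypothetical "title" metadata key. It leans on private attributes (_table_class, _session, text_search_config), so treat it as a starting point rather than a stable API:

    import sqlalchemy

    from llama_index.vector_stores.postgres import PGVectorStore
    from llama_index.vector_stores.postgres.base import DBEmbeddingRow

    class CustomPGVectorStore(PGVectorStore):
        def _sparse_query_with_rank(
            self, query_str=None, limit=10, metadata_filters=None
        ):
            # Rank rows by full-text match against the "title" metadata key
            # instead of the node text. Metadata filters are omitted here
            # for brevity.
            ts_query = sqlalchemy.func.plainto_tsquery(
                self.text_search_config, query_str
            )
            title_tsv = sqlalchemy.func.to_tsvector(
                self.text_search_config,
                sqlalchemy.literal_column("metadata_->>'title'"),
            )
            rank = sqlalchemy.func.ts_rank(title_tsv, ts_query).label("rank")
            stmt = (
                sqlalchemy.select(self._table_class, rank)
                .where(title_tsv.op("@@")(ts_query))
                .order_by(rank.desc())
                .limit(limit)
            )
            with self._session() as session, session.begin():
                rows = session.execute(stmt).all()
            return [
                DBEmbeddingRow(
                    node_id=row[0].node_id,
                    text=row[0].text,
                    metadata=row[0].metadata_,
                    similarity=row[1],
                )
                for row in rows
            ]

Because the dense path is untouched, hybrid-mode queries then combine embedding search over the node text with full-text search over the title, which is the combination asked about above.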


juleskuehn commented 3 months ago

Thanks for tagging me. I'm looking into this myself right now as well.

I think you have the right idea. There is no such functionality available in the PGVectorStore class, so you have to build it:

> subclassing PGVectorStore + creating a custom retriever

I might start by making a copy of vector_stores/postgres/base.py rather than subclassing, because this will require modifying a bunch of things, e.g. the SQLAlchemy data model. You'll need to create an index for the metadata fields you want to full-text search.
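
For the JSON-metadata route, that index can be an expression index over the extracted key. A sketch of the one-off setup, assuming the default data_<table_name> naming, the JSONB metadata_ column, and a hypothetical title key:

    import sqlalchemy

    engine = sqlalchemy.create_engine(
        "postgresql://postgres:password@localhost:5432/vector_db"
    )
    with engine.begin() as conn:
        # GIN expression index so full-text search over the "title"
        # metadata key does not have to scan and re-parse every row.
        conn.execute(sqlalchemy.text(
            "CREATE INDEX IF NOT EXISTS idx_metadata_title_fts "
            "ON data_paul_graham_essay "
            "USING gin (to_tsvector('english', metadata_->>'title'))"
        ))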

I would prefer to have a separate, actual text column for each metadata field I wanted to search, which is more structure than the current approach, where all metadata lives in a JSON field (I believe this decision was driven by the desire for LlamaIndex compatibility with multiple vector store backends).
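
For what it's worth, a sketch of that more structured layout, as a hypothetical SQLAlchemy model (not the one LlamaIndex generates), with a generated tsvector column so Postgres keeps the index input in sync:

    import sqlalchemy
    from sqlalchemy.dialects.postgresql import TSVECTOR
    from sqlalchemy.orm import declarative_base
    from pgvector.sqlalchemy import Vector

    Base = declarative_base()

    class EssayNode(Base):
        # Dedicated, indexed title column instead of burying the title
        # inside the JSONB metadata blob.
        __tablename__ = "essay_nodes"
        id = sqlalchemy.Column(sqlalchemy.BigInteger, primary_key=True)
        node_id = sqlalchemy.Column(sqlalchemy.VARCHAR)
        text = sqlalchemy.Column(sqlalchemy.TEXT)
        title = sqlalchemy.Column(sqlalchemy.TEXT)
        metadata_ = sqlalchemy.Column(sqlalchemy.JSON)
        embedding = sqlalchemy.Column(Vector(1536))
        # Generated column maintained by Postgres itself (12+).
        title_tsv = sqlalchemy.Column(
            TSVECTOR,
            sqlalchemy.Computed(
                "to_tsvector('english', coalesce(title, ''))",
                persisted=True,
            ),
        )

    sqlalchemy.Index(
        "idx_essay_title_tsv", EssayNode.title_tsv, postgresql_using="gin"
    )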

Let me know what you come up with!

coffepowered commented 3 months ago

Thanks for your answer. I am actually wondering whether this is the correct approach (in principle). In an ideal scenario, I'd like to offload this sort of thing to the vector DB itself, since the retrieval could get tricky quite quickly.

However, the combination of Postgres extensions that would be ideal in this context (IMHO: pgvector + pg_search) is unavailable on the cloud vendor I am using due to licensing issues.

If I decide to do some tests, I'll share them here since you are also interested!

juleskuehn commented 3 months ago

There are a bunch of considerations in a system like the one you are describing, e.g. how are the results ranked? Which is more important, the title search or the full-text embedding search? Is it possible you'll want vector or hybrid search on the title field as well? Worth checking out this Vespa blog: https://blog.vespa.ai/when-you-are-using-vectors-you-are-doing-search/
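
On the ranking question: one common pattern is reciprocal rank fusion with a per-source weight, which makes the title-vs-content trade-off explicit. A toy sketch (names are illustrative, not a LlamaIndex API):

    from typing import Dict, List, Sequence

    def weighted_rrf(
        result_lists: Sequence[List[str]],
        weights: Sequence[float],
        k: int = 60,
    ) -> List[str]:
        """Fuse ranked lists of node ids, weighting each source."""
        scores: Dict[str, float] = {}
        for results, weight in zip(result_lists, weights):
            for rank, node_id in enumerate(results):
                # Standard RRF term, scaled by the source's weight.
                scores[node_id] = scores.get(node_id, 0.0) + weight / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    # E.g. make title matches count twice as much as embedding matches:
    # fused = weighted_rrf([title_ids, embedding_ids], weights=[2.0, 1.0])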

If you have requirements beyond what the LlamaIndex VectorStore classes provide, it may be better to start by implementing the search mechanism in a way that works well for your specific application, and afterwards make an interface for LlamaIndex. As for the database, I understand about the extensions in managed DBs. You could try running ParadeDB in a Docker container, or just make do with native Postgres text search.