run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: How do I prioritize exact matches in my retriever? (BM25 or any other) #14120

Open · pliablepixels opened this issue 3 weeks ago

pliablepixels commented 3 weeks ago

Question

Hi, I have a (typical) use case where the vector index mostly works, but occasionally a query that exactly matches a chunk does not return that chunk. As a workaround, I implemented a QueryFusionRetriever that combines BM25 (weight 0.4) with vector retrieval (weight 0.6), sketched below. When I experimented with BM25 on its own, it still missed some of the exact matches I wanted. I read that lowering k1 reduces the bias towards repeated occurrences of common query terms, but I don't see a way to change k1 when I instantiate BM25Retriever.from_defaults(nodes=nodes, ...)
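
For reference, the fusion setup looks roughly like this (a sketch: `index` is assumed to be an existing VectorStoreIndex over the same nodes, the top-k values are illustrative, and `retriever_weights` requires a llama-index version with weighted fusion support):

```python
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

fusion_retriever = QueryFusionRetriever(
    [
        BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5),
        index.as_retriever(similarity_top_k=5),
    ],
    retriever_weights=[0.4, 0.6],  # bm25 / vector weights mentioned above
    mode="relative_score",  # weighted score fusion
    similarity_top_k=5,
    num_queries=1,  # disable LLM query generation
)

results = fusion_retriever.retrieve("What is foo?")
```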

Questions:

  1. Is there a way to experiment with k1?
  2. Would you recommend a different retrieval engine that can work on the nodes, one that is biased more towards exact text matches than term frequency, over and above BM25?

Thanks

Here is a sample of the document:

```
###
Q: What is foo?
A: Foo stands for something funny
###
Q: Who are you?
A: I am all of us
###
<Many more Q/A where foo occurs>
```

With BM25, when I retrieve("What is foo?") it biases towards other Q/A entries that have more occurrences of foo, when I'd want it to return the first Q/A above with a higher score. Below is a toy reproduction of the effect.
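
Here is a toy reproduction using rank_bm25 directly (shortened documents and naive whitespace tokenization, purely for illustration). Printing the scores shows the gap between the exact-match document and the foo-heavy document shrinking as k1 drops:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "what is foo foo stands for something funny",  # the exact match
    "who are you i am all of us",
    "where do we go nobody knows",
    "when does it start at noon",
    "what is foo bar foo foo foo is a foo bar",    # many repeated "foo"s
]
tokenized_corpus = [doc.split() for doc in corpus]
query = "what is foo".split()

for k1 in (1.5, 0.5):
    bm25 = BM25Okapi(tokenized_corpus, k1=k1)
    print(k1, bm25.get_scores(query))
```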

dosubot[bot] commented 3 weeks ago
  1. You can experiment with the k1 parameter in BM25 via the rank_bm25 library, which the BM25Retriever class from LlamaIndex relies on for scoring. Here is a sketch of how you can modify the BM25Retriever class to accept k1 (imports and a minimal _retrieve are included so the snippet is self-contained):

    from typing import Callable, List, Optional

    from rank_bm25 import BM25Okapi  # pip install rank-bm25

    from llama_index.core.callbacks import CallbackManager
    from llama_index.core.constants import DEFAULT_SIMILARITY_TOP_K
    from llama_index.core.retrievers import BaseRetriever
    from llama_index.core.schema import BaseNode, IndexNode, NodeWithScore, QueryBundle

    class BM25Retriever(BaseRetriever):
        def __init__(
            self,
            nodes: List[BaseNode],
            tokenizer: Optional[Callable[[str], List[str]]] = None,
            similarity_top_k: int = DEFAULT_SIMILARITY_TOP_K,
            k1: float = 1.5,  # term-frequency saturation; lower it to damp repeated terms
            callback_manager: Optional[CallbackManager] = None,
            objects: Optional[List[IndexNode]] = None,
            object_map: Optional[dict] = None,
            verbose: bool = False,
        ) -> None:
            self._nodes = nodes
            # Fall back to naive whitespace tokenization if no tokenizer is provided
            self._tokenizer = tokenizer or (lambda text: text.lower().split())
            self._similarity_top_k = similarity_top_k
            self._corpus = [self._tokenizer(node.get_content()) for node in self._nodes]
            self.bm25 = BM25Okapi(self._corpus, k1=k1)  # k1 is forwarded to rank_bm25 here
            super().__init__(
                callback_manager=callback_manager,
                object_map=object_map,
                objects=objects,
                verbose=verbose,
            )

        def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
            # Score every node against the tokenized query and return the top k
            scores = self.bm25.get_scores(self._tokenizer(query_bundle.query_str))
            results = [
                NodeWithScore(node=node, score=float(score))
                for node, score in zip(self._nodes, scores)
            ]
            results.sort(key=lambda n: n.score or 0.0, reverse=True)
            return results[: self._similarity_top_k]
  2. For a retrieval engine that is biased more towards text matches than term frequency, you might consider a vector-based retrieval method backed by dense embedding models (e.g., Sentence Transformers), which score semantic similarity rather than counting term occurrences. LlamaIndex supports this through the VectorStoreIndex class [1][2]; a minimal sketch follows this list.
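
A minimal sketch of that approach, assuming the same `nodes` list from the question and a configured embedding model:

```python
from llama_index.core import VectorStoreIndex

# Build an index over the existing nodes; embeddings come from the configured embed model
index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=5)
results = retriever.retrieve("What is foo?")
```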

Additionally, you can use the MetadataFilters class to enforce exact matches on metadata fields when querying a vector store. Here is an example (assuming `vector_store` is an existing vector store instance):

```python
from llama_index.core.vector_stores.types import (
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
    VectorStoreQuery,
    VectorStoreQueryMode,
)

# Create exact-match filters on a metadata field
filters = MetadataFilters(filters=[
    MetadataFilter(key="field_name", value="exact_value", operator=FilterOperator.EQ)
])

# Build a query in TEXT_SEARCH mode to prioritize exact text matches
query = VectorStoreQuery(
    query_str="your query",
    mode=VectorStoreQueryMode.TEXT_SEARCH,
    filters=filters,
)

# Execute the query against the vector store
result = vector_store.query(query)
```

This configuration prioritizes exact matches by combining TEXT_SEARCH mode with exact-match metadata filters. Note that how (and whether) VectorStoreQueryMode.TEXT_SEARCH is honored depends on the underlying vector store integration [3].
