pliablepixels opened 3 weeks ago
You can experiment with the `k1` parameter in BM25 by using the `rank_bm25` library, which the `BM25Retriever` class from LlamaIndex relies on. Here is an example of how you can modify the `BM25Retriever` class to include the `k1` parameter:
```python
from typing import Callable, List, Optional

# Import paths below assume the llama_index core package layout; they may
# differ slightly depending on your LlamaIndex version.
from llama_index.core.callbacks import CallbackManager
from llama_index.core.constants import DEFAULT_SIMILARITY_TOP_K
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import BaseNode, IndexNode


class BM25Retriever(BaseRetriever):
    def __init__(
        self,
        nodes: List[BaseNode],
        tokenizer: Optional[Callable[[str], List[str]]],
        similarity_top_k: int = DEFAULT_SIMILARITY_TOP_K,
        k1: float = 1.5,  # Default value for k1
        callback_manager: Optional[CallbackManager] = None,
        objects: Optional[List[IndexNode]] = None,
        object_map: Optional[dict] = None,
        verbose: bool = False,
    ) -> None:
        try:
            from rank_bm25 import BM25Okapi
        except ImportError:
            raise ImportError("Please install rank_bm25: pip install rank-bm25")
        self._nodes = nodes
        # tokenize_remove_stopwords is the default tokenizer helper defined
        # alongside the original BM25Retriever implementation.
        self._tokenizer = tokenizer or tokenize_remove_stopwords
        self._similarity_top_k = similarity_top_k
        self._corpus = [self._tokenizer(node.get_content()) for node in self._nodes]
        self.bm25 = BM25Okapi(self._corpus, k1=k1)  # Set k1 parameter here
        super().__init__(
            callback_manager=callback_manager,
            object_map=object_map,
            objects=objects,
            verbose=verbose,
        )
```
For a retrieval engine that is more biased towards text matches rather than term frequency, you might consider using a vector-based retrieval method such as those provided by dense embedding models (e.g., Sentence Transformers). These models focus on semantic similarity rather than term frequency. LlamaIndex supports vector-based retrieval methods, which can be more effective for capturing semantic meaning in text. You can explore using the `VectorStoreIndex` class in LlamaIndex for this purpose [1][2].
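For illustration, a minimal sketch of that path (the `./data` directory and the default embedding model are placeholders, and the import paths assume a recent `llama_index.core` layout):

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Build a dense-embedding index; ranking here is driven by semantic
# similarity rather than term frequency.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

vector_retriever = index.as_retriever(similarity_top_k=5)
nodes = vector_retriever.retrieve("What is foo?")
```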
Additionally, you can use the `MetadataFilters` class to handle exact matches in `VectorStoreIndex`. Here is an example:
```python
from llama_index.legacy.vector_stores.types import (
    VectorStoreQuery,
    VectorStoreQueryMode,
    MetadataFilters,
    MetadataFilter,
    FilterOperator,
)

# Create exact match filters
filters = MetadataFilters(filters=[
    MetadataFilter(key="field_name", value="exact_value", operator=FilterOperator.EQ)
])

# Create a query with the TEXT_SEARCH mode to prioritize exact matches
query = VectorStoreQuery(
    query_str="your query",
    mode=VectorStoreQueryMode.TEXT_SEARCH,
    filters=filters,
)

# Execute the query
result = vector_store.query(query)
```
This configuration ensures that the query prioritizes exact matches by using the `TEXT_SEARCH` mode and applying the exact match filters [3].
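If you are working through a `VectorStoreIndex` rather than querying the vector store directly, the same kind of exact-match filter can typically be passed to the retriever. A sketch, where `index` and the metadata key/value are placeholders and the `llama_index.core` import paths may differ by version:

```python
from llama_index.core.vector_stores import (
    MetadataFilters,
    MetadataFilter,
    FilterOperator,
)

# Only nodes whose metadata matches exactly are considered; embedding
# similarity then ranks results within that filtered subset.
filters = MetadataFilters(
    filters=[MetadataFilter(key="field_name", value="exact_value", operator=FilterOperator.EQ)]
)
retriever = index.as_retriever(similarity_top_k=5, filters=filters)
nodes = retriever.retrieve("your query")
```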
To continue talking to Dosu, mention @dosu.
Question
Hi, I have a (typical) use-case where the vector index mostly works, but there are times when I provide a query that has an exact match to a chunk and I don't get that exact match. To work around this, I implemented a QueryFusionRetriever that combines bm25 (0.4) with vector (0.6). When I experimented with just the bm25 on its own, I noticed it was still not getting all the exact matches I wanted. I read that tuning down `k1` will make it less biased towards more occurrences of the general terms in the queries, but I don't see a way to change `k1` when I instantiate `BM25Retriever.from_defaults(nodes=nodes, ...)`.
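For reference, the fusion setup looks roughly like this (a sketch; the `relative_score` mode, top-k values, and import paths are assumptions rather than my exact code):

```python
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)
vector_retriever = index.as_retriever(similarity_top_k=10)

fusion_retriever = QueryFusionRetriever(
    [bm25_retriever, vector_retriever],
    retriever_weights=[0.4, 0.6],  # bm25 0.4, vector 0.6
    mode="relative_score",         # weights only apply in score-based fusion modes
    num_queries=1,                 # disable LLM-based query generation
    similarity_top_k=5,
)

results = fusion_retriever.retrieve("What is foo?")
```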
Questions: Is there a way to set `k1`?

Thanks
Here is an example of the document (sample)
With BM25, when I `retrieve("What is foo?")`, it biases towards other Q/A entries that have more "foo" occurrences, when I'd want it to return the top Q/A with a higher score.