run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: Metadata Filters for BM25 Retrievers #14959

Open Timotheevin opened 4 months ago

Timotheevin commented 4 months ago

Feature Description

It seems there is no way to add metadata filters when initializing a BM25Retriever object. I am wondering if it would be possible to add this feature.

Reason

I don't know why this is not already implemented; it doesn't seem very difficult, technically speaking.

Value of Feature

A VectorIndexRetriever is usually used alongside a BM25Retriever, but only the VectorIndexRetriever can currently take a filter as an argument.

logan-markewich commented 4 months ago

In theory it sounds easy, but I'm not sure if the bm25s library exposes an easy way to do this

At first glance, it feels like it doesn't

Timotheevin commented 4 months ago

Hi @logan-markewich,

Thanks for your answer. You're right, the bm25s library doesn't seem to implement this feature, but I was wondering if it was possible to deal with this on the wrapper side.

In llama-index/retrievers/bm25/base.py:39, there is:

self._corpus = [self._tokenizer(node.get_content()) for node in self._nodes]
self.bm25 = BM25Okapi(self._corpus)

Would it be feasible to apply the filter at this stage (i.e. filter _nodes before those two lines, based on the metadata of each node)?

Thanks

logan-markewich commented 4 months ago

That's in the constructor, though. Normally you'd want to filter per retrieval, no?

logan-markewich commented 4 months ago

In that case, if you want, you can filter the nodes before constructing the retriever.
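The suggestion above (filter the nodes first, then build the retriever) could be sketched roughly as follows. This is only an illustration: `StubNode` and `filter_nodes_by_metadata` are hypothetical names that are not part of LlamaIndex, and the only assumption about real nodes is that they expose a `metadata` dict. Only exact-match filtering is shown.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class StubNode:
    """Hypothetical stand-in for a node; only `.text` and `.metadata` are assumed."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_nodes_by_metadata(nodes: Iterable[Any], required: Dict[str, Any]) -> List[Any]:
    """Keep nodes whose metadata contains every key in `required` with an equal value."""
    return [
        node
        for node in nodes
        if all(
            key in node.metadata and node.metadata[key] == value
            for key, value in required.items()
        )
    ]


nodes = [
    StubNode("A crime family saga.", {"theme": "Mafia"}),
    StubNode("A heist inside dreams.", {"theme": "Fiction"}),
    StubNode("An undercover cop story.", {"theme": "Mafia"}),
]

# Filter once, up front; the BM25 corpus would then be built from this subset only.
mafia_nodes = filter_nodes_by_metadata(nodes, {"theme": "Mafia"})
print([node.text for node in mafia_nodes])

# The filtered list would then be passed to the retriever, e.g. (not run here):
# retriever = BM25Retriever.from_defaults(nodes=mafia_nodes, similarity_top_k=2)
```

The trade-off is that the filter is fixed at construction time, so each distinct filter needs its own retriever (and its own BM25 index).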

Timotheevin commented 3 months ago

I don't know whether you can pass a filter as an argument at the retrieval stage. At least in this doc, the filter is passed when building the retriever:

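from llama_index.core.vector_stores import (
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)

# Only return nodes whose "theme" metadata equals "Mafia".
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="theme", operator=FilterOperator.EQ, value="Mafia"),
    ]
)

# `index` is assumed to be a vector store index built elsewhere.
retriever = index.as_retriever(filters=filters)
retriever.retrieve("What is inception about?")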

But yes, you're right, I can do it before constructing the retriever. I guess it's just a matter of where the line falls between the features the framework provides and what has to be implemented on the developer side.
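For the per-retrieval case, one developer-side option would be to post-filter whatever the retriever returns. The sketch below is an assumption-laden illustration, not an existing LlamaIndex feature: `StubResult`, `fake_retrieve`, and `retrieve_with_filter` are hypothetical names, and results are assumed to expose a `metadata` dict and to arrive best score first.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class StubResult:
    """Hypothetical stand-in for a scored retrieval result."""
    text: str
    score: float
    metadata: Dict[str, Any] = field(default_factory=dict)


def retrieve_with_filter(
    retrieve: Callable[[str], List[Any]],
    query: str,
    required: Dict[str, Any],
    top_k: int,
) -> List[Any]:
    """Run `retrieve`, drop results whose metadata does not match, keep the first `top_k`."""
    results = retrieve(query)
    kept = [
        result
        for result in results
        if all(
            key in result.metadata and result.metadata[key] == value
            for key, value in required.items()
        )
    ]
    return kept[:top_k]


def fake_retrieve(query: str) -> List[StubResult]:
    # Pretend these came back from a BM25 search, best score first.
    return [
        StubResult("A heist inside dreams.", 2.1, {"theme": "Fiction"}),
        StubResult("A crime family saga.", 1.7, {"theme": "Mafia"}),
        StubResult("An undercover cop story.", 0.9, {"theme": "Mafia"}),
    ]


hits = retrieve_with_filter(
    fake_retrieve, "What is inception about?", {"theme": "Mafia"}, top_k=1
)
print([hit.text for hit in hits])
```

Because the filter runs after ranking, the underlying retriever has to return more candidates than `top_k`; otherwise matching nodes that ranked lower are never seen and fewer than `top_k` results may come back.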

AMindCoder commented 1 month ago

Any update on this issue? It looks like this is needed for most use cases whenever the BM25 retriever is used.