Retriever cache calls transform once per query

janheinrichmerker commented 3 weeks ago

The current RetrieverCache implementation calls the transform() function of the wrapped Transformer once for each query in the input data frame, which can be costly for some retrieval models because it prevents them from parallelizing across queries. I think it should also be possible to batch-retrieve all cache misses in one call of transform() and then store the results by grouping per qid.

janheinrichmerker commented 3 weeks ago

Oh, and by the way, the caching works great, and it's super nice to have a way to specify the cache directory as well :+1: Thanks for the great library!

seanmacavaney commented 2 weeks ago

Great point! I agree that this should be doable and would be a preferable implementation.

seanmacavaney commented 1 week ago

I think this is sorted now with the SparseScorerCache!

janheinrichmerker commented 6 days ago

The SparseScorerCache is absolutely fantastic! But I think the RetrieverCache (or, in particular, the DbmRetrieverCache) still fetches results one query at a time. Since the order of queries does not matter in the result dataframe, I would say that in these lines... https://github.com/seanmacavaney/pyterrier-caching/blob/7963591fcfee17aaf53a4b4e5d4ae7825d0e9188/pyterrier_caching/retriever_cache.py#L68-L72 ...the call to the retriever could essentially be moved outside the for loop, as in:

1. Select the sub-dataframe whose self.on columns match any of the cache misses (to_retrieve).
2. For that dataframe, call the retriever's transform() once.
3. Group the resulting dataframe by the self.on columns again, and cache each group's dataframe.
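
Roughly, the three steps above could look like the sketch below. This is not the library's actual code: `batch_fill_cache` is a hypothetical helper, a plain dict stands in for the DBM-backed store, and `transform` is assumed to be any callable that takes a query dataframe and returns a result dataframe that still contains the `on` columns.

```python
import pandas as pd


def batch_fill_cache(inp, on, cache, transform):
    """Hypothetical helper: answer `inp` from `cache`, retrieving all misses in one call.

    inp:       dataframe of queries
    on:        list of key columns, e.g. ["qid"] (stand-in for self.on)
    cache:     dict mapping key tuples to result dataframes (stand-in for the DBM store)
    transform: callable wrapping the retriever's transform()
    """
    keys = list(inp[on].drop_duplicates().itertuples(index=False, name=None))

    # Step 1: select the sub-dataframe of queries that are not cached yet.
    miss_mask = [key not in cache for key in inp[on].itertuples(index=False, name=None)]
    to_retrieve = inp[miss_mask].drop_duplicates(subset=on)

    if len(to_retrieve) > 0:
        # Step 2: a single transform() call for all cache misses.
        results = transform(to_retrieve)

        # Step 3: group by the key columns and cache each group's dataframe.
        for key, group in results.groupby(on, sort=False):
            key = key if isinstance(key, tuple) else (key,)
            cache[key] = group.reset_index(drop=True)

        # Queries that returned no results get an empty frame,
        # so they are not retrieved again next time.
        for key in to_retrieve[on].itertuples(index=False, name=None):
            cache.setdefault(key, results.iloc[0:0])

    if not keys:
        return inp.iloc[0:0]
    # Assemble the output from the (now complete) cache.
    return pd.concat([cache[key] for key in keys], ignore_index=True)
```

Storing an empty frame for queries without results is a design choice of this sketch; without it, such queries would count as misses on every call.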

seanmacavaney commented 6 days ago

Whoops, right! I'll re-open.

seanmacavaney / pyterrier-caching

Retriever cache calls transform once per query #5