Open janheinrichmerker opened 3 weeks ago
Oh, and by the way, the caching works great, and it's super nice to have a way to specify the cache directory as well :+1: Thanks for the great library!
Great point! I agree that this should be doable and be a preferable implementation.
I think this is sorted now with the SparseScorerCache!
The `SparseScorerCache` is absolutely fantastic! But I think for the `RetrieverCache` (or in particular the `DbmRetrieverCache`), results are still fetched one query at a time. As the order of queries does not matter in the result dataframe, I would say that in these lines...
https://github.com/seanmacavaney/pyterrier-caching/blob/7963591fcfee17aaf53a4b4e5d4ae7825d0e9188/pyterrier_caching/retriever_cache.py#L68-L72
...the call to the `retriever` could essentially be moved outside the `for` loop, as in:
1. Group by the `self.on` columns and check whether any of the groups are not cached yet (`to_retrieve`).
2. Call the `retriever`'s `transform()` once.
3. Group by the `self.on` columns again, and cache each group's dataframe.

Whoops, right! I'll re-open.
The current `RetrieverCache` implementation calls the `transform()` function of the wrapped `Transformer` once for each query in the input data frame, which can be costly for some retrieval models due to breaking parallelizability. I think it should also be possible to batch-retrieve all cache misses in one call of `transform()` and then store the results by grouping per `qid`.
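
For illustration, here is a minimal sketch of the batching idea, assuming plain pandas and an in-memory dict instead of the dbm store. It is not the pyterrier-caching code: `BatchedRetrieverCache` and `CountingRetriever` are hypothetical names, and the wrapped retriever is only assumed to expose `transform(DataFrame) -> DataFrame`.

```python
import pandas as pd


class BatchedRetrieverCache:
    """Hypothetical in-memory sketch; NOT the pyterrier-caching implementation."""

    def __init__(self, retriever, on=("qid", "query")):
        self.retriever = retriever  # assumed: has transform(DataFrame) -> DataFrame
        self.on = list(on)          # key columns identifying a query
        self.cache = {}             # key tuple -> cached result DataFrame

    def transform(self, inp: pd.DataFrame) -> pd.DataFrame:
        queries = inp[self.on].drop_duplicates()
        keys = [tuple(row) for row in queries.itertuples(index=False)]
        if not keys:
            return inp.iloc[:0]  # nothing to look up
        # 1. Collect all cache misses instead of retrieving inside the loop.
        miss = pd.Series([k not in self.cache for k in keys],
                         index=queries.index, dtype=bool)
        to_retrieve = queries[miss]
        if len(to_retrieve) > 0:
            # 2. A single batched call to the wrapped retriever.
            retrieved = self.retriever.transform(to_retrieve)
            # 3. Group by the key columns again and cache each group.
            for key, group in retrieved.groupby(self.on, sort=False):
                key = key if isinstance(key, tuple) else (key,)
                self.cache[key] = group.reset_index(drop=True)
            # Also remember queries that returned no results at all.
            for row in to_retrieve.itertuples(index=False):
                self.cache.setdefault(tuple(row), retrieved.iloc[:0])
        # Assemble the output in the order of the input queries.
        return pd.concat([self.cache[k] for k in keys], ignore_index=True)


class CountingRetriever:
    """Stand-in retriever that counts how often transform() is called."""

    def __init__(self):
        self.calls = 0

    def transform(self, queries: pd.DataFrame) -> pd.DataFrame:
        self.calls += 1
        rows = [
            {"qid": q.qid, "query": q.query, "docno": f"d{q.qid}-{i}",
             "score": 1.0 - 0.1 * i, "rank": i}
            for q in queries.itertuples(index=False)
            for i in range(2)
        ]
        return pd.DataFrame(rows)
```

With this structure, a batch of three uncached queries triggers one `transform()` call on the wrapped retriever rather than three, and a later batch only forwards the queries that are still missing.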