seanmacavaney / pyterrier-caching

Caching components for PyTerrier

Allow passing a data frame instead of an iterator to ScorerCache #6

Open janheinrichmerker opened 5 days ago

janheinrichmerker commented 5 days ago

For huge corpora like the ClueWebs, I'd like to avoid iterating over the full corpus. Rather, I'd just pass the output of some retriever applied to some topics to the scorer cache's build function. Is this setting supported?
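Roughly, what I have in mind (the dataset, retriever, and scorer below are just placeholders, assuming the README-style `ScorerCache` setup; the last line is the hypothetical part):

```python
import pyterrier as pt
from pyterrier_caching import ScorerCache

dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
bm25 = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='BM25')
scorer = pt.text.scorer(body_attr='text', wmodel='Tf')  # stand-in for an expensive re-ranker
cached_scorer = ScorerCache('scorer.cache', scorer)

# Supported today: build the cache by iterating over the full corpus,
# which is what I'd like to avoid for ClueWeb-scale corpora:
# cached_scorer.build(dataset.get_corpus_iter())

# What I'd like instead: seed the cache directly from a retriever's results
results = bm25(dataset.get_topics())
cached_scorer.build(results)  # hypothetical -- not supported at the time of writing
```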

seanmacavaney commented 16 hours ago

Thanks for reporting @janheinrichmerker!

I agree that this is burdensome. Right now, you can alternatively pass an npids docid file into build to avoid iterating over the full collection. But to build that file, you'd need to have all the docnos to begin with. (Perhaps there's a list of all ids somewhere?)
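If such a list exists, building the npids file itself would be cheap, something like this (assuming npids' `Lookup.build` API; the input file is illustrative, with one docno per line):

```python
from npids import Lookup

# Build a compact docno lookup file from an existing list of ids.
with open('clueweb_docnos.txt') as f:
    Lookup.build((line.strip() for line in f), 'clueweb.npids')

# 'clueweb.npids' could then be passed to build(...) in place of
# a full corpus iterator.
```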

But even with this, I don't think the current ScorerCache implementation would work all that well for the ClueWebs, since it allocates a dense vector to store a score for every document in the corpus. This will get huge. The design was based on my original use case, which involved scoring every document in msmarco with a cross-encoder. But in hindsight, that's not a typical use case.
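(For rough scale: a single float32 vector with one score per document is about 35 MB for MS MARCO passage's ~8.8M documents, but close to 3 GB for ClueWeb12's ~733M documents.)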

I'm thinking about an alternative implementation that stores scores sparsely and avoids the build step entirely. Perhaps something that creates a dbm file for each query, with the file name being the query's hash and the keys being docids. Or a sqlite database? What are your thoughts on this?
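Roughly what I have in mind (directory layout, hashing, and score encoding all illustrative):

```python
import dbm
import hashlib
import struct
from pathlib import Path

CACHE_DIR = Path('scorer-cache')
CACHE_DIR.mkdir(exist_ok=True)

def _db_path(query: str) -> str:
    # one dbm file per query, named by a hash of the query text
    return str(CACHE_DIR / hashlib.sha256(query.encode()).hexdigest())

def put_score(query: str, docno: str, score: float) -> None:
    with dbm.open(_db_path(query), 'c') as db:
        db[docno.encode()] = struct.pack('f', score)

def get_score(query: str, docno: str) -> float | None:
    with dbm.open(_db_path(query), 'c') as db:
        key = docno.encode()
        if key in db:
            return struct.unpack('f', db[key])[0]
        return None
```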

cc @Parry-Parry

janheinrichmerker commented 15 hours ago

Nice hearing your original thoughts on it :+1: I agree that the current implementation is perfect for your given use case. My use case would be more like this:

I want to benchmark combinations of a bunch of retrievers (cached via RetrieverCache) and re-rankers similar to running a grid search. (One could also think of actually finding the optimal configuration in a grid search.)

Different retrievers still very often retrieve the same documents, or at least share some common subset. So when re-ranking the results of the second retriever, the re-ranker would only need to score the new documents that were not retrieved by the first retriever (a sketch of that pattern follows below). From what I can see, the dbm idea looks perfect for this. I haven't used Python's dbm yet, but it seems nice that it automatically picks an available backend, although that could also cause issues if the cache is later used on a different device. Maybe the sqlite option is then the most portable way to implement it, and it should also be faster than dbm's fallback "dumb" implementation (dbm.dumb).
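Sketching that pattern against a generic dict-like cache (all names illustrative):

```python
import pandas as pd

def rerank_with_cache(query: str, results: pd.DataFrame, cache: dict, scorer) -> pd.DataFrame:
    """Run the expensive scorer only on documents not already cached for this query."""
    keys = [(query, docno) for docno in results['docno']]
    misses = results[[k not in cache for k in keys]]
    if len(misses):
        # e.g. documents from retriever 2 that retriever 1 never surfaced
        scored = scorer(misses)
        for docno, score in zip(scored['docno'], scored['score']):
            cache[(query, docno)] = score
    out = results.copy()
    out['score'] = [cache[k] for k in keys]
    return out
```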

janheinrichmerker commented 15 hours ago

Maybe just explicitly using dbm.sqlite3 would solve it?
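i.e., something like this, forcing the SQLite backend rather than letting `dbm.open()` pick whatever happens to be installed (`dbm.sqlite3` is standard library from Python 3.13):

```python
import dbm.sqlite3

# Open (or create) a cache explicitly backed by SQLite, so the file
# stays portable across machines regardless of which dbm backends exist.
with dbm.sqlite3.open('scorer-cache.sqlite3', 'c') as db:
    db[b'doc123'] = b'1.875'  # illustrative docno/score encoding
```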

seanmacavaney commented 14 hours ago

> My use case would be more like this:

Great! I think this use case is probably way more common than the one I had.

> Maybe just explicitly using dbm.sqlite3 would solve it?

I'll give it a try when I have the chance!

seanmacavaney commented 6 hours ago

I've put together an implementation that uses dbm.sqlite3 and avoids the need for a first pass in #7. Can you take a look when you have the chance?

janheinrichmerker commented 4 hours ago

Very nice indeed! I only now noticed that the dbm.sqlite3 module is only available from Python 3.13 onwards. It might then be easier to use plain SQLite directly, or some abstraction on top of SQLite that you don't need to maintain yourself. I have used diskcache in the past (e.g., for the caching in ir_axioms). It is very stable and fast, and would even allow for some other shenanigans like cache expiration and LRU eviction. Do you think it could be worthwhile to look at that library?
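For illustration (paths, keys, and limits are made up):

```python
from diskcache import Cache

# SQLite-backed, process-safe cache directory
cache = Cache('scorer-cache',
              eviction_policy='least-recently-used',  # LRU eviction
              size_limit=4 * 1024**3)                 # cap on-disk size at ~4 GiB

cache.set(('q42', 'doc123'), 1.875, expire=7 * 24 * 3600)  # optional expiry
score = cache.get(('q42', 'doc123'))  # returns None on a miss
cache.close()
```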