Closed JMMackenzie closed 1 year ago
Oh that's great! Thanks for the PR and the news (this makes me think that it will be possible to do some kind of pyterrier-pisa-splade)
I think so!
I believe the only difference would be to modify PyTerrier to provide support for an extra bool going into the cursor generation. The PyTerrier team (cc @cmacdonald and @seanmacavaney) would just need to check out the API change here: https://github.com/pisa-engine/pisa/blob/master/include/pisa/cursor/max_scored_cursor.hpp#L34
The related PR is here: https://github.com/pisa-engine/pisa/pull/467
And then I think it's a matter of incorporating it somewhere like here (and all subsequent calls): https://github.com/terrierteam/pyterrier_pisa/blob/main/src/pyterrier_pisa/_pisathon.cpp#L271
PyTerrier-PISA may also then require a --weighted flag like we use in PISA's tools, which basically says "if a term appears n times in a query, then apply the term/document weight n times during scoring".
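To make the weighted semantics concrete, here is a minimal Python sketch. It is not PISA or PyTerrier-PISA code: the function names and the dict of per-term document scores are hypothetical, and the unweighted branch assumes each distinct query term counts once, as described in this thread.

```python
from collections import Counter

def query_term_weights(query_terms, weighted):
    """Map each distinct query term to its weight (hypothetical helper)."""
    counts = Counter(query_terms)
    if weighted:
        # A term that appears n times in the query gets weight n.
        return {term: float(n) for term, n in counts.items()}
    # Assumed unweighted behaviour: repetition is ignored.
    return {term: 1.0 for term in counts}

def score(doc_term_scores, query_terms, weighted):
    """Sum the term/document scores, scaled by the query term weights."""
    weights = query_term_weights(query_terms, weighted)
    return sum(w * doc_term_scores.get(term, 0.0) for term, w in weights.items())

# Illustrative per-term scores for one document (made-up numbers).
doc = {"pisa": 2.0, "index": 0.5}
query = ["pisa", "pisa", "index"]
weighted_score = score(doc, query, weighted=True)     # "pisa" counted twice
unweighted_score = score(doc, query, weighted=False)  # "pisa" counted once
```

With these made-up numbers, the weighted score applies the "pisa" contribution twice, while the unweighted one applies it once.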
Cheers, Joel
Nice, Thanks for the update!
We already have a branch going for this here. We skip the fwd index generation and instead generate inv directly -- avoiding the (often costly) operation of building enormous strings with repeated terms, just to split them up again. But we got stuck at the query processing phase when we found that term repetition didn't weight query terms. The weighted option is exactly what we need.
We'll pull the latest version of PISA and try to incorporate this soon.
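For anyone following along, here is a rough Python sketch of why going straight to inv saves work. It is only an illustration under assumptions: documents are given as hypothetical term-to-integer-weight dicts, the "index" is a plain dict of postings lists rather than PISA's on-disk format, and the function names are made up.

```python
from collections import Counter, defaultdict

def invert_via_repeated_text(docs):
    """Round trip: expand each weight into repeated terms, then split and count."""
    index = defaultdict(list)
    for docid, term_weights in enumerate(docs):
        # Build the (potentially enormous) string with each term repeated w times.
        text = " ".join(" ".join([term] * w) for term, w in term_weights.items())
        # Split it up again and recover the frequencies.
        for term, freq in Counter(text.split()).items():
            index[term].append((docid, freq))
    return dict(index)

def invert_directly(docs):
    """Write (docid, weight) postings without materialising any text."""
    index = defaultdict(list)
    for docid, term_weights in enumerate(docs):
        for term, w in term_weights.items():
            if w > 0:  # a zero weight would produce no tokens in the round trip
                index[term].append((docid, w))
    return dict(index)

# Two toy documents with integer term weights (made-up values).
docs = [{"a": 3, "b": 1}, {"b": 2}]
same_postings = invert_via_repeated_text(docs) == invert_directly(docs)
```

Both functions produce the same postings here; the direct version just never allocates the repeated-term strings.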
- sean
Thanks @JMMackenzie, that's great!
Just an update: if anyone uses this approach, pull the mainline of PISA; obeying term weights is now in main.
Cheers!