Weighting is now mainline in PISA

JMMackenzie commented 1 year ago

Just an update for anyone using this approach: you can now get the mainline of PISA, since obeying term weights is now in main.

Cheers!

cadurosar commented 1 year ago

Oh that's great! Thanks for the PR and the news (this makes me think that it will be possible to do some kind of pyterrier-pisa-splade).

JMMackenzie commented 1 year ago

I think so!

I believe the only difference would be to modify PyTerrier to provide support for an extra bool going into the cursor generation. The PyTerrier team (cc @cmacdonald and @seanmacavaney) would just need to check out the API change here: https://github.com/pisa-engine/pisa/blob/master/include/pisa/cursor/max_scored_cursor.hpp#L34

The related PR is here: https://github.com/pisa-engine/pisa/pull/467

And then I think it's a matter of incorporating it somewhere like here (and all subsequent calls): https://github.com/terrierteam/pyterrier_pisa/blob/main/src/pyterrier_pisa/_pisathon.cpp#L271

PyTerrier-PISA may also then require a --weighted flag like the one we use in PISA's tools, which basically says "if a term appears n times in a query, then apply the term/document weight n times during scoring".
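
To make the weighted semantics concrete, here is a rough sketch in plain Python. It is not PISA's or PyTerrier-PISA's code; `score_document`, `doc_impacts`, and the `weighted` parameter are hypothetical names used only for illustration, and it assumes that without weighting a repeated query term contributes once.

```python
from collections import Counter

def score_document(query_terms, doc_impacts, weighted=True):
    """Sum term/document scores for one document (illustrative only).

    query_terms: list of query tokens, possibly with repeats.
    doc_impacts: hypothetical mapping of term -> score of that term
                 in the document being scored.
    weighted:    if True, a term appearing n times in the query
                 contributes n times its term/document score.
    """
    counts = Counter(query_terms)  # term -> number of occurrences in the query
    total = 0.0
    for term, n in counts.items():
        # Assumption: unweighted scoring counts each distinct term once.
        weight = n if weighted else 1
        total += weight * doc_impacts.get(term, 0.0)
    return total

query = ["neural", "neural", "ranking"]
impacts = {"neural": 2.0, "ranking": 1.0}
weighted_score = score_document(query, impacts, weighted=True)
unweighted_score = score_document(query, impacts, weighted=False)
```

In this toy example the repeated term "neural" counts twice under the weighted setting and once otherwise.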

Cheers, Joel

seanmacavaney commented 1 year ago

Nice, thanks for the update!

We already have a branch going for this here. We skip the forward index generation and instead generate the inverted index directly, avoiding the (often costly) operation of building enormous strings with repeated terms, just to split them up again. But we got stuck at the query processing phase when we found that term repetition didn't weight query terms. The weighted option is exactly what we need.
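
As a rough illustration of the difference between the two routes (this is not the code in the branch, and not PISA's index format), the sketch below assumes each document arrives as a dict of term to integer impact. `build_inverted` and `repeated_string` are hypothetical helper names.

```python
from collections import defaultdict

def repeated_string(weights):
    """The roundabout route: repeat each term `impact` times in a string."""
    return " ".join(term for term, impact in weights.items() for _ in range(impact))

def build_inverted(docs):
    """Build term -> [(doc_id, impact), ...] directly from term weights.

    docs: iterable of (doc_id, {term: integer impact}) pairs.
    No intermediate text is created, so nothing has to be re-tokenised.
    """
    postings = defaultdict(list)
    for doc_id, weights in docs:
        for term, impact in weights.items():
            if impact > 0:  # skip terms with no weight in this document
                postings[term].append((doc_id, impact))
    return dict(postings)

docs = [(0, {"splade": 2, "pisa": 1}), (1, {"splade": 3})]
inverted = build_inverted(docs)
```

With realistic impact values the repeated-string form grows with the sum of all impacts per document, while the direct form stores one posting per (term, document) pair.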

We'll pull the latest version of PISA and try to incorporate this soon.

- sean

thibault-formal commented 1 year ago

Thanks @JMMackenzie, that's great!
