Create a term proximity scorer

quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

MIT License

11.82k stars, 657 forks

Create a term proximity scorer #247

Open fulmicoton opened 6 years ago

fulmicoton commented 6 years ago

A query for Michael Jackson should give a document a better score if the two terms appear close to each other in that document.

It would be nice to have a query that computes something equivalent to

"Michael Jackson"^3 OR (Michael AND jackson)

Conceptually it would just look like the phrasequery, except that the part that checks whether the phrase is present would only affect the score of the document.
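To make the idea concrete, here is a minimal sketch in plain Rust (standard library only, not tantivy's API). The names `tokenize`, `contains_phrase` and `proximity_score` are hypothetical, the base score of 1.0 stands in for a real relevance score, and the boost is added rather than combined the way a real boosted disjunction would be. It only illustrates that the AND clause decides whether a document matches, while the phrase check affects nothing but the score.

```rust
/// Lowercase and split on whitespace (a stand-in for a real tokenizer).
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// True if `terms` occur consecutively and in order somewhere in `doc`.
fn contains_phrase(doc: &[String], terms: &[String]) -> bool {
    !terms.is_empty() && doc.windows(terms.len()).any(|w| w == terms)
}

/// Hypothetical scorer: a document matches only if it contains every query
/// term (the `Michael AND jackson` clause). The phrase check never filters
/// documents out; it only adds `phrase_boost` to the score.
fn proximity_score(doc: &str, query: &str, phrase_boost: f32) -> Option<f32> {
    let doc_tokens = tokenize(doc);
    let terms = tokenize(query);
    if terms.is_empty() || !terms.iter().all(|t| doc_tokens.contains(t)) {
        return None; // not a match at all
    }
    let base = 1.0; // placeholder for a real relevance score such as BM25
    let bonus = if contains_phrase(&doc_tokens, &terms) {
        phrase_boost
    } else {
        0.0
    };
    Some(base + bonus)
}
```

With a boost of 3.0, a document containing "Michael Jackson" as a phrase scores above one containing "Michael met Jackson", and a document missing either term does not match at all.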

audunhalland commented 4 years ago

Is the best way to achieve this right now a disjunction of a PhraseQuery and individual TermQueries?

E.g. in my search engine I would search for Michael Jackson, which is logically Michael OR Jackson. If I add a phrase to the equation, like "Michael Jackson"^3 OR Michael OR Jackson, I would achieve a higher score for the exact phrase match.

But is the idea really about dynamically increasing the score based on the proximity of the next term to the previous term ("term clustering")? E.g. given 3 indexed texts, Michael foo Jackson, Michael foo bar Jackson, and Michael foo bar baz Jackson, the first one would score slightly better than the others.

fulmicoton commented 4 years ago

I do not know which flavor is the best... I don't think this is a one-size-fits-all problem, so different users will ask for different things.

mocobeta commented 3 years ago

FYI... There may be a corresponding issue on Lucene: https://issues.apache.org/jira/browse/LUCENE-3320 . I've started to investigate it; I'd like to help or give feedback here if it works on Lucene.

fulmicoton commented 3 years ago

Sweet! Looking forward to reading about your progress, @mocobeta!

saroh commented 2 years ago

Conceptually it would just look like the phrasequery

You could build on the phrase query with slop: multiply the weight by the length of the query divided by the average length of the matches, or something like that. The more slop a given match needs, the less it contributes to the score.
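A rough sketch of that weighting in plain Rust (standard library only, not tantivy's phrase query). It makes several assumptions: the helper names are hypothetical, terms must appear in query order, and the shortest match is used in place of the average match length.

```rust
/// Lowercase and split on whitespace (a stand-in for a real tokenizer).
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Length in tokens of the shortest span of `doc` that contains `terms`
/// in query order, or None if no such span exists.
fn shortest_match_len(doc: &[String], terms: &[String]) -> Option<usize> {
    let first = terms.first()?;
    let mut best: Option<usize> = None;
    for start in 0..doc.len() {
        if &doc[start] != first {
            continue;
        }
        // Greedily match the remaining terms left to right from `start`.
        let mut pos = start;
        let mut matched = true;
        for term in &terms[1..] {
            match doc[pos + 1..].iter().position(|t| t == term) {
                Some(offset) => pos += 1 + offset,
                None => {
                    matched = false;
                    break;
                }
            }
        }
        if matched {
            let len = pos - start + 1;
            best = Some(best.map_or(len, |b| b.min(len)));
        }
    }
    best
}

/// Hypothetical weight: query length divided by match length, so an exact
/// phrase gets 1.0 and every extra position lowers the weight. Matches that
/// need more than `max_slop` extra positions are rejected.
fn slop_weight(doc: &str, query: &str, max_slop: usize) -> Option<f32> {
    let doc_tokens = tokenize(doc);
    let terms = tokenize(query);
    let match_len = shortest_match_len(&doc_tokens, &terms)?;
    if match_len - terms.len() > max_slop {
        return None;
    }
    Some(terms.len() as f32 / match_len as f32)
}
```

For the query Michael Jackson, this ranks Michael foo Jackson above Michael foo bar Jackson, which in turn ranks above Michael foo bar baz Jackson. The resulting factor would then multiply whatever base score the scorer already computes.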

mustafa0x commented 10 months ago

I'm hoping to use quickwit/tantivy to search fairly long documents, so proximity boosting is a must for sensible results. Is this still not possible? Solr/Lucene seems to support it in some ways at least. Here is one:

improves proximity boosting by using word shingles

https://solr.apache.org/guide/solr/latest/query-guide/edismax-query-parser.html

fulmicoton commented 10 months ago

You will have to implement your own Query/Weight/Scorer.