manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
8.81k stars 488 forks source link

IDF booster does not boost the word #2401

Closed donhardman closed 1 month ago

donhardman commented 1 month ago

Bug Description:

In some instances, when we attempt to use IDF boosters, the weight isn't applied to certain phrases in the match query as expected. Here's a Minimal Reproducible Example (MRE) to demonstrate this issue:

mysql> drop table if exists t; create table t(f text); insert into t(f) values('a'),('b'),('a'); select *, weight() from t where match('a|b'); select *, weight() from t where match('a^0.1|b^1.9'); select *, weight() from t where match('a^1.9|b^0.1'); select *, weight() from t where match('b^1.9|a^0.1');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.08 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(f) values('a'),('b'),('a')
--------------

Query OK, 3 rows affected (0.00 sec)

--------------
select *, weight() from t where match('a|b')
--------------

+---------------------+------+----------+
| id                  | f    | weight() |
+---------------------+------+----------+
| 5839413763366715531 | b    |     1590 |
| 5839413763366715530 | a    |     1500 |
| 5839413763366715532 | a    |     1500 |
+---------------------+------+----------+
3 rows in set (0.00 sec)
--- 3 out of 3 results in 0ms ---

--------------
select *, weight() from t where match('a^0.1|b^1.9')
--------------

+---------------------+------+----------+
| id                  | f    | weight() |
+---------------------+------+----------+
| 5839413763366715531 | b    |     1671 |
| 5839413763366715530 | a    |     1500 |
| 5839413763366715532 | a    |     1500 |
+---------------------+------+----------+
3 rows in set (0.00 sec)
--- 3 out of 3 results in 0ms ---

--------------
select *, weight() from t where match('a^1.9|b^0.1')
--------------

+---------------------+------+----------+
| id                  | f    | weight() |
+---------------------+------+----------+
| 5839413763366715531 | b    |     1509 |
| 5839413763366715530 | a    |     1500 |
| 5839413763366715532 | a    |     1500 |
+---------------------+------+----------+
3 rows in set (0.00 sec)
--- 3 out of 3 results in 0ms ---

--------------
select *, weight() from t where match('b^1.9|a^0.1')
--------------

+---------------------+------+----------+
| id                  | f    | weight() |
+---------------------+------+----------+
| 5839413763366715531 | b    |     1671 |
| 5839413763366715530 | a    |     1500 |
| 5839413763366715532 | a    |     1500 |
+---------------------+------+----------+
3 rows in set (0.00 sec)
--- 3 out of 3 results in 0ms ---

Interestingly, when we set the option idf='plain,tfidf_normalized', the weights for all the same queries in the MRE change and work as expected. This suggests there might be an issue with the default IDF score calculation.

Manticore Search Version:

Latest dev version

Operating System Version:

Ubuntu Jammy

Have you tried the latest development version?

Yes

Internal Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [ ] Implementation completed - [ ] Tests developed - [ ] Documentation updated - [ ] Documentation reviewed - [ ] Changelog updated
sanikolaev commented 1 month ago

@donhardman as discussed, pls update the example.

tomatolog commented 1 month ago

I do not see the issue here - boost correctly set for both terms then it get into count at sphinxsearch.cpp#L4620

The only issue I see that for non plain version fIDF got 0.0 for a term due to the formula at sphinxsearch.cpp#L4600

if ( !tQuery.m_bPlainIDF )
{
    // bm25 variant, idf = log((N-n+1)/n), as per Robertson et al
    //
    // idf \in [-log(N), log(N)]
    // weight \in [-NumWords*log(N), NumWords*log(N)]
    // we want weight \in [0, 1] range
    // we prescale idfs and get weight \in [-0.5, 0.5] range
    // then add 0.5 as our final step
    //
    // for the record, idf = log((N-n+0.5)/(n+0.5)) in the original paper
    // but our variant is a bit easier to compute, and has a better (symmetric) range
    float fLogTotal = logf ( float ( 1+iTotalClamped ) );
    fIDF = logf ( float ( iTotalClamped-iTermDocs+1 ) / float ( iTermDocs ) )
        / ( 2*fLogTotal );

with the values from the case

fLogTotal = 1.38629436
iTotalClamped = 3
iTermDocs= 2
fIDF = logf ( float ( 3-2+1 ) / float ( 2 ) ) / ( 2*1.38629436 )
fIDF = logf ( float ( 2 ) / float ( 2 ) ) / ( 2*1.38629436 )
fIDF = logf ( 1 ) / ( 2*1.38629436 )
fIDF = 0 / ( 2*1.38629436 )
fIDF = 0

but it seems due to the artificially small index not sure this exact case should be fixed as it could affect whole matching

donhardman commented 1 month ago

I can confirm that when a table contains sufficient rows, we can observe changes in the weights. Closing the issue.