manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
8.87k stars 490 forks source link

Match in a wrong field #1140

Closed sanikolaev closed 1 year ago

sanikolaev commented 1 year ago

Below you can see that the query @f2 abc finds the document with f2 = def which is wrong:

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'; replace into t(f1, f2) values ('abc', 'def'); SELECT * FROM t where match ('@f2 abc'); show meta;  
--------------  
drop table if exists t  
--------------  

Query OK, 0 rows affected (0.01 sec)  

--------------  
create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'  
--------------  

Query OK, 0 rows affected, 1 warning (0.01 sec)  

--------------  
replace into t(f1, f2) values ('abc', 'def')  
--------------  

Query OK, 1 row affected (0.00 sec)  

--------------  
SELECT * FROM t where match ('@f2 abc')  
--------------  

 --------------------- ------ ------   
| id                  | f1   | f2   |  
 --------------------- ------ ------   
| 1515364055206854666 | abc  | def  |  
 --------------------- ------ ------   
1 row in set (0.00 sec)  

--------------  
show meta  
--------------  

 ---------------- -------   
| Variable_name  | Value |  
 ---------------- -------   
| total          | 1     |  
| total_found    | 1     |  
| total_relation | eq    |  
| time           | 0.000 |  
| keyword[0]     | =abc  |  
| docs[0]        | 1     |  
| hits[0]        | 2     |  
| keyword[1]     | abc   |  
| docs[1]        | 1     |  
| hits[1]        | 1     |  
 ---------------- -------   
10 rows in set (0.00 sec)  

Notes:

githubmanticore commented 1 year ago

➤ Sergey Nikolaev commented:

Another MRE:

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'; replace into t(f1, f2) values ('peter', 'john'); SELECT * FROM t where match ('@f2 peter'); show meta; 
-------------- 
drop table if exists t 
-------------- 

Query OK, 0 rows affected (0.02 sec) 

-------------- 
create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2' 
-------------- 

Query OK, 0 rows affected, 1 warning (0.01 sec) 

-------------- 
replace into t(f1, f2) values ('peter', 'john') 
-------------- 

Query OK, 1 row affected (0.01 sec) 

-------------- 
SELECT * FROM t where match ('@f2 peter') 
-------------- 

+---------------------+-------+------+ 
| id                  | f1    | f2   | 
+---------------------+-------+------+ 
| 1515374047901253834 | peter | john | 
+---------------------+-------+------+ 
1 row in set (0.00 sec) 

-------------- 
show meta 
-------------- 

+----------------+--------+ 
| Variable_name  | Value  | 
+----------------+--------+ 
| total          | 1      | 
| total_found    | 1      | 
| total_relation | eq     | 
| time           | 0.000  | 
| keyword[0]     | =peter | 
| docs[0]        | 1      | 
| hits[0]        | 2      | 
| keyword[1]     | peter  | 
| docs[1]        | 1      | 
| hits[1]        | 1      | 
+----------------+--------+ 
10 rows in set (0.00 sec) 
githubmanticore commented 1 year ago

➤ Aleksey N. Vinogradov commented:

Can't reproduce at all.

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all'; replace into t(f1, f2) values ('peter', 'john'); SELECT * FROM t where match ('@f2 peter'); show meta; 
Query OK, 0 rows affected (0,00 sec) 

Query OK, 0 rows affected (0,00 sec) 

Query OK, 1 row affected (0,01 sec) 

Empty set (0,00 sec) 

+----------------+-------+ 
| Variable_name  | Value | 
+----------------+-------+ 
| total          | 0     | 
| total_found    | 0     | 
| total_relation | eq    | 
| time           | 0.000 | 
| keyword[0]     | peter | 
| docs[0]        | 1     | 
| hits[0]        | 1     | 
+----------------+-------+ 
7 rows in set (0,00 sec) 

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'; replace into t(f1, f2) values ('peter', 'john'); SELECT * FROM t where match ('@f2 peter'); show meta; 
Query OK, 0 rows affected (0,00 sec) 

Query OK, 0 rows affected (0,00 sec) 

Query OK, 1 row affected (0,01 sec) 

Empty set (0,00 sec) 

+----------------+-------+ 
| Variable_name  | Value | 
+----------------+-------+ 
| total          | 0     | 
| total_found    | 0     | 
| total_relation | eq    | 
| time           | 0.000 | 
| keyword[0]     | peter | 
| docs[0]        | 1     | 
| hits[0]        | 1     | 
+----------------+-------+ 
7 rows in set (0,00 sec) 
sanikolaev commented 1 year ago

Strange. I can't reproduce the peter/john case anymore too.

But this still fails for me in the latest dev version on mac m1:

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'; replace into t(f1, f2) values ('abc', 'def'); select * from t; SELECT * FROM t where match ('@f2 abc'); show meta; call keywords('abc', 't'); call keywords('def', 't');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'
--------------

Query OK, 0 rows affected, 1 warning (0.01 sec)

--------------
replace into t(f1, f2) values ('abc', 'def')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
select * from t
--------------

+---------------------+------+------+
| id                  | f1   | f2   |
+---------------------+------+------+
| 1515384550488801301 | abc  | def  |
+---------------------+------+------+
1 row in set (0.01 sec)

--------------
SELECT * FROM t where match ('@f2 abc')
--------------

+---------------------+------+------+
| id                  | f1   | f2   |
+---------------------+------+------+
| 1515384550488801301 | abc  | def  |
+---------------------+------+------+
1 row in set (0.00 sec)

--------------
show meta
--------------

+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| total          | 1     |
| total_found    | 1     |
| total_relation | eq    |
| time           | 0.000 |
| keyword[0]     | =abc  |
| docs[0]        | 1     |
| hits[0]        | 2     |
| keyword[1]     | abc   |
| docs[1]        | 1     |
| hits[1]        | 1     |
+----------------+-------+
10 rows in set (0.00 sec)

--------------
call keywords('abc', 't')
--------------

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | abc       | abc        |
| 1    | abc       | =abc       |
+------+-----------+------------+
2 rows in set (0.00 sec)

--------------
call keywords('def', 't')
--------------

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | def       | def        |
| 1    | def       | =def       |
+------+-----------+------------+
2 rows in set (0.00 sec)
sanikolaev commented 1 year ago

Can't reproduce at all.

I see what's wrong: you've removed morphology_skip_fields='f2'. As said in the issue:

Removing morphology_skip_fields='f2' eliminates the issue

If you don't remove it, it should reproduce the problem:

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'; replace into t(f1, f2) values ('peter', 'john'); SELECT * FROM t where match ('@f2 peter'); show meta;
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'
--------------

Query OK, 0 rows affected, 1 warning (0.01 sec)

--------------
replace into t(f1, f2) values ('peter', 'john')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
SELECT * FROM t where match ('@f2 peter')
--------------

+---------------------+-------+------+
| id                  | f1    | f2   |
+---------------------+-------+------+
| 1515384593958567938 | peter | john |
+---------------------+-------+------+
1 row in set (0.00 sec)

--------------
show meta
--------------

+----------------+--------+
| Variable_name  | Value  |
+----------------+--------+
| total          | 1      |
| total_found    | 1      |
| total_relation | eq     |
| time           | 0.000  |
| keyword[0]     | =peter |
| docs[0]        | 1      |
| hits[0]        | 2      |
| keyword[1]     | peter  |
| docs[1]        | 1      |
| hits[1]        | 1      |
+----------------+--------+
10 rows in set (0.00 sec)
githubmanticore commented 1 year ago

➤ Aleksey N. Vinogradov commented:

If you look carefully, I've tried both, with morphology_skip_fields and without it. Both cases can't reproduce the issue. Tried both M1 mac and intel linux.

FIY, you've just cited 'mre' with skip field, but it've found something, strange. Mine still no success:

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'; replace into t(f1, f2) values ('peter', 'john'); SELECT * FROM t where match ('@f2 peter'); show meta; 
Query OK, 0 rows affected (0,01 sec) 

Query OK, 0 rows affected, 1 warning (0,01 sec) 

Query OK, 1 row affected (0,00 sec) 

Empty set (0,00 sec) 

+----------------+--------+ 
| Variable_name  | Value  | 
+----------------+--------+ 
| total          | 0      | 
| total_found    | 0      | 
| total_relation | eq     | 
| time           | 0.000  | 
| keyword[0]     | =peter | 
| docs[0]        | 1      | 
| hits[0]        | 1      | 
| keyword[1]     | peter  | 
| docs[1]        | 1      | 
| hits[1]        | 1      | 
+----------------+--------+ 
10 rows in set (0,00 sec) 

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all'; replace into t(f1, f2) values ('peter', 'john'); SELECT * FROM t where match ('@f2 peter'); show meta; 
Query OK, 0 rows affected (0,01 sec) 

Query OK, 0 rows affected (0,01 sec) 

Query OK, 1 row affected (0,00 sec) 

Empty set (0,00 sec) 

+----------------+-------+ 
| Variable_name  | Value | 
+----------------+-------+ 
| total          | 0     | 
| total_found    | 0     | 
| total_relation | eq    | 
| time           | 0.000 | 
| keyword[0]     | peter | 
| docs[0]        | 1     | 
| hits[0]        | 1     | 
+----------------+-------+ 
7 rows in set (0,00 sec) 

morphology_skip_fields affects output of 'show meta', but doesn't cause any fake doc to be found. So, still NOT reproducable.

githubmanticore commented 1 year ago

➤ Aleksey N. Vinogradov commented:

upd. With release build morphology_skip_fields works.

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'; replace into t(f1, f2) values ('peter', 'john'); SELECT * FROM t where match ('@f2 peter'); show meta; 
Query OK, 0 rows affected (0,01 sec) 

Query OK, 0 rows affected, 1 warning (0,00 sec) 

Query OK, 1 row affected (0,00 sec) 

+---------------+-------+------+ 
| id            | f1    | f2   | 
+---------------+-------+------+ 
| 1677721600007 | peter | john | 
+---------------+-------+------+ 
1 row in set (0,00 sec) 

+----------------+--------+ 
| Variable_name  | Value  | 
+----------------+--------+ 
| total          | 1      | 
| total_found    | 1      | 
| total_relation | eq     | 
| time           | 0.000  | 
| keyword[0]     | =peter | 
| docs[0]        | 1      | 
| hits[0]        | 2      | 
| keyword[1]     | peter  | 
| docs[1]        | 1      | 
| hits[1]        | 1      | 
+----------------+--------+ 
10 rows in set (0,00 sec) 

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all'; replace into t(f1, f2) values ('peter', 'john'); SELECT * FROM t where match ('@f2 peter'); show meta; 
Query OK, 0 rows affected (0,01 sec) 

Query OK, 0 rows affected (0,00 sec) 

Query OK, 1 row affected (0,00 sec) 

Empty set (0,00 sec) 

+----------------+-------+ 
| Variable_name  | Value | 
+----------------+-------+ 
| total          | 0     | 
| total_found    | 0     | 
| total_relation | eq    | 
| time           | 0.000 | 
| keyword[0]     | peter | 
| docs[0]        | 1     | 
| hits[0]        | 1     | 
+----------------+-------+ 
7 rows in set (0,00 sec) 

So, without morphology_skip_fields both builds (debug and release) behave similar and reveal no issue.

However on debug build behavior of morphology_skip_fields differs from release build. That looks like a bug, but not a one reported in the title. I'll try to localize it (usually such difference means that some 'assert' in the code left in wrong place and influent to behaviour depended on type of build.

githubmanticore commented 1 year ago

➤ Aleksey N. Vinogradov commented:

here is guilty code snippet

        if ( m_bIndexExactWords && eMorph!=SPH_TOKEN_MORPH_GUESS ) 
        { 
            auto iBytes = strlen ( (const char*)sWord ); 
            memcpy ( sBuf + 1, sWord, iBytes ); 
            sBuf[0] = MAGIC_WORD_HEAD_NONSTEMMED; 
            sBuf[iBytes+1] = '%CODE_BLOCK%'; 
        } 

        if ( m_bIndexExactWords && ( eMorph==SPH_TOKEN_MORPH_ORIGINAL || bMorphDisabled ) ) 
        { 
            // can not use GetWordID here due to exception vs missed hit, ie 
            // stemmed sWord hasn't got added to hit stream but might be added as exception to dictionary 
            // that causes error at hit sorting phase \ dictionary HitblockPatch 
            if ( !m_pDict->GetSettings().m_bStopwordsUnstemmed ) 
                m_pDict->ApplyStemmers ( sWord ); 

            if ( !m_pDict->IsStopWord ( sWord ) ) 
                m_tHits.Add ( { tRowID, m_pDict->GetWordIDNonStemmed ( sBuf ), m_tState.m_iHitPos } ); 

            m_tState.m_iBuildLastStep = m_pTokenizer->TokenIsBlended() ? 0 : 1; 
            continue; 
        } 

Notice - if we have SPH_TOKEN_MORPH_GUESS - then sBuf is NOT initialized, and contains something strange (garbage). But if bMorphDisabled - it will be injected into dictionary despite it is garbage. And that is the root of the problem.

sanikolaev commented 1 year ago

Works fine now:

➜  ~ mysql -P9306 -h0 -v
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 18
Server version: 6.0.5 da4dc6bdb@230612 dev (columnar 2.0.5 f1a9dd4@230612) (secondary 2.0.5 f1a9dd4@230612) git branch master...origin/master

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Reading history-file /Users/sn/.mysql_history
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> drop table if exists t; create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'; replace into t(f1, f2) values ('abc', 'def'); SELECT * FROM t where match ('@f2 abc'); show meta;
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.02 sec)

--------------
create table t (f1 text, f2 text) morphology='lemmatize_de_all' morphology_skip_fields='f2'
--------------

Query OK, 0 rows affected, 1 warning (0.00 sec)

--------------
replace into t(f1, f2) values ('abc', 'def')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
SELECT * FROM t where match ('@f2 abc')
--------------

Empty set (0.01 sec)

--------------
show meta
--------------

+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| total          | 0     |
| total_found    | 0     |
| total_relation | eq    |
| time           | 0.000 |
| keyword[0]     | =abc  |
| docs[0]        | 1     |
| hits[0]        | 1     |
| keyword[1]     | abc   |
| docs[1]        | 1     |
| hits[1]        | 1     |
+----------------+-------+
10 rows in set (0.00 sec)