manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
8.87k stars 490 forks source link

Misaligned weight() values across cluster nodes #1111

Open pavelnemirovsky opened 1 year ago

pavelnemirovsky commented 1 year ago

Describe the bug
We have a cluster consisting of 3 nodes, and I have been running the same command on each node. However, I have noticed that I am getting different scores as results. Could you please provide an explanation for this phenomenon?

To Reproduce
Steps to reproduce the behavior:

# mysql -h manticore-01.dmetrics.internal -P 9306  
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200;  
 ----------- ----------- ----------   
| min_score | max_score | count(*) |  
 ----------- ----------- ----------   
|      1602 |      1820 |    35274 |  
 ----------- ----------- ----------   
1 row in set (0.14 sec)  

# mysql -h manticore-02.dmetrics.internal -P 9306  

mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200;  
 ----------- ----------- ----------   
| min_score | max_score | count(*) |  
 ----------- ----------- ----------   
|      1604 |      1855 |    35274 |  
 ----------- ----------- ----------   
1 row in set (0.14 sec)  

# mysql -h manticore-03.dmetrics.internal -P 9306  
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200;  
 ----------- ----------- ----------   
| min_score | max_score | count(*) |  
 ----------- ----------- ----------   
|      1602 |      1752 |    35274 |  
 ----------- ----------- ----------   
1 row in set (0.14 sec)  

Expected behavior
Expected idempotent results across all cluster nodes.

Describe the environment:

tomatolog commented 1 year ago

could you copy index files from nodes where values differ for testing?

Could you check that disk chunks count the same? Could you issue your query without pseudo sharding \ option threads = 1 to check the result ?

pavelnemirovsky commented 1 year ago

@tomatolog

Config:

# https://github.com/manticoresoftware/manticoresearch/blob/master/manual/Server_settings/Searchd.md#node_address
common {
    # https://manual.manticoresearch.com/Server_settings/Common#lemmatizer_base
    lemmatizer_base = /usr/share/manticore/nlp/

    # https://manual.manticoresearch.com/Server_settings/Common#progressive_merge
    # progressive_merge =

    # https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_keynames
    # json_autoconv_keynames =

    # https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_numbers
    # json_autoconv_numbers = 0

    # https://manual.manticoresearch.com/Server_settings/Common#on_json_attr_error
    # on_json_attr_error = ignore_attr

    # plugin_dir =
}

searchd {
  server_id = manticore-03.dmetrics.internal
  listen = 127.0.0.1:9306:mysql
  listen = 127.0.0.1:9308:http
  listen = 10.0.82.150:9312
  listen = 10.0.82.150:9306:mysql
  listen = 10.0.82.150:9308:http
  listen = 10.0.82.150:9360-9370:replication
  pid_file = /var/run/manticore/searchd.pid
  max_packet_size = 128M
  binlog_flush = 1 # ultimate safety, low speed
  query_log_format = sphinxql
  mysql_version_string = 5.0.37
  data_dir = /var/lib/data/manticore
  binlog_path = /var/lib/data/manticore
  log = /var/log/manticore/searchd.log
  query_log = /var/log/manticore/query.log
  attr_flush_period = 60
  # Flushing RT RAM chunks each 5 min
  rt_flush_period = 300
  preopen_indexes = 0
  not_terms_only_allowed = 1
}

Single Thread:

# mysql -h manticore-01.dmetrics.internal -P 9306
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 option threads = 1;
+-----------+-----------+----------+
| min_score | max_score | count(*) |
+-----------+-----------+----------+
|      1603 |      1769 |    36392 |
+-----------+-----------+----------+
1 row in set (0.15 sec)

mysql> SHOW INDEX fgi_prod STATUS;
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| Variable_name               | Value                                                                                                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| index_type                  | rt                                                                                                         |
| indexed_documents           | 14139585                                                                                                   |
| indexed_bytes               | 74543841680                                                                                                |
| ram_bytes                   | 19916679448                                                                                                |
| disk_bytes                  | 90059646941                                                                                                |
| disk_mapped                 | 23217024968                                                                                                |
| disk_mapped_cached          | 18921156608                                                                                                |
| disk_mapped_doclists        | 0                                                                                                          |
| disk_mapped_cached_doclists | 0                                                                                                          |
| disk_mapped_hitlists        | 0                                                                                                          |
| disk_mapped_cached_hitlists | 0                                                                                                          |
| killed_documents            | 0                                                                                                          |
| killed_rate                 | 0.00%                                                                                                      |
| ram_chunk                   | 995385488                                                                                                  |
| ram_chunk_segments_count    | 28                                                                                                         |
| disk_chunks                 | 32                                                                                                         |
| mem_limit                   | 2147483648                                                                                                 |
| mem_limit_rate              | 93.72%                                                                                                     |
| ram_bytes_retired           | 0                                                                                                          |
| tid                         | 145999                                                                                                     |
| tid_saved                   | 145999                                                                                                     |
| query_time_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_5min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_15min            | {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}     |
| query_time_total            | {"queries":33512, "avg_sec":0.078, "min_sec":0.001, "max_sec":0.161, "pct95_sec":0.112, "pct99_sec":0.126} |
| found_rows_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_5min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_15min            | {"queries":1, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1}                                             |
| found_rows_total            | {"queries":33512, "avg":98644, "min":0, "max":192349, "pct95":141003, "pct99":156775}                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+

# mysql -h manticore-02.dmetrics.internal -P 9306
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 option threads = 1;
+-----------+-----------+----------+
| min_score | max_score | count(*) |
+-----------+-----------+----------+
|      1604 |      1773 |    36392 |
+-----------+-----------+----------+
1 row in set (0.15 sec)

mysql> SHOW INDEX fgi_prod STATUS;
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| Variable_name               | Value                                                                                                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| index_type                  | rt                                                                                                         |
| indexed_documents           | 14139585                                                                                                   |
| indexed_bytes               | 74543841680                                                                                                |
| ram_bytes                   | 19832797718                                                                                                |
| disk_bytes                  | 90049699488                                                                                                |
| disk_mapped                 | 23232896729                                                                                                |
| disk_mapped_cached          | 18917761024                                                                                                |
| disk_mapped_doclists        | 0                                                                                                          |
| disk_mapped_cached_doclists | 0                                                                                                          |
| disk_mapped_hitlists        | 0                                                                                                          |
| disk_mapped_cached_hitlists | 0                                                                                                          |
| killed_documents            | 0                                                                                                          |
| killed_rate                 | 0.00%                                                                                                      |
| ram_chunk                   | 914899342                                                                                                  |
| ram_chunk_segments_count    | 28                                                                                                         |
| disk_chunks                 | 32                                                                                                         |
| mem_limit                   | 2147483648                                                                                                 |
| mem_limit_rate              | 95.00%                                                                                                     |
| ram_bytes_retired           | 0                                                                                                          |
| tid                         | 145999                                                                                                     |
| tid_saved                   | 145999                                                                                                     |
| query_time_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_5min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_15min            | {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}     |
| query_time_total            | {"queries":32779, "avg_sec":0.087, "min_sec":0.001, "max_sec":0.294, "pct95_sec":0.154, "pct99_sec":0.185} |
| found_rows_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_5min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_15min            | {"queries":1, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1}                                             |
| found_rows_total            | {"queries":32779, "avg":99523, "min":0, "max":205374, "pct95":141003, "pct99":192349}                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+

# mysql -h manticore-03.dmetrics.internal -P 9306
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 option threads = 1;
+-----------+-----------+----------+
| min_score | max_score | count(*) |
+-----------+-----------+----------+
|      1602 |      1782 |    36392 |
+-----------+-----------+----------+
1 row in set (0.15 sec)

+-----------------------------+------------------------------------------------------------------------------------------------------------+
| Variable_name               | Value                                                                                                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| index_type                  | rt                                                                                                         |
| indexed_documents           | 14139586                                                                                                   |
| indexed_bytes               | 74543841680                                                                                                |
| ram_bytes                   | 18465132812                                                                                                |
| disk_bytes                  | 90025432598                                                                                                |
| disk_mapped                 | 23290852358                                                                                                |
| disk_mapped_cached          | 17833938944                                                                                                |
| disk_mapped_doclists        | 0                                                                                                          |
| disk_mapped_cached_doclists | 0                                                                                                          |
| disk_mapped_hitlists        | 0                                                                                                          |
| disk_mapped_cached_hitlists | 0                                                                                                          |
| killed_documents            | 0                                                                                                          |
| killed_rate                 | 0.00%                                                                                                      |
| ram_chunk                   | 631056516                                                                                                  |
| ram_chunk_segments_count    | 29                                                                                                         |
| disk_chunks                 | 32                                                                                                         |
| mem_limit                   | 2147483648                                                                                                 |
| mem_limit_rate              | 93.18%                                                                                                     |
| ram_bytes_retired           | 0                                                                                                          |
| tid                         | 145999                                                                                                     |
| tid_saved                   | 145999                                                                                                     |
| query_time_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_5min             | {"queries":1, "avg_sec":0.002, "min_sec":0.002, "max_sec":0.002, "pct95_sec":0.002, "pct99_sec":0.002}     |
| query_time_15min            | {"queries":2, "avg_sec":0.002, "min_sec":0.002, "max_sec":0.002, "pct95_sec":0.002, "pct99_sec":0.002}     |
| query_time_total            | {"queries":33076, "avg_sec":0.079, "min_sec":0.001, "max_sec":0.213, "pct95_sec":0.117, "pct99_sec":0.130} |
| found_rows_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_5min             | {"queries":1, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1}                                             |
| found_rows_15min            | {"queries":2, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1}                                             |
| found_rows_total            | {"queries":33076, "avg":97746, "min":0, "max":192349, "pct95":141003, "pct99":156775}                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
tomatolog commented 1 year ago

could you issue these queries at all nodes and provide results sets from every node?

select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;

seems your index got different data however it is not clear how it got into such state and what is the difference.

pavelnemirovsky commented 1 year ago

@tomatolog

Node #1

root@manticore-01:~# mysql -h0 -P9306

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 2592984993446342688 |  1603 |
| 4662220671500427390 |  1603 |
| 3129968330644052828 |  1603 |
| 3300479849082479870 |  1603 |
| 4265425771834879270 |  1603 |
| 6161236734625651487 |  1603 |
| 2972218032951538573 |  1603 |
| 2230410452443263952 |  1603 |
| 5426145563682917845 |  1603 |
| 5443436492395652225 |  1603 |
| 8123348241364266641 |  1603 |
| 1209174545758837824 |  1603 |
| 2717219431518028807 |  1603 |
| 1585252769924152895 |  1603 |
| 5552902210274204647 |  1603 |
| 1980593527404316223 |  1603 |
| 1481372055290767673 |  1603 |
| 7427455941233620045 |  1603 |
| 2506307317936132088 |  1603 |
| 8752126551063260322 |  1603 |
+---------------------+-------+
20 rows in set (0.02 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 8200229092940547929 |  1604 |
| 2120942744610469628 |  1604 |
| 6923035898495548477 |  1604 |
| 6601064453919589115 |  1604 |
| 6735606320642119811 |  1604 |
| 2999792835472411662 |  1604 |
| 3654574537062196888 |  1604 |
| 3773264351158118416 |  1604 |
| 8637578311057562584 |  1604 |
| 8794164784460149655 |  1604 |
| 6059877165305848918 |  1604 |
| 3282906016831256748 |  1604 |
| 2612261472663055387 |  1604 |
| 2690877269429919795 |  1604 |
| 2847097192829816046 |  1604 |
| 5942945371882851678 |  1604 |
| 2474220990740776572 |  1604 |
|  733163232415140441 |  1604 |
| 2513753592768196340 |  1604 |
| 8773490989255665275 |  1604 |
+---------------------+-------+
20 rows in set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
Empty set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 4873968605738893745 |  1602 |
| 2171153676825932229 |  1602 |
| 3622786119719300160 |  1602 |
| 3049536713905632928 |  1602 |
| 6107640006935923479 |  1602 |
|  801981398053586315 |  1602 |
| 1960698561719059790 |  1602 |
| 1343761795668256013 |  1602 |
| 4622984669984555319 |  1602 |
| 1545080798186495944 |  1602 |
| 2098591267684858063 |  1602 |
| 4497645853033140285 |  1602 |
| 8946551953652792236 |  1602 |
| 3248487345969809604 |  1602 |
| 2919116455063626548 |  1602 |
| 8727190087137721890 |  1602 |
| 2513347835042179078 |  1602 |
| 5636863017215536826 |  1602 |
| 3172480916545321926 |  1602 |
| 7111135492437432234 |  1602 |
+---------------------+-------+
20 rows in set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
Empty set (0.00 sec)

Node #2

root@manticore-02:~# mysql -h0 -P9306

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
Empty set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 8892694543726138219 |  1602 |
|  629335885271522399 |  1602 |
| 1456549821258029774 |  1602 |
| 7249953765864106116 |  1602 |
| 2786707208645515768 |  1602 |
| 5261557492145362518 |  1602 |
| 2833376159662097650 |  1602 |
| 8595731973147110735 |  1602 |
| 2253945765483571809 |  1602 |
| 4997605205840460661 |  1602 |
| 4319783924984481239 |  1602 |
| 4315258191512044085 |  1602 |
|  371245259292705244 |  1602 |
| 4189238752045049628 |  1602 |
| 4177900041508324215 |  1602 |
| 5487895812895329158 |  1602 |
| 1083832472836028683 |  1602 |
| 2432922389897552564 |  1602 |
| 1665163043726892822 |  1602 |
| 1831702418817682063 |  1602 |
+---------------------+-------+
20 rows in set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
Empty set (0.00 sec)

Node #3

root@manticore-03:~# mysql -h0 -P9306

mysql> ¬select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
ERROR 1064 (42000): sphinxql: syntax error, unexpected $end near '¬select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603'
mysql>
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 2592984993446342688 |  1603 |
| 4662220671500427390 |  1603 |
| 3129968330644052828 |  1603 |
| 3300479849082479870 |  1603 |
| 4265425771834879270 |  1603 |
| 6161236734625651487 |  1603 |
| 2972218032951538573 |  1603 |
| 2230410452443263952 |  1603 |
| 5426145563682917845 |  1603 |
| 5443436492395652225 |  1603 |
| 8123348241364266641 |  1603 |
| 1209174545758837824 |  1603 |
| 2717219431518028807 |  1603 |
| 1585252769924152895 |  1603 |
| 5552902210274204647 |  1603 |
| 1980593527404316223 |  1603 |
| 1481372055290767673 |  1603 |
| 7427455941233620045 |  1603 |
| 2506307317936132088 |  1603 |
| 8752126551063260322 |  1603 |
+---------------------+-------+
20 rows in set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 3278299437838253517 |  1604 |
|  831363264692743711 |  1604 |
| 4097074020702646877 |  1604 |
|  502386133080193092 |  1604 |
| 2438302740824218206 |  1604 |
| 2434101175870422770 |  1604 |
| 6705524661416092133 |  1604 |
| 1567232811339767193 |  1604 |
| 3999944616016432641 |  1604 |
| 1306726073512346320 |  1604 |
| 5297381330670067881 |  1604 |
| 9017267096864895454 |  1604 |
| 8994162228783625519 |  1604 |
| 7889636414790414500 |  1604 |
| 4636895361539966838 |  1604 |
| 8006576869797910500 |  1604 |
|  430197624393949370 |  1604 |
| 3386716024388158741 |  1604 |
| 4022663738521285142 |  1604 |
| 3264531810452662099 |  1604 |
+---------------------+-------+
20 rows in set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 4873968605738893745 |  1602 |
| 2171153676825932229 |  1602 |
| 3622786119719300160 |  1602 |
| 3049536713905632928 |  1602 |
| 6107640006935923479 |  1602 |
|  801981398053586315 |  1602 |
| 1960698561719059790 |  1602 |
| 1343761795668256013 |  1602 |
| 4622984669984555319 |  1602 |
| 1545080798186495944 |  1602 |
| 2098591267684858063 |  1602 |
| 4497645853033140285 |  1602 |
| 8946551953652792236 |  1602 |
| 3248487345969809604 |  1602 |
| 2919116455063626548 |  1602 |
| 8727190087137721890 |  1602 |
| 2513347835042179078 |  1602 |
| 5636863017215536826 |  1602 |
| 3172480916545321926 |  1602 |
| 7111135492437432234 |  1602 |
+---------------------+-------+
20 rows in set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
Empty set (0.01 sec)
pavelnemirovsky commented 1 year ago

@tomatolog sorry for delay

sanikolaev commented 1 year ago

Document weight is a function of multiple variables including IDF which depends on the number of documents in the table/disk chunk/ram chunk and in case of an RT index it's calculated separately for each disk chunk, so even without replication the same query against the same documents can give different weights and even different documents order depending on the distribution of the documents in the chunks:

mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'),('a b'),('a a a a a'); flush ramchunk t; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.03 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(f) values('a b c'),('a b'),('a a a a a')
--------------

Query OK, 3 rows affected (0.01 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
select *, weight() from t where match('a')
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 1515364055206854670 | a b c     |     1319 |
| 1515364055206854671 | a b       |     1319 |
| 1515364055206854672 | a a a a a |     1180 |
+---------------------+-----------+----------+
3 rows in set (0.01 sec)

mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'); flush ramchunk t; insert into t(f) values('a b'); flush ramchunk t; insert into t(f) values('a a a a a'); flush ramchunk t; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t(f) values('a b c')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t(f) values('a b')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t(f) values('a a a a a')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
select *, weight() from t where match('a')
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 1515364055206854675 | a a a a a |     1819 |
| 1515364055206854673 | a b c     |     1680 |
| 1515364055206854674 | a b       |     1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)

We should consider integration of https://manual.manticoresearch.com/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#global_idf into RT indexes.

sanikolaev commented 1 year ago

There's also local_df and it's not working for RT tables:

mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'); flush ramchunk t; insert into t(f) values('a b'); flush ramchunk t; insert into t(f) values('a a a a a'); flush ramchunk t; select *, weight() from t where match('a'); select *, weight() from t where match('a') option local_df=1; optimize table t option sync=1, cutoff=1; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.04 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(f) values('a b c')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(f) values('a b')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t(f) values('a a a a a')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
select *, weight() from t where match('a')
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 5838818623640043544 | a a a a a |     1819 |
| 5838818623640043542 | a b c     |     1680 |
| 5838818623640043543 | a b       |     1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)

--------------
select *, weight() from t where match('a') option local_df=1
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 5838818623640043544 | a a a a a |     1819 |
| 5838818623640043542 | a b c     |     1680 |
| 5838818623640043543 | a b       |     1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)

--------------
optimize table t option sync=1, cutoff=1
--------------

Query OK, 0 rows affected (0.16 sec)

--------------
select *, weight() from t where match('a')
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 5838818623640043542 | a b c     |     1319 |
| 5838818623640043543 | a b       |     1319 |
| 5838818623640043544 | a a a a a |     1180 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)

It looks most promising to make it work for RT and perhaps make it a default or expose it as searchd.local_df.

sanikolaev commented 1 month ago

The local_df issue was solved here https://github.com/manticoresoftware/manticoresearch/issues/1436

The global_idf is to be discussed and estimated.

alexiv1965 commented 1 month ago

Steps to reproduce global_idf issue (Manticore v 6.3.0):

  1. Prepare global.idf file from some working index by indextool. The only requirement is - that should be index, different from following test index, but with known keywords with idf values (which should differ to idf of keywords from test index). For the following test the only word 'bag' is needed with some non-zero idf value.
  2. Baseline:
    
    CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' 
    morphology = 'lemmatize_ru_all,lemmatize_en_all' /*global_idf = '/path/to/global.idf'*/;
    INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
    INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
    INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
    INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');

SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('burberry') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', /global_idf=1,/ ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;

SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('bag') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', /global_idf=1,/ ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;

4. Baseline results (corresponding selects):

id title brand score text_features 1677721600007 Crossbody Bag with Tassel Burberry 11843 {"bm25":891, "bm25a":0.75644487, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.86135310, "min_idf":0.43067655, "max_idf":0.43067655, "sum_idf":0.43067655, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.43067655, "atc":0.000000}], "words":[{"tf":2, "idf":0.43067655}]}

id title brand score text_features 1677721600008 Some other bag Gucci 2960 {"bm25":695, "bm25a":0.63685048, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.43067655, "min_idf":0.21533827, "max_idf":0.21533827, "sum_idf":0.21533827, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.21533827, "atc":0.000000}], "words":[{"tf":2, "idf":0.21533827}]}
1677721600007 Crossbody Bag with Tassel Burberry 2631 {"bm25":695, "bm25a":0.62822247, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.43067655, "min_idf":0.21533827, "max_idf":0.21533827, "sum_idf":0.21533827, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.21533827, "atc":0.000000}], "words":[{"tf":2, "idf":0.21533827}]}

-- mention idf values in text_features column
5. `DROP TABLE products;`
6. Test:

CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' morphology = 'lemmatize_ru_all,lemmatize_en_all' global_idf = '/path/to/global.idf'; INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry'); INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci'); INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер'); INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');

SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('burberry') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;

SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('bag') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;

7. Test results:

id title brand score text_features 1677721600011 Crossbody Bag with Tassel Burberry 0 {"bm25":500, "bm25a":0.500000, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}

id title brand score text_features 1677721600012 Some other bag Gucci 0 {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]} 1677721600011 Crossbody Bag with Tassel Burberry 0 {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}

-- mention all idf values in text_features are zero.
8. Control:

show create table products;

CREATE TABLE products ( id bigint, title text, brand text ) index_exact_words='1' index_field_lengths='1' morphology='lemmatize_ru_all,lemmatize_en_all'


-- path to global_idf is not even stored in index, so, after restart of `searchd` we'll get absolutely the same results with zero idf.
tomatolog commented 2 weeks ago

I can not reproduce issue you described. I need complete example that I could recreate locally to investigate the issue. I created test table that used for indexing and prepare global_idf gh1111.zip then posted data into MySQL

mysql -u test test < gh1111.sql

then use half of the data for actual table then other half data for global_idf table there some words matched

source src_gidf1
{
  type      = mysql
    sql_host        = 127.0.0.1
    sql_user        = root
    sql_pass        = 
    sql_port        = 3306
    sql_db      = test

  sql_query     = SELECT * FROM products where id<20
}

index gidf1
{
    source      = src_gidf1
    path      = data/gidf1
    charset_table=russian, 0..9, english, _
    index_field_lengths = 1
    index_exact_words = 1
    morphology = lemmatize_ru_all,lemmatize_en_all
    global_idf = gidf2.idf
}

source src_gidf2
{
  type      = mysql
    sql_host        = 127.0.0.1
    sql_user        = root
    sql_pass        = 
    sql_port        = 3306
    sql_db      = test

  sql_query     = SELECT * FROM products where id>20
}

index gidf2
{
    source      = src_gidf2
    path      = data/gidf2
    charset_table=russian, 0..9, english, _
    index_field_lengths = 1
    index_exact_words = 1
    morphology = lemmatize_ru_all,lemmatize_en_all
    global_idf = gidf2.idf
}

then indexed data and created global_idf

indexer.exe -c d1.conf gidf1 gidf2
indextool.exe -c d1.conf --dumpdict gidf1 --stats > gidf1_src.txt
indextool.exe -c d1.conf --dumpdict gidf2 --stats > gidf2_src.txt
indextool.exe -c d1.conf --buildidf gidf1_src.txt --out gidf1.idf
indextool.exe -c d1.conf --buildidf gidf2_src.txt --out gidf2.idf

then for queries with global idf enabled q1111-1.zip I see correct idf values

mysql -h 127.0.0.1 -P 9306 -vvv < q1111-1.sql
--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf1 WHERE MATCH('burberry') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------

+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id   | title                     | brand    | title_len | brand_len | score | text_features                                                                                                                                                                                                                                                                                                                                                                                                              |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|   11 | Crossbody Bag with Tassel | Burberry | 4         | 1         |     0 | {"bm25":891, "bm25a":0.70031464, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.86135310, "min_idf":0.43067655, "max_idf":0.43067655, "sum_idf":0.43067655, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.43067655, "atc":0.000000}], "words":[{"tf":2, "idf":0.43067655}]} |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---

--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf2 WHERE MATCH('burberry') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------

Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---

--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf1 WHERE MATCH('bag') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------

+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id   | title                     | brand    | title_len | brand_len | score | text_features                                                                                                                                                                                                                                                                                                                                                                                                              |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|   12 | Some other bag            | Gucci    | 3         | 1         |  1149 | {"bm25":632, "bm25a":0.57499516, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
|   11 | Crossbody Bag with Tassel | Burberry | 4         | 1         |   946 | {"bm25":632, "bm25a":0.56801891, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
--- 2 out of 2 results in 0ms ---

--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf2 WHERE MATCH('bag') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------

+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id   | title                     | brand    | title_len | brand_len | score | text_features                                                                                                                                                                                                                                                                                                                                                                                                              |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|   22 | uome otheu bag            | Gucui    | 3         | 1         |  1149 | {"bm25":632, "bm25a":0.57499516, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
|   21 | Crousbody Bag wuth Tausel | Burbeury | 4         | 1         |   946 | {"bm25":632, "bm25a":0.56801891, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
--- 2 out of 2 results in 0ms ---

I need complete example with global idf file and source data that I could run locally and investigate issue as for me all works fine now

alexiv1965 commented 1 week ago

Hi, I'm just return from vacations, will prepare response in several days. But it's strange for me: in Manticore 6.3.0 the path to global_idf file is not stored anywhere in index at all, so, index table just do not know that it should use it. There is no global_idf option in create table of gh1111.zip file. The following your example uses indexing tool, so, you use plain tables. But my example deals with real-time tables!

tomatolog commented 1 week ago

you could change my example to get the reproducible case or create your own that shows the issue. For now I see no issue and show you that all works as intend .

alexiv1965 commented 1 week ago

But in my comments here of 31 July there are exact steps to reproduce the problem. Pay attention on p.6: there the problem with real-time table! As to '/path/to/global.idf' - it may be built on any real-time table, even from baseline p.2. since the problem is that real-time table ignores documented real-time table creation option global_idf = '/path/to/global.idf' and do not stores this path anywhere in real-time table. Your example is not about my case completely, since it uses indexer.exe and hence - plain tables, not real-time table.

tomatolog commented 1 week ago

that is why I asked you about complete case as I tried to reproduce the case (I tried the RT indexes and plain indexes) and see no issue. I posted plain indexes case as it has all data to start with and simper setup.

If you see the issue with RT indexes please post all files along with commands or maybe a Docker container that I could run locally and see the issue.

tomatolog commented 1 week ago

the only thing that I tried the case while daemon works in the plain mode and all setup was done via config. Maybe your case related to only RT mode but I ask you to provide complete case with all files or commands to continue investigation.

alexiv1965 commented 1 week ago

OK, a bit later.

alexiv1965 commented 4 days ago

So, improved version of global_idf problem reproduction steps. The problem concerns only to real-time tables.

  1. Baseline (to compare with problem):

    CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' 
    morphology = 'lemmatize_ru_all,lemmatize_en_all' /*global_idf = '/tmp/global.idf'*/;
    INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
    INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
    INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
    INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');
    INSERT INTO products(title,brand) VALUES ('Another bag without Tassel', 'Burberry');
    INSERT INTO products(title,brand) VALUES ('Шлите яблоки', 'Краснодар');
    FLUSH RAMCHUNK products;

    (Take note - no global.idf file is used, real-time table creation)

  2. Baseline check:

    
    SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
    FROM products
    WHERE MATCH('burberry')
    LIMIT 0,200
    OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    /*global_idf=1,*/
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('шли') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', /global_idf=1,/ ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;

(no global.idf in baseline requests)

3. Baseline results:

id title brand score text_features 795472708505698309 Another bag without Tassel Burberry 7762 {"bm25":756, "bm25a":0.66703403, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.56457508, "min_idf":0.28228754, "max_idf":0.28228754, "sum_idf":0.28228754, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.28228754, "atc":0.000000}], "words":[{"tf":2, "idf":0.28228754}]} 795472708505698305 Crossbody Bag with Tassel Burberry 7762 {"bm25":756, "bm25a":0.66703403, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.56457508, "min_idf":0.28228754, "max_idf":0.28228754, "sum_idf":0.28228754, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.28228754, "atc":0.000000}], "words":[{"tf":2, "idf":0.28228754}]}

id title brand score text_features 795472708505698308 Шлите апельсины Марокко 3025 {"bm25":628, "bm25a":0.62708396, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.24755955, "min_idf":0.24755955, "max_idf":0.24755955, "sum_idf":0.24755955, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.24755955, "atc":0.000000}], "words":[{"tf":1, "idf":0.24755955}]} 795472708505698310 Шлите яблоки Краснодар 3025 {"bm25":628, "bm25a":0.62708396, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.24755955, "min_idf":0.24755955, "max_idf":0.24755955, "sum_idf":0.24755955, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.24755955, "atc":0.000000}], "words":[{"tf":1, "idf":0.24755955}]} 795472708505698307 Шла собока по рояле Ризеншнауцер 2094 {"bm25":709, "bm25a":0.60401660, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.24755955, "min_idf":0.24755955, "max_idf":0.24755955, "sum_idf":0.24755955, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.24755955, "atc":0.000000}], "words":[{"tf":1, "idf":0.24755955}]}


(Take note on idf and bm25 reasonable values in text_features column)
alexiv1965 commented 4 days ago

global_idf.zip

  1. Prepare global.idf file from baseline working index by indextool. The only requirement is - that should be index, different from following test index, but with known keywords with idf values (which should differ to idf of keywords from test index).
    ssh manticore_server
    cd /var/lib/manticore
    indextool --dumpdict products/products.0.spi --stats > products.txt
    indextool --buildidf products.txt --out global.idf

    (indextool --dumpdict products --stats (with table name) just didn't work, so undocumented filename option is used. It's another problem in indextool to solve. And indextool also didn't work when there is any other distributed table in manticore.json - it's the third problem)

alexiv1965 commented 4 days ago
  1. DROP TABLE products;

  2. Problem reproduction:

    CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' 
    morphology = 'lemmatize_ru_all,lemmatize_en_all' global_idf = '/var/lib/manticore/global.idf';
    INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
    INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
    INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
    INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');
    FLUSH RAMCHUNK products;

    (Take notice: smaller table, than in p.1, global.idf file is specified)

  3. Problem requests:

    
    SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
    FROM products
    WHERE MATCH('burberry')
    LIMIT 0,200
    OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    global_idf=1,
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('шли') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;

(Take notice: global_idf option specified in requests)

8. Problem results:

id title brand score text_features 795475224534450181 Crossbody Bag with Tassel Burberry 0 {"bm25":500, "bm25a":0.500000, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}

id title brand score text_features 795475224534450183 Шла собока по рояле Ризеншнауцер 0 {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":1, "idf":0.000000}]} 795475224534450184 Шлите апельсины Марокко 0 {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":1, "idf":0.000000}]}


(Take notice: all idf values in text_features are zero, bm25 values are significantly different (and erroneous) from baseline)
alexiv1965 commented 4 days ago
  1. Control:
    
    show create table products;

CREATE TABLE products ( id bigint, title text, brand text ) index_exact_words='1' index_field_lengths='1' morphology='lemmatize_ru_all,lemmatize_en_all'


(Take notice: path to global.idf is not even stored in index, so, after restart of searchd we'll get absolutely the same bad results with zero idf.)