Open pavelnemirovsky opened 1 year ago
could you copy index files from nodes where values differ for testing?
Could you check that disk chunks count the same? Could you issue your query without pseudo sharding \ option threads = 1 to check the result ?
@tomatolog
Config:
# https://github.com/manticoresoftware/manticoresearch/blob/master/manual/Server_settings/Searchd.md#node_address
common {
# https://manual.manticoresearch.com/Server_settings/Common#lemmatizer_base
lemmatizer_base = /usr/share/manticore/nlp/
# https://manual.manticoresearch.com/Server_settings/Common#progressive_merge
# progressive_merge =
# https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_keynames
# json_autoconv_keynames =
# https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_numbers
# json_autoconv_numbers = 0
# https://manual.manticoresearch.com/Server_settings/Common#on_json_attr_error
# on_json_attr_error = ignore_attr
# plugin_dir =
}
searchd {
server_id = manticore-03.dmetrics.internal
listen = 127.0.0.1:9306:mysql
listen = 127.0.0.1:9308:http
listen = 10.0.82.150:9312
listen = 10.0.82.150:9306:mysql
listen = 10.0.82.150:9308:http
listen = 10.0.82.150:9360-9370:replication
pid_file = /var/run/manticore/searchd.pid
max_packet_size = 128M
binlog_flush = 1 # ultimate safety, low speed
query_log_format = sphinxql
mysql_version_string = 5.0.37
data_dir = /var/lib/data/manticore
binlog_path = /var/lib/data/manticore
log = /var/log/manticore/searchd.log
query_log = /var/log/manticore/query.log
attr_flush_period = 60
# Flushing RT RAM chunks each 5 min
rt_flush_period = 300
preopen_indexes = 0
not_terms_only_allowed = 1
}
Single Thread:
# mysql -h manticore-01.dmetrics.internal -P 9306
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 option threads = 1;
+-----------+-----------+----------+
| min_score | max_score | count(*) |
+-----------+-----------+----------+
| 1603 | 1769 | 36392 |
+-----------+-----------+----------+
1 row in set (0.15 sec)
mysql> SHOW INDEX fgi_prod STATUS;
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| index_type | rt |
| indexed_documents | 14139585 |
| indexed_bytes | 74543841680 |
| ram_bytes | 19916679448 |
| disk_bytes | 90059646941 |
| disk_mapped | 23217024968 |
| disk_mapped_cached | 18921156608 |
| disk_mapped_doclists | 0 |
| disk_mapped_cached_doclists | 0 |
| disk_mapped_hitlists | 0 |
| disk_mapped_cached_hitlists | 0 |
| killed_documents | 0 |
| killed_rate | 0.00% |
| ram_chunk | 995385488 |
| ram_chunk_segments_count | 28 |
| disk_chunks | 32 |
| mem_limit | 2147483648 |
| mem_limit_rate | 93.72% |
| ram_bytes_retired | 0 |
| tid | 145999 |
| tid_saved | 145999 |
| query_time_1min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| query_time_5min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| query_time_15min | {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001} |
| query_time_total | {"queries":33512, "avg_sec":0.078, "min_sec":0.001, "max_sec":0.161, "pct95_sec":0.112, "pct99_sec":0.126} |
| found_rows_1min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| found_rows_5min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| found_rows_15min | {"queries":1, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1} |
| found_rows_total | {"queries":33512, "avg":98644, "min":0, "max":192349, "pct95":141003, "pct99":156775} |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
# mysql -h manticore-02.dmetrics.internal -P 9306
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 option threads = 1;
+-----------+-----------+----------+
| min_score | max_score | count(*) |
+-----------+-----------+----------+
| 1604 | 1773 | 36392 |
+-----------+-----------+----------+
1 row in set (0.15 sec)
mysql> SHOW INDEX fgi_prod STATUS;
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| index_type | rt |
| indexed_documents | 14139585 |
| indexed_bytes | 74543841680 |
| ram_bytes | 19832797718 |
| disk_bytes | 90049699488 |
| disk_mapped | 23232896729 |
| disk_mapped_cached | 18917761024 |
| disk_mapped_doclists | 0 |
| disk_mapped_cached_doclists | 0 |
| disk_mapped_hitlists | 0 |
| disk_mapped_cached_hitlists | 0 |
| killed_documents | 0 |
| killed_rate | 0.00% |
| ram_chunk | 914899342 |
| ram_chunk_segments_count | 28 |
| disk_chunks | 32 |
| mem_limit | 2147483648 |
| mem_limit_rate | 95.00% |
| ram_bytes_retired | 0 |
| tid | 145999 |
| tid_saved | 145999 |
| query_time_1min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| query_time_5min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| query_time_15min | {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001} |
| query_time_total | {"queries":32779, "avg_sec":0.087, "min_sec":0.001, "max_sec":0.294, "pct95_sec":0.154, "pct99_sec":0.185} |
| found_rows_1min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| found_rows_5min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| found_rows_15min | {"queries":1, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1} |
| found_rows_total | {"queries":32779, "avg":99523, "min":0, "max":205374, "pct95":141003, "pct99":192349} |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
# mysql -h manticore-03.dmetrics.internal -P 9306
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 option threads = 1;
+-----------+-----------+----------+
| min_score | max_score | count(*) |
+-----------+-----------+----------+
| 1602 | 1782 | 36392 |
+-----------+-----------+----------+
1 row in set (0.15 sec)
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| index_type | rt |
| indexed_documents | 14139586 |
| indexed_bytes | 74543841680 |
| ram_bytes | 18465132812 |
| disk_bytes | 90025432598 |
| disk_mapped | 23290852358 |
| disk_mapped_cached | 17833938944 |
| disk_mapped_doclists | 0 |
| disk_mapped_cached_doclists | 0 |
| disk_mapped_hitlists | 0 |
| disk_mapped_cached_hitlists | 0 |
| killed_documents | 0 |
| killed_rate | 0.00% |
| ram_chunk | 631056516 |
| ram_chunk_segments_count | 29 |
| disk_chunks | 32 |
| mem_limit | 2147483648 |
| mem_limit_rate | 93.18% |
| ram_bytes_retired | 0 |
| tid | 145999 |
| tid_saved | 145999 |
| query_time_1min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| query_time_5min | {"queries":1, "avg_sec":0.002, "min_sec":0.002, "max_sec":0.002, "pct95_sec":0.002, "pct99_sec":0.002} |
| query_time_15min | {"queries":2, "avg_sec":0.002, "min_sec":0.002, "max_sec":0.002, "pct95_sec":0.002, "pct99_sec":0.002} |
| query_time_total | {"queries":33076, "avg_sec":0.079, "min_sec":0.001, "max_sec":0.213, "pct95_sec":0.117, "pct99_sec":0.130} |
| found_rows_1min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| found_rows_5min | {"queries":1, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1} |
| found_rows_15min | {"queries":2, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1} |
| found_rows_total | {"queries":33076, "avg":97746, "min":0, "max":192349, "pct95":141003, "pct99":156775} |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
could you issue these queries at all nodes and provide results sets from every node?
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
seems your index got different data however it is not clear how it got into such state and what is the difference.
@tomatolog
Node #1
root@manticore-01:~# mysql -h0 -P9306
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
+---------------------+-------+
| id | score |
+---------------------+-------+
| 2592984993446342688 | 1603 |
| 4662220671500427390 | 1603 |
| 3129968330644052828 | 1603 |
| 3300479849082479870 | 1603 |
| 4265425771834879270 | 1603 |
| 6161236734625651487 | 1603 |
| 2972218032951538573 | 1603 |
| 2230410452443263952 | 1603 |
| 5426145563682917845 | 1603 |
| 5443436492395652225 | 1603 |
| 8123348241364266641 | 1603 |
| 1209174545758837824 | 1603 |
| 2717219431518028807 | 1603 |
| 1585252769924152895 | 1603 |
| 5552902210274204647 | 1603 |
| 1980593527404316223 | 1603 |
| 1481372055290767673 | 1603 |
| 7427455941233620045 | 1603 |
| 2506307317936132088 | 1603 |
| 8752126551063260322 | 1603 |
+---------------------+-------+
20 rows in set (0.02 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
Empty set (0.01 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
+---------------------+-------+
| id | score |
+---------------------+-------+
| 8200229092940547929 | 1604 |
| 2120942744610469628 | 1604 |
| 6923035898495548477 | 1604 |
| 6601064453919589115 | 1604 |
| 6735606320642119811 | 1604 |
| 2999792835472411662 | 1604 |
| 3654574537062196888 | 1604 |
| 3773264351158118416 | 1604 |
| 8637578311057562584 | 1604 |
| 8794164784460149655 | 1604 |
| 6059877165305848918 | 1604 |
| 3282906016831256748 | 1604 |
| 2612261472663055387 | 1604 |
| 2690877269429919795 | 1604 |
| 2847097192829816046 | 1604 |
| 5942945371882851678 | 1604 |
| 2474220990740776572 | 1604 |
| 733163232415140441 | 1604 |
| 2513753592768196340 | 1604 |
| 8773490989255665275 | 1604 |
+---------------------+-------+
20 rows in set (0.00 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
Empty set (0.00 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
+---------------------+-------+
| id | score |
+---------------------+-------+
| 4873968605738893745 | 1602 |
| 2171153676825932229 | 1602 |
| 3622786119719300160 | 1602 |
| 3049536713905632928 | 1602 |
| 6107640006935923479 | 1602 |
| 801981398053586315 | 1602 |
| 1960698561719059790 | 1602 |
| 1343761795668256013 | 1602 |
| 4622984669984555319 | 1602 |
| 1545080798186495944 | 1602 |
| 2098591267684858063 | 1602 |
| 4497645853033140285 | 1602 |
| 8946551953652792236 | 1602 |
| 3248487345969809604 | 1602 |
| 2919116455063626548 | 1602 |
| 8727190087137721890 | 1602 |
| 2513347835042179078 | 1602 |
| 5636863017215536826 | 1602 |
| 3172480916545321926 | 1602 |
| 7111135492437432234 | 1602 |
+---------------------+-------+
20 rows in set (0.00 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
Empty set (0.00 sec)
Node #2
root@manticore-02:~# mysql -h0 -P9306
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
Empty set (0.01 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
Empty set (0.01 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
Empty set (0.01 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
Empty set (0.00 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
+---------------------+-------+
| id | score |
+---------------------+-------+
| 8892694543726138219 | 1602 |
| 629335885271522399 | 1602 |
| 1456549821258029774 | 1602 |
| 7249953765864106116 | 1602 |
| 2786707208645515768 | 1602 |
| 5261557492145362518 | 1602 |
| 2833376159662097650 | 1602 |
| 8595731973147110735 | 1602 |
| 2253945765483571809 | 1602 |
| 4997605205840460661 | 1602 |
| 4319783924984481239 | 1602 |
| 4315258191512044085 | 1602 |
| 371245259292705244 | 1602 |
| 4189238752045049628 | 1602 |
| 4177900041508324215 | 1602 |
| 5487895812895329158 | 1602 |
| 1083832472836028683 | 1602 |
| 2432922389897552564 | 1602 |
| 1665163043726892822 | 1602 |
| 1831702418817682063 | 1602 |
+---------------------+-------+
20 rows in set (0.00 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
Empty set (0.00 sec)
Node #3
root@manticore-03:~# mysql -h0 -P9306
mysql> ¬select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
ERROR 1064 (42000): sphinxql: syntax error, unexpected $end near '¬select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603'
mysql>
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
+---------------------+-------+
| id | score |
+---------------------+-------+
| 2592984993446342688 | 1603 |
| 4662220671500427390 | 1603 |
| 3129968330644052828 | 1603 |
| 3300479849082479870 | 1603 |
| 4265425771834879270 | 1603 |
| 6161236734625651487 | 1603 |
| 2972218032951538573 | 1603 |
| 2230410452443263952 | 1603 |
| 5426145563682917845 | 1603 |
| 5443436492395652225 | 1603 |
| 8123348241364266641 | 1603 |
| 1209174545758837824 | 1603 |
| 2717219431518028807 | 1603 |
| 1585252769924152895 | 1603 |
| 5552902210274204647 | 1603 |
| 1980593527404316223 | 1603 |
| 1481372055290767673 | 1603 |
| 7427455941233620045 | 1603 |
| 2506307317936132088 | 1603 |
| 8752126551063260322 | 1603 |
+---------------------+-------+
20 rows in set (0.01 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
Empty set (0.01 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
+---------------------+-------+
| id | score |
+---------------------+-------+
| 3278299437838253517 | 1604 |
| 831363264692743711 | 1604 |
| 4097074020702646877 | 1604 |
| 502386133080193092 | 1604 |
| 2438302740824218206 | 1604 |
| 2434101175870422770 | 1604 |
| 6705524661416092133 | 1604 |
| 1567232811339767193 | 1604 |
| 3999944616016432641 | 1604 |
| 1306726073512346320 | 1604 |
| 5297381330670067881 | 1604 |
| 9017267096864895454 | 1604 |
| 8994162228783625519 | 1604 |
| 7889636414790414500 | 1604 |
| 4636895361539966838 | 1604 |
| 8006576869797910500 | 1604 |
| 430197624393949370 | 1604 |
| 3386716024388158741 | 1604 |
| 4022663738521285142 | 1604 |
| 3264531810452662099 | 1604 |
+---------------------+-------+
20 rows in set (0.00 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
Empty set (0.01 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
+---------------------+-------+
| id | score |
+---------------------+-------+
| 4873968605738893745 | 1602 |
| 2171153676825932229 | 1602 |
| 3622786119719300160 | 1602 |
| 3049536713905632928 | 1602 |
| 6107640006935923479 | 1602 |
| 801981398053586315 | 1602 |
| 1960698561719059790 | 1602 |
| 1343761795668256013 | 1602 |
| 4622984669984555319 | 1602 |
| 1545080798186495944 | 1602 |
| 2098591267684858063 | 1602 |
| 4497645853033140285 | 1602 |
| 8946551953652792236 | 1602 |
| 3248487345969809604 | 1602 |
| 2919116455063626548 | 1602 |
| 8727190087137721890 | 1602 |
| 2513347835042179078 | 1602 |
| 5636863017215536826 | 1602 |
| 3172480916545321926 | 1602 |
| 7111135492437432234 | 1602 |
+---------------------+-------+
20 rows in set (0.01 sec)
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
Empty set (0.01 sec)
@tomatolog sorry for delay
Document weight is a function of multiple variables including IDF which depends on the number of documents in the table/disk chunk/ram chunk and in case of an RT index it's calculated separately for each disk chunk, so even without replication the same query against the same documents can give different weights and even different documents order depending on the distribution of the documents in the chunks:
mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'),('a b'),('a a a a a'); flush ramchunk t; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.03 sec)
--------------
create table t(f text)
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t(f) values('a b c'),('a b'),('a a a a a')
--------------
Query OK, 3 rows affected (0.01 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
select *, weight() from t where match('a')
--------------
+---------------------+-----------+----------+
| id | f | weight() |
+---------------------+-----------+----------+
| 1515364055206854670 | a b c | 1319 |
| 1515364055206854671 | a b | 1319 |
| 1515364055206854672 | a a a a a | 1180 |
+---------------------+-----------+----------+
3 rows in set (0.01 sec)
mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'); flush ramchunk t; insert into t(f) values('a b'); flush ramchunk t; insert into t(f) values('a a a a a'); flush ramchunk t; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
create table t(f text)
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
insert into t(f) values('a b c')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
insert into t(f) values('a b')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
insert into t(f) values('a a a a a')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
select *, weight() from t where match('a')
--------------
+---------------------+-----------+----------+
| id | f | weight() |
+---------------------+-----------+----------+
| 1515364055206854675 | a a a a a | 1819 |
| 1515364055206854673 | a b c | 1680 |
| 1515364055206854674 | a b | 1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)
We should consider integration of https://manual.manticoresearch.com/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#global_idf into RT indexes.
There's also local_df and it's not working for RT tables:
mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'); flush ramchunk t; insert into t(f) values('a b'); flush ramchunk t; insert into t(f) values('a a a a a'); flush ramchunk t; select *, weight() from t where match('a'); select *, weight() from t where match('a') option local_df=1; optimize table t option sync=1, cutoff=1; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.04 sec)
--------------
create table t(f text)
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t(f) values('a b c')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t(f) values('a b')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
insert into t(f) values('a a a a a')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
select *, weight() from t where match('a')
--------------
+---------------------+-----------+----------+
| id | f | weight() |
+---------------------+-----------+----------+
| 5838818623640043544 | a a a a a | 1819 |
| 5838818623640043542 | a b c | 1680 |
| 5838818623640043543 | a b | 1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)
--------------
select *, weight() from t where match('a') option local_df=1
--------------
+---------------------+-----------+----------+
| id | f | weight() |
+---------------------+-----------+----------+
| 5838818623640043544 | a a a a a | 1819 |
| 5838818623640043542 | a b c | 1680 |
| 5838818623640043543 | a b | 1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)
--------------
optimize table t option sync=1, cutoff=1
--------------
Query OK, 0 rows affected (0.16 sec)
--------------
select *, weight() from t where match('a')
--------------
+---------------------+-----------+----------+
| id | f | weight() |
+---------------------+-----------+----------+
| 5838818623640043542 | a b c | 1319 |
| 5838818623640043543 | a b | 1319 |
| 5838818623640043544 | a a a a a | 1180 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)
It looks most promising to make it work for RT and perhaps make it a default or expose it as searchd.local_df
.
The local_df issue was solved here https://github.com/manticoresoftware/manticoresearch/issues/1436
The global_idf is to be discussed and estimated.
Steps to reproduce global_idf issue (Manticore v 6.3.0):
CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1'
morphology = 'lemmatize_ru_all,lemmatize_en_all' /*global_idf = '/path/to/global.idf'*/;
INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');
SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('burberry') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', /global_idf=1,/ ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;
SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('bag') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', /global_idf=1,/ ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;
4. Baseline results (corresponding selects):
id title brand score text_features 1677721600007 Crossbody Bag with Tassel Burberry 11843 {"bm25":891, "bm25a":0.75644487, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.86135310, "min_idf":0.43067655, "max_idf":0.43067655, "sum_idf":0.43067655, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.43067655, "atc":0.000000}], "words":[{"tf":2, "idf":0.43067655}]}
id title brand score text_features
1677721600008 Some other bag Gucci 2960 {"bm25":695, "bm25a":0.63685048, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.43067655, "min_idf":0.21533827, "max_idf":0.21533827, "sum_idf":0.21533827, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.21533827, "atc":0.000000}], "words":[{"tf":2, "idf":0.21533827}]}
1677721600007 Crossbody Bag with Tassel Burberry 2631 {"bm25":695, "bm25a":0.62822247, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.43067655, "min_idf":0.21533827, "max_idf":0.21533827, "sum_idf":0.21533827, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.21533827, "atc":0.000000}], "words":[{"tf":2, "idf":0.21533827}]}
-- mention idf values in text_features column
5. `DROP TABLE products;`
6. Test:
CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' morphology = 'lemmatize_ru_all,lemmatize_en_all' global_idf = '/path/to/global.idf'; INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry'); INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci'); INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер'); INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');
SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('burberry') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;
SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('bag') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;
7. Test results:
id title brand score text_features 1677721600011 Crossbody Bag with Tassel Burberry 0 {"bm25":500, "bm25a":0.500000, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}
id title brand score text_features 1677721600012 Some other bag Gucci 0 {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]} 1677721600011 Crossbody Bag with Tassel Burberry 0 {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}
-- mention all idf values in text_features are zero.
8. Control:
show create table products;
CREATE TABLE products ( id bigint, title text, brand text ) index_exact_words='1' index_field_lengths='1' morphology='lemmatize_ru_all,lemmatize_en_all'
-- path to global_idf is not even stored in index, so, after restart of `searchd` we'll get absolutely the same results with zero idf.
I can not reproduce issue you described. I need complete example that I could recreate locally to investigate the issue. I created test table that used for indexing and prepare global_idf gh1111.zip then posted data into MySQL
mysql -u test test < gh1111.sql
then use half of the data for actual table then other half data for global_idf table there some words matched
source src_gidf1
{
type = mysql
sql_host = 127.0.0.1
sql_user = root
sql_pass =
sql_port = 3306
sql_db = test
sql_query = SELECT * FROM products where id<20
}
index gidf1
{
source = src_gidf1
path = data/gidf1
charset_table=russian, 0..9, english, _
index_field_lengths = 1
index_exact_words = 1
morphology = lemmatize_ru_all,lemmatize_en_all
global_idf = gidf2.idf
}
source src_gidf2
{
type = mysql
sql_host = 127.0.0.1
sql_user = root
sql_pass =
sql_port = 3306
sql_db = test
sql_query = SELECT * FROM products where id>20
}
index gidf2
{
source = src_gidf2
path = data/gidf2
charset_table=russian, 0..9, english, _
index_field_lengths = 1
index_exact_words = 1
morphology = lemmatize_ru_all,lemmatize_en_all
global_idf = gidf2.idf
}
then indexed data and created global_idf
indexer.exe -c d1.conf gidf1 gidf2
indextool.exe -c d1.conf --dumpdict gidf1 --stats > gidf1_src.txt
indextool.exe -c d1.conf --dumpdict gidf2 --stats > gidf2_src.txt
indextool.exe -c d1.conf --buildidf gidf1_src.txt --out gidf1.idf
indextool.exe -c d1.conf --buildidf gidf2_src.txt --out gidf2.idf
then for queries with global idf enabled q1111-1.zip I see correct idf values
mysql -h 127.0.0.1 -P 9306 -vvv < q1111-1.sql
--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf1 WHERE MATCH('burberry') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | title | brand | title_len | brand_len | score | text_features |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 11 | Crossbody Bag with Tassel | Burberry | 4 | 1 | 0 | {"bm25":891, "bm25a":0.70031464, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.86135310, "min_idf":0.43067655, "max_idf":0.43067655, "sum_idf":0.43067655, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.43067655, "atc":0.000000}], "words":[{"tf":2, "idf":0.43067655}]} |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---
--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf2 WHERE MATCH('burberry') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------
Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---
--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf1 WHERE MATCH('bag') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | title | brand | title_len | brand_len | score | text_features |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 12 | Some other bag | Gucci | 3 | 1 | 1149 | {"bm25":632, "bm25a":0.57499516, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
| 11 | Crossbody Bag with Tassel | Burberry | 4 | 1 | 946 | {"bm25":632, "bm25a":0.56801891, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
--- 2 out of 2 results in 0ms ---
--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf2 WHERE MATCH('bag') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | title | brand | title_len | brand_len | score | text_features |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 22 | uome otheu bag | Gucui | 3 | 1 | 1149 | {"bm25":632, "bm25a":0.57499516, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
| 21 | Crousbody Bag wuth Tausel | Burbeury | 4 | 1 | 946 | {"bm25":632, "bm25a":0.56801891, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
--- 2 out of 2 results in 0ms ---
I need complete example with global idf file and source data that I could run locally and investigate issue as for me all works fine now
Hi, I'm just return from vacations, will prepare response in several days. But it's strange for me: in Manticore 6.3.0 the path to global_idf file is not stored anywhere in index at all, so, index table just do not know that it should use it. There is no global_idf option in create table of gh1111.zip file. The following your example uses indexing tool, so, you use plain tables. But my example deals with real-time tables!
you could change my example to get the reproducible case or create your own that shows the issue. For now I see no issue and show you that all works as intend .
But in my comments here of 31 July there are exact steps to reproduce the problem. Pay attention on p.6: there the problem with real-time table! As to '/path/to/global.idf' - it may be built on any real-time table, even from baseline p.2. since the problem is that real-time table ignores documented real-time table creation option global_idf = '/path/to/global.idf' and do not stores this path anywhere in real-time table. Your example is not about my case completely, since it uses indexer.exe and hence - plain tables, not real-time table.
that is why I asked you about complete case as I tried to reproduce the case (I tried the RT indexes and plain indexes) and see no issue. I posted plain indexes case as it has all data to start with and simper setup.
If you see the issue with RT indexes please post all files along with commands or maybe a Docker container that I could run locally and see the issue.
the only thing that I tried the case while daemon works in the plain mode and all setup was done via config. Maybe your case related to only RT mode but I ask you to provide complete case with all files or commands to continue investigation.
OK, a bit later.
So, improved version of global_idf problem reproduction steps. The problem concerns only to real-time tables.
Baseline (to compare with problem):
CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1'
morphology = 'lemmatize_ru_all,lemmatize_en_all' /*global_idf = '/tmp/global.idf'*/;
INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');
INSERT INTO products(title,brand) VALUES ('Another bag without Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Шлите яблоки', 'Краснодар');
FLUSH RAMCHUNK products;
(Take note - no global.idf file is used, real-time table creation)
Baseline check:
SELECT id, title, brand, weight() as score,
packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('burberry')
LIMIT 0,200
OPTION
max_matches=200,
idf='plain,tfidf_unnormalized',
/*global_idf=1,*/
ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
max_query_time=600;
SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('шли') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', /global_idf=1,/ ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;
(no global.idf in baseline requests)
3. Baseline results:
id title brand score text_features 795472708505698309 Another bag without Tassel Burberry 7762 {"bm25":756, "bm25a":0.66703403, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.56457508, "min_idf":0.28228754, "max_idf":0.28228754, "sum_idf":0.28228754, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.28228754, "atc":0.000000}], "words":[{"tf":2, "idf":0.28228754}]} 795472708505698305 Crossbody Bag with Tassel Burberry 7762 {"bm25":756, "bm25a":0.66703403, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.56457508, "min_idf":0.28228754, "max_idf":0.28228754, "sum_idf":0.28228754, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.28228754, "atc":0.000000}], "words":[{"tf":2, "idf":0.28228754}]}
id title brand score text_features 795472708505698308 Шлите апельсины Марокко 3025 {"bm25":628, "bm25a":0.62708396, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.24755955, "min_idf":0.24755955, "max_idf":0.24755955, "sum_idf":0.24755955, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.24755955, "atc":0.000000}], "words":[{"tf":1, "idf":0.24755955}]} 795472708505698310 Шлите яблоки Краснодар 3025 {"bm25":628, "bm25a":0.62708396, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.24755955, "min_idf":0.24755955, "max_idf":0.24755955, "sum_idf":0.24755955, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.24755955, "atc":0.000000}], "words":[{"tf":1, "idf":0.24755955}]} 795472708505698307 Шла собока по рояле Ризеншнауцер 2094 {"bm25":709, "bm25a":0.60401660, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.24755955, "min_idf":0.24755955, "max_idf":0.24755955, "sum_idf":0.24755955, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.24755955, "atc":0.000000}], "words":[{"tf":1, "idf":0.24755955}]}
(Take note on idf and bm25 reasonable values in text_features column)
ssh manticore_server
cd /var/lib/manticore
indextool --dumpdict products/products.0.spi --stats > products.txt
indextool --buildidf products.txt --out global.idf
(indextool --dumpdict products --stats (with table name) just didn't work, so undocumented filename option is used. It's another problem in indextool to solve. And indextool also didn't work when there is any other distributed table in manticore.json - it's the third problem)
DROP TABLE products;
Problem reproduction:
CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1'
morphology = 'lemmatize_ru_all,lemmatize_en_all' global_idf = '/var/lib/manticore/global.idf';
INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');
FLUSH RAMCHUNK products;
(Take notice: smaller table, than in p.1, global.idf file is specified)
Problem requests:
SELECT id, title, brand, weight() as score,
packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('burberry')
LIMIT 0,200
OPTION
max_matches=200,
idf='plain,tfidf_unnormalized',
global_idf=1,
ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
max_query_time=600;
SELECT id, title, brand, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM products WHERE MATCH('шли') LIMIT 0,200 OPTION max_matches=200, idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0(1000bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), / any ranker with bm25 / max_query_time=600;
(Take notice: global_idf option specified in requests)
8. Problem results:
id title brand score text_features 795475224534450181 Crossbody Bag with Tassel Burberry 0 {"bm25":500, "bm25a":0.500000, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}
id title brand score text_features 795475224534450183 Шла собока по рояле Ризеншнауцер 0 {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":1, "idf":0.000000}]} 795475224534450184 Шлите апельсины Марокко 0 {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":1, "idf":0.000000}]}
(Take notice: all idf values in text_features are zero, bm25 values are significantly different (and erroneous) from baseline)
show create table products;
CREATE TABLE products ( id bigint, title text, brand text ) index_exact_words='1' index_field_lengths='1' morphology='lemmatize_ru_all,lemmatize_en_all'
(Take notice: path to global.idf is not even stored in index, so, after restart of searchd we'll get absolutely the same bad results with zero idf.)
Describe the bug
We have a cluster consisting of 3 nodes, and I have been running the same command on each node. However, I have noticed that I am getting different scores as results. Could you please provide an explanation for this phenomenon?
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Expected idempotent results across all cluster nodes.
Describe the environment:
bin/searchd -v
orbin/indexer -v
):uname -a
if on a Unix-like system): Linux manticore-03.dmetrics.internal 5.4.0-1097-aws #105~18.04.1-Ubuntu SMP Mon Feb 13 17:50:57 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux