manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0

Global.idf file is not loaded through ALTER TABLE #2763

Open alexiv1965 opened 1 week ago

alexiv1965 commented 1 week ago

Bug Description:

Continued from issue 1111:

I have two files. The first, global.idf, is very small and was created following the reproduction steps described in the comments on issue 1111:

CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' 
    morphology = 'lemmatize_ru_all,lemmatize_en_all' global_idf = '/var/lib/manticore/global.idf';
INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');
FLUSH RAMCHUNK products;
SHOW TABLE products SETTINGS;
SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('burberry')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    global_idf=1,
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

It produces:

Variable_name   Value
settings        index_exact_words = 1\nindex_field_lengths = 1\nmorphology = lemmatize_ru_all,lemmatize_en_all\nglobal_idf = /var/lib/manticore/global.idf

id      title   brand   score   text_features
2934594772926465        Crossbody Bag with Tassel       Burberry        7031    {"bm25":732, "bm25a":0.65225315, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.51139158, "min_idf":0.25569579, "max_idf":0.25569579, "sum_idf":0.25569579, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.25569579, "atc":0.000000}], "words":[{"tf":2, "idf":0.25569579}]}

Please note "idf":0.25569579.

Now I want to switch to the second, larger file, uploaded to S3 as manticore/write-only/issue-2739/global_su.idf.gz:

ALTER TABLE products global_idf='/var/lib/manticore/global_su.idf';
SHOW TABLE products SETTINGS;

The last statement produces:

settings        index_field_lengths = 1\nmorphology = lemmatize_ru_all,lemmatize_en_all\nglobal_idf = /var/lib/manticore/global_su.idf

So the global_idf setting has changed, but the new file is not actually loaded:

SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('burberry')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    global_idf=1,
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

produces:

id      title   brand   score   text_features
2934594772926465        Crossbody Bag with Tassel       Burberry        7031    {"bm25":732, "bm25a":0.65225315, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.51139158, "min_idf":0.25569579, "max_idf":0.25569579, "sum_idf":0.25569579, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.25569579, "atc":0.000000}], "words":[{"tf":2, "idf":0.25569579}]}

-- exactly the same idf value.

OK, let's restart Manticore and repeat the last SELECT:

id      title   brand   score   text_features
2934594772926465        Crossbody Bag with Tassel       Burberry        10244   {"bm25":838, "bm25a":0.72181255, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.74502915, "min_idf":0.37251458, "max_idf":0.37251458, "sum_idf":0.37251458, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.37251458, "atc":0.000000}], "words":[{"tf":2, "idf":0.37251458}]}

Now idf has the correct value from the new file global_su.idf.
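The comparison between the three runs can also be checked mechanically. The following is only a minimal sketch: the helper name `word_idfs` is hypothetical, and the JSON strings are abbreviated copies of the text_features values from this report (only the "bm25" and "words" parts are kept). It assumes the keyword has different idf values in the two files, which the post-restart result suggests.

```python
import json


def word_idfs(text_features: str) -> list[float]:
    """Return the per-keyword idf values from a packedfactors JSON string."""
    return [word["idf"] for word in json.loads(text_features)["words"]]


# Abbreviated text_features values from the three SELECT runs in this report.
before_alter = '{"bm25":732, "words":[{"tf":2, "idf":0.25569579}]}'
after_alter = '{"bm25":732, "words":[{"tf":2, "idf":0.25569579}]}'
after_restart = '{"bm25":838, "words":[{"tf":2, "idf":0.37251458}]}'

# If the new global_idf file had been loaded, the idf values would differ.
alter_took_effect = word_idfs(before_alter) != word_idfs(after_alter)
restart_took_effect = word_idfs(before_alter) != word_idfs(after_restart)

print(alter_took_effect, restart_took_effect)  # prints: False True
```

An unchanged idf after ALTER TABLE, followed by a changed one after the restart, is the symptom reported here.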

Manticore Search Version:

commit 1611667, which fixes issue 1111

Operating System Version:

Ubuntu 22.04

Have you tried the latest development version?

Yes

Internal Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [ ] Implementation completed
- [ ] Tests developed
- [ ] Documentation updated
- [ ] Documentation reviewed
- [ ] [Changelog](https://docs.google.com/spreadsheets/d/1mz_3dRWKs86FjRF7EIZUziUDK_2Hvhd97G0pLpxo05s/edit?pli=1&gid=1102439133) updated

klirichek commented 1 week ago

Does a slightly different scenario make a difference? That is, if you 1) create the table, 2) alter it, 3) insert the data, 4) run your query?

(In other words, if you alter the table BEFORE inserting the data rather than AFTER, so that the rows are inserted into the already altered table, does the result change?)

alexiv1965 commented 1 week ago

Yes, I tried the scenario you proposed, and the results are exactly the same as in my case:

id      title   brand   score   text_features
1372028895200542721     Crossbody Bag with Tassel       Burberry        7031    {"bm25":732, "bm25a":0.65225315, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.51139158, "min_idf":0.25569579, "max_idf":0.25569579, "sum_idf":0.25569579, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.25569579, "atc":0.000000}], "words":[{"tf":2, "idf":0.25569579}]}

I.e. the idf comes from the small global.idf, not from the large file set via ALTER TABLE. Only restarting Manticore helps.