manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0

Looks like wordforms aren't indexed correctly with index_exact_words=1 and min_prefix_len #707

Open asegrenev opened 2 years ago

asegrenev commented 2 years ago

Describe the bug
According to the documentation and comments from the Manticore Search team, the option index_exact_words = 1 should cause both forms of a word from the wordforms file to be indexed.
But with index_exact_words = 1 and min_prefix_len = 2, the right-hand side of the wordform ячеек > ячейка can't be found with a match('ячейк*') query. (Our indexes use neither stemming nor lemmatization.)

To Reproduce

index options:

min_word_len = 2
min_prefix_len = 2
index_exact_words = 1

wordforms line:

ячеек > ячейка

insert line in index:

insert into test(content) values('ячеек');

Then queries:

mysql> select * from test where match('ячеек');


id
5405781461095677953

mysql> select * from test where match('ячейка');


id
5405781461095677953

mysql> select * from test where match('ячее*');


id
5405781461095677953

But unfortunately:

mysql> select * from test where match('ячейк*');
Empty set (0.01 sec)

Expected behavior


select * from test where match('ячейк*'); should find the row.

Describe the environment
- Manticore Search version: 4.0.2

tomatolog commented 2 years ago

Infix/prefix search uses the exact (original) token for matching, which is why you cannot find the transformed, stemmed, or lemmatized form of a token this way.

You could add ячейка > ячейка to the wordforms file and reindex to make sure that the search works.
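
To make the explanation above concrete, here is a toy model in Python of the behaviour described in this thread. It is only a sketch under stated assumptions, not Manticore's actual index layout: the names `build_index` and `search` are hypothetical, and the model simply assumes that full-word lookups go through the wordforms mapping while prefixes are generated from the original token only.

```python
# Toy model (NOT Manticore internals): full-word keys are stored in
# normalized form plus an exact '=original' form, while prefixes are
# built from the ORIGINAL token only, as the comment above describes.

def build_index(docs, wordforms, min_prefix_len=2):
    """Return (exact, prefixes): two dicts mapping a key to a set of doc ids."""
    exact = {}
    prefixes = {}
    for doc_id, text in docs.items():
        for token in text.split():
            normalized = wordforms.get(token, token)
            exact.setdefault(normalized, set()).add(doc_id)
            exact.setdefault("=" + token, set()).add(doc_id)
            # Prefixes come from the original token, never the mapped form.
            for n in range(min_prefix_len, len(token) + 1):
                prefixes.setdefault(token[:n], set()).add(doc_id)
    return exact, prefixes


def search(index, wordforms, query):
    """Look up a single-term query; a trailing '*' means prefix search."""
    exact, prefixes = index
    if query.endswith("*"):
        return prefixes.get(query[:-1], set())
    return exact.get(wordforms.get(query, query), set())


wordforms = {"ячеек": "ячейка"}  # the wordforms line from the report
docs = {1: "ячеек"}              # the inserted row (id shortened)
index = build_index(docs, wordforms)

for q in ["ячеек", "ячейка", "ячее*", "ячейк*"]:
    print(q, sorted(search(index, wordforms, q)))
```

In this model the first three queries return the document and `ячейк*` returns nothing, which mirrors the results in the report: the prefix `ячейк` belongs to the mapped form `ячейка`, and no prefixes were ever generated from that form.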