manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
8.98k stars, 499 forks

make levenshtein() multibyte safe #640

Open sanikolaev opened 3 years ago

sanikolaev commented 3 years ago

levenshtein() does not seem to be multibyte safe:

mysql> select levenshtein('п', 'п');
+-------------------------+
| levenshtein('п', 'п')   |
+-------------------------+
| 0                       |
+-------------------------+
1 row in set (0.00 sec)

mysql> select levenshtein('п', 'б');
+-------------------------+
| levenshtein('п', 'б')   |
+-------------------------+
| 1                       |
+-------------------------+
1 row in set (0.00 sec)

mysql> select levenshtein('п', 'р');
+-------------------------+
| levenshtein('п', 'р')   |
+-------------------------+
| 2                       |
+-------------------------+
1 row in set (0.00 sec)

This is not uncommon (PHP's levenshtein(), for example, behaves the same way), but since Manticore is a database with rich full-text capabilities, it makes sense to make the function multibyte safe.
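To illustrate why the results above differ, here is a minimal Python sketch (not Manticore's code) that computes the distance once over UTF-8 bytes and once over Unicode code points. In UTF-8, 'п' is D0 BF, 'б' is D0 B1 and 'р' is D1 80, so a byte-wise comparison of 'п' and 'р' sees two differing bytes. The function names `byte_distance` and `char_distance` are made up for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (x != y),   # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def byte_distance(s, t):
    # Compares raw UTF-8 bytes, which is what a single-byte routine sees.
    return levenshtein(s.encode("utf-8"), t.encode("utf-8"))


def char_distance(s, t):
    # Compares Unicode code points, i.e. the multibyte-safe behaviour.
    return levenshtein(s, t)


for other in ("п", "б", "р"):
    print(other, byte_distance("п", other), char_distance("п", other))
```

The byte-wise column reproduces the 0, 1, 2 results from the queries above, while the code-point column gives 0, 1, 1.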

Related forum thread: https://forum.manticoresearch.com/t/levenshtein/878

githubmanticore commented 3 years ago

➤ Stan commented:

We already have two variants of the sphLevenshtein function, one for SBCS and one for UTF-8, and Expr_Levenshtein_c uses the SBCS variant.

We could switch to the UTF-8 sphLevenshtein function and convert both arguments to UTF-8 (one of the arguments could be a string attribute, so we cannot select the appropriate SBCS or UTF-8 variant before calling this expression with actual data). The UTF-8 sphLevenshtein will work well with either UTF-8 or SBCS source data.

However, converting the incoming data to UTF-8 could slow down expression evaluation.

On the other hand, sphLevenshtein itself is not fast and has an option for an early out, so the additional UTF-8 conversion could be insignificant. Alternatively, we could add a new option to force a specific sphLevenshtein variant, like `SELECT LEVENSHTEIN(title, j.name, {normalize=1, source_data='sbcs'}) AS dist, ...`
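As a rough illustration of the idea in this comment, the Python sketch below normalizes both arguments to sequences of code points before running a single distance routine. It is not Manticore's implementation: the `source_data` parameter name is borrowed from the proposed option above, while `to_code_points` and `levenshtein_normalized` are hypothetical helpers, and the SBCS branch simply assumes that one byte equals one character.

```python
def levenshtein(a, b):
    """Edit distance over two sequences of comparable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def to_code_points(data: bytes, source_data: str = "utf8"):
    # Hypothetical normalization step: turn raw bytes into one int per character.
    if source_data == "sbcs":
        return list(data)  # assumption: single-byte charset, one byte per character
    return [ord(ch) for ch in data.decode("utf-8")]


def levenshtein_normalized(a: bytes, b: bytes, source_data: str = "utf8"):
    # Both arguments go through the same conversion, then one routine compares them.
    return levenshtein(to_code_points(a, source_data), to_code_points(b, source_data))


utf8_result = levenshtein_normalized("п".encode("utf-8"), "р".encode("utf-8"))
sbcs_result = levenshtein_normalized(
    "п".encode("cp1251"), "р".encode("cp1251"), source_data="sbcs"
)
print(utf8_result, sbcs_result)
```

The conversion is a single linear pass over each argument, whereas the distance computation is quadratic in the argument lengths, which is consistent with the expectation that the extra conversion cost would be small.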

hrustbb2 commented 2 years ago

Your suggest doesn't work with multibyte encodings either

sanikolaev commented 2 years ago

> Your suggest doesn't work with multibyte encodings either

We need a specific example and a separate issue. It works for me:

mysql> drop table if exists t; create table t(f text) charset_table='cjk,non_cjk' min_infix_len='2'; insert into t values(0,'比较苹果和橙子'); call suggest('比较苹和橙子','t');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.28 sec)

--------------
create table t(f text) charset_table='cjk,non_cjk' min_infix_len='2'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t values(0,'比较苹果和橙子')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
call suggest('比较苹和橙子','t')
--------------

+-----------------------+----------+------+
| suggest               | distance | docs |
+-----------------------+----------+------+
| 比较苹果和橙子        | 1        | 1    |
+-----------------------+----------+------+
1 row in set (0.00 sec)