manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
9.04k stars 507 forks source link

CALL KEYWORDS / CALL SUGGEST loses umlauts #584

Closed bcat-eu closed 3 years ago

bcat-eu commented 3 years ago

Describe the bug CALL KEYWORDS and CALL SUGGEST results for German words with umlauts do not contain these umlauts which makes these features useless for the case described in the documentation https://manual.manticoresearch.com/Searching/Autocomplete#Autocomplete

To Reproduce Here is the table (I tried changing its settings and also originally used bigram_index):

mysql> SHOW CREATE TABLE phrases;
+---------+-------------------------------------------------------------------------------------------------------------------------------+
| Table   | Create Table                                                                                                                  |
+---------+-------------------------------------------------------------------------------------------------------------------------------+
| phrases | CREATE TABLE phrases (
title text
) min_infix_len='3' index_exact_words='1' morphology='lemmatize_de_all' expand_keywords='1' |
+---------+-------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0,00 sec)

SELECT query delivers some results with umlauts:

mysql> SELECT * FROM phrases WHERE MATCH('Begründ');
+------+------------------------------------------------------------------------------------------------------------------------------+
| id   | title                                                                                                                        |
+------+------------------------------------------------------------------------------------------------------------------------------+
| 2391 | Begründungspflicht eines Unsicherheitszuschlags                                                                              |
| 3237 | Gestaltungen bei Begründung, Erweiterung und Beendigung einer Betriebsaufspaltung                                            |
| 1790 | Nachträgliche Zinsvereinbarung ist wertbegründend                                                                            |
|  456 | Keine Rückstellung für die ausschließlich gesellschaftsvertraglich begründete Pflicht zur Prüfung des Jahresabschlusses      |
+------+------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0,00 sec)

CALL KEYWORDS query loses them (I tried all the options):

mysql> CALL KEYWORDS('Begründ*', 'phrases', 1 as stats, 'hits' as sort_mode);
+------+-----------+---------------------+------+------+
| qpos | tokenized | normalized          | docs | hits |
+------+-----------+---------------------+------+------+
| 1    | begrund*  | =begrundete         | 1    | 1    |
| 1    | begrund*  | =begrundung         | 1    | 1    |
| 1    | begrund*  | =begrundungspflicht | 1    | 1    |
+------+-----------+---------------------+------+------+
3 rows in set (0,01 sec)

Same with CALL SUGGEST, basically it suggests some wrong words like "begrundung" instead of "Begründung":

mysql> CALL SUGGEST('Begründ*', 'phrases');
+------------+----------+------+
| suggest    | distance | docs |
+------------+----------+------+
| begrenzt   | 3        | 2    |
| bestunde   | 3        | 1    |
| begrundete | 3        | 1    |
| begrundung | 3        | 1    |
| aufgrund   | 4        | 32   |
+------------+----------+------+
5 rows in set (0,00 sec)

Expected behavior Since there is German support I expected to be able to get suggestions with a full set of German letters, considering documentation recommends using these queries for features like autocomplete and spelling correction. Am I doing something wrong or is it a bug / limitation?

Describe the environment:

$ searchd -v
Manticore 3.6.0 96d61d8bf@210504 release
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2021, Manticore Software LTD (https://manticoresearch.com)
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:    20.04
Codename:   focal
barryhunter commented 3 years ago

The default charset_table is "non_cjk" . This default charset does 'character folding' for most languages (including German), but excluding the cjk langauges! This folds nearly all 'letters' to plain lowercase chars, to enable 'case+umlaut etc' insensitive searching!

You could just use the charset_table=english. Which only does basic folding in plain ascii (basically only folding uppercase to lowercase). But this does mean all umlauts etc are just treated as separator.

You may need a custom charset_table, which includes all the characters you want to main into the keyword index. Would need to list all the umlauts you want to keep as word charactors, optionally doing it in case insensitive manner.

https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Low-level_tokenization#charset_table

(you 'see' the umlauts in the result, as the tokenization only happens as part of the keyword indexing processing. ie the keyword inverted index. Stored fields (or string attributes) are stored as is and undergo no transformations. )

sanikolaev commented 3 years ago

charset_table='non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF' should do the trick for the German language since only 4 characters seem to require correction and this just maps capital letters to small letters and defines the small letters themselves (with no mapping), but the important thing to know is that you can override characters in charset_table. With this charset_table here's how different the results will be in your case:

mysql> create table phrases(title text) min_infix_len='3' index_exact_words='1' morphology='lemmatize_de_all' expand_keywords='1' charset_table='non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF';
Query OK, 0 rows affected (0.04 sec)

mysql> insert into phrases values(2391,'Begründungspflicht eines Unsicherheitszuschlags'),(3237,'Gestaltungen bei Begründung, Erweiterung und Beendigung einer Betriebsaufspaltung'),(1790,'Nachträgliche Zinsvereinbarung ist wertbegründend'),(456,'Keine Rückstellung für die ausschließlich gesellschaftsvertraglich begründete Pflicht zur Prüfung des Jahresabschlusses');
Query OK, 4 rows affected (0.00 sec)

mysql> SELECT * FROM phrases WHERE MATCH('Begründ');                                                                                                                        +------+------------------------------------------------------------------------------------------------------------------------------+
| id   | title                                                                                                                        |
+------+------------------------------------------------------------------------------------------------------------------------------+
| 2391 | Begründungspflicht eines Unsicherheitszuschlags                                                                              |
| 3237 | Gestaltungen bei Begründung, Erweiterung und Beendigung einer Betriebsaufspaltung                                            |
| 1790 | Nachträgliche Zinsvereinbarung ist wertbegründend                                                                            |
|  456 | Keine Rückstellung für die ausschließlich gesellschaftsvertraglich begründete Pflicht zur Prüfung des Jahresabschlusses      |
+------+------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)

mysql> CALL KEYWORDS('Begründ*', 'phrases', 1 as stats, 'hits' as sort_mode);
+------+-----------+----------------------+------+------+
| qpos | tokenized | normalized           | docs | hits |
+------+-----------+----------------------+------+------+
| 1    | begründ*  | =begründete          | 1    | 1    |
| 1    | begründ*  | =begründung          | 1    | 1    |
| 1    | begründ*  | =begründungspflicht  | 1    | 1    |
+------+-----------+----------------------+------+------+
3 rows in set (0.00 sec)

mysql> CALL SUGGEST('Begründ*', 'phrases');
+-------------+----------+------+
| suggest     | distance | docs |
+-------------+----------+------+
| begründung  | 3        | 1    |
| begründete  | 3        | 1    |
+-------------+----------+------+
2 rows in set (0.00 sec)

Is this what you were expecting?

bcat-eu commented 3 years ago

OK it seems I misinterpreted the out of the box functionality, to be precise I assumed that non_cjk charset alias includes all "western" letters but it actually casts them to their "plain" alternatives by default.

Is this what you were expecting?

Yes, your solution works, thank you.

I have also tried the charset definition to keep the capitalization, that works as well.

In case anyone stumbles upon this ticket, case insensitive charset with German umlauts looks like that (as part of the @sanikolaev's suggestion):

charset_table='non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF'

And this one would keep capitalization (German nouns are capitalized and verbs are not so it's kind of useful for autocomplete):

charset_table='non_cjk, A..Z, U+00E4, U+00C4, U+00F6, U+00D6, U+00FC, U+00DC, U+00DF, U+1E9E'

But having capitalization will make the queries case-sensitive.