manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0

Support for ICU custom rules #2507

Open tomatolog opened 1 month ago

tomatolog commented 1 month ago

Proposal:

It would be good to add support for custom rules to the ICU integration.

It would also be good to support these options, or some of them, for morphology='icu_chinese', and to prohibit any use of exceptions, wordforms, and stopwords with morphology='icu_chinese' or ngram_chars.

The reason is that CJK tokenization depends on the surrounding text, while exceptions, wordforms, stopwords, and morphology are applied at different stages of the token processing pipeline, so the overall context gets lost along the way.
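To illustrate the concern, here is a toy sketch (not Manticore or ICU code). It assumes a hypothetical pipeline where an exceptions stage cuts the text before a dictionary-based segmenter runs. The dictionary, the function names, and the stage ordering are all made up for illustration; the point is only that the segmenter produces different tokens once it no longer sees the whole string.

```python
# Toy illustration only: the dictionary, names, and stage order are assumptions,
# not taken from Manticore or ICU.
TOY_DICTIONARY = {"北京", "大学", "北京大学", "生"}


def segment(text, dictionary=TOY_DICTIONARY):
    """Forward maximum matching: a stand-in for a context-dependent CJK segmenter."""
    tokens = []
    max_len = max(len(word) for word in dictionary)
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def apply_exceptions_then_segment(text, exceptions):
    """Hypothetical ordering: exceptions cut the text first,
    then each leftover piece is segmented in isolation."""
    tokens = []
    i = 0
    while i < len(text):
        match = next((src for src in exceptions if text.startswith(src, i)), None)
        if match is not None:
            tokens.append(exceptions[match])
            i += len(match)
            continue
        # Collect the run of text up to the next exception match.
        j = i
        while j < len(text) and not any(text.startswith(src, j) for src in exceptions):
            j += 1
        tokens.extend(segment(text[i:j]))
        i = j
    return tokens


whole = segment("北京大学生")
split = apply_exceptions_then_segment("北京大学生", {"大学": "university"})
print(whole)  # the segmenter sees the full string
print(split)  # the exception has already cut the string apart
```

With the full string, the toy segmenter keeps "北京大学" as one token; once the exception on "大学" is applied first, that token can no longer be formed.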

Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [ ] Implementation completed
- [ ] Tests developed
- [ ] Documentation updated
- [ ] Documentation reviewed
- [ ] [Changelog](https://docs.google.com/spreadsheets/d/1mz_3dRWKs86FjRF7EIZUziUDK_2Hvhd97G0pLpxo05s/edit?pli=1&gid=1102439133#gid=1102439133) updated
- [x] OpenAPI YAML updated and issue created to rebuild clients

tomatolog commented 1 month ago

Related issue: https://github.com/manticoresoftware/manticoresearch/issues/2507, where exceptions do not work with morphology='icu_chinese'.

Alternatively, the upcoming Jieba integration from https://github.com/manticoresoftware/manticoresearch/issues/931 might handle such cases.