meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License

Rework Chinese Pinyin normalizer #285

Open ManyTheFish opened 7 months ago

ManyTheFish commented 7 months ago

Current implementation

The current Chinese Pinyin normalizer romanizes Chinese characters into Pinyin. In practice this adds more noise than it helps relevance: distinct characters that share the same Pinyin form match each other, so documents that match the query exactly no longer rank at the top of the results. Pinyin normalization is still useful, however, for retrieving Chinese characters by typing their romanized form in the search bar.

Change Proposal

Implement a new normalizer that reverses the current behavior: instead of romanizing Chinese characters, it should detect whether a Latin token matches a Pinyin sequence and, if so, convert it into Chinese characters. This way, a query written in Chinese characters would no longer match other characters sharing the same Pinyin form, while a user could still retrieve Chinese words by typing their Pinyin form in the search bar.
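A rough sketch of the detection step, to make the idea concrete. Everything here is an assumption for illustration: `toy_pinyin_table`, `segment_pinyin`, and `pinyin_candidates` are hypothetical names, the four-entry table stands in for a real Pinyin dictionary, and none of this uses Charabia's actual normalizer API.

```rust
use std::collections::HashMap;

/// Hypothetical toy table: toneless Pinyin syllable -> some characters sharing it.
/// A real normalizer would need a complete dictionary.
fn toy_pinyin_table() -> HashMap<&'static str, Vec<char>> {
    HashMap::from([
        ("ni", vec!['你', '尼']),
        ("hao", vec!['好', '号']),
        ("zhong", vec!['中', '钟']),
        ("guo", vec!['国', '果']),
    ])
}

/// Tries to split `token` into syllables that are all keys of `table`.
/// Returns `None` if the token is not entirely made of known syllables.
fn segment_pinyin<'a>(
    token: &'a str,
    table: &HashMap<&'static str, Vec<char>>,
) -> Option<Vec<&'a str>> {
    if token.is_empty() {
        return Some(Vec::new());
    }
    // Longest match first; the longest Pinyin syllables (e.g. "zhuang") have 6 letters.
    let max_len = token.len().min(6);
    for len in (1..=max_len).rev() {
        if !token.is_char_boundary(len) {
            continue; // skip byte offsets that fall inside a multi-byte character
        }
        let (head, tail) = token.split_at(len);
        if table.contains_key(head) {
            // Backtrack if the remainder cannot be segmented.
            if let Some(mut rest) = segment_pinyin(tail, table) {
                rest.insert(0, head);
                return Some(rest);
            }
        }
    }
    None
}

/// For a Latin token that is a valid Pinyin sequence, returns the candidate
/// characters for each syllable; otherwise returns `None`.
fn pinyin_candidates(
    token: &str,
    table: &HashMap<&'static str, Vec<char>>,
) -> Option<Vec<Vec<char>>> {
    if token.is_empty() {
        return None;
    }
    let lowered = token.to_lowercase();
    let syllables = segment_pinyin(&lowered, table)?;
    Some(syllables.iter().map(|s| table[*s].clone()).collect())
}
```

With this toy table, `pinyin_candidates("nihao", &table)` yields one candidate list per syllable, while a non-Pinyin token such as `"hello"` yields `None` and would be left untouched. Choosing among the candidates (for example by checking which combinations form known words) is left out of this sketch.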

Dependencies

This implementation depends on another improvement to Charabia: Charabia should be able to keep alternative versions of the same token.
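One possible shape for such a token, purely as an illustration. `AltToken` and its methods are hypothetical and do not reflect Charabia's actual `Token` type; the sketch only shows how an original form and its alternatives could be matched together.

```rust
/// Hypothetical token type; this is NOT Charabia's actual `Token` struct.
#[derive(Debug, Clone, PartialEq)]
struct AltToken {
    /// The text as it was typed or indexed.
    original: String,
    /// Other forms that should match at the same position,
    /// e.g. Chinese characters derived from a Pinyin query.
    alternatives: Vec<String>,
}

impl AltToken {
    fn new(original: &str) -> Self {
        AltToken {
            original: original.to_string(),
            alternatives: Vec::new(),
        }
    }

    /// Adds an alternative form, ignoring duplicates and the original itself.
    fn with_alternative(mut self, alt: &str) -> Self {
        if alt != self.original && !self.alternatives.iter().any(|a| a == alt) {
            self.alternatives.push(alt.to_string());
        }
        self
    }

    /// Returns true if `candidate` equals the original or any alternative.
    fn matches(&self, candidate: &str) -> bool {
        self.original == candidate || self.alternatives.iter().any(|a| a == candidate)
    }
}
```

Under this assumption, a query token built as `AltToken::new("nihao").with_alternative("你好")` would match both the Latin form and the Chinese word, without making 你好 match other words that share the same Pinyin.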