Rework Chinese Pinyin normalizer

Current implementation

The current Chinese Pinyin normalizer romanizes Chinese characters into Pinyin. In practice this adds more noise than it removes: distinct characters that share the same Pinyin become indistinguishable, so documents that match the query exactly no longer rank at the top of the results. The Pinyin normalization is still useful, however, because it lets users retrieve Chinese characters by typing their romanized form in the search bar.
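To illustrate the noise problem, here is a minimal sketch of a character-to-Pinyin normalization. It is not the actual Charabia code: `to_pinyin` and `normalize_to_pinyin` are hypothetical names, the table is a tiny hand-written sample, and tones are dropped for simplicity.

```rust
/// Hypothetical lookup standing in for a real Pinyin table.
/// Tones are omitted here, which is an assumption of this sketch.
fn to_pinyin(c: char) -> Option<&'static str> {
    match c {
        '是' => Some("shi"),
        '事' => Some("shi"),
        '市' => Some("shi"),
        '人' => Some("ren"),
        _ => None,
    }
}

/// Replaces every character that has a known reading by its Pinyin
/// and leaves all other characters untouched.
fn normalize_to_pinyin(text: &str) -> String {
    text.chars()
        .map(|c| match to_pinyin(c) {
            Some(pinyin) => pinyin.to_string(),
            None => c.to_string(),
        })
        .collect()
}
```

With this kind of normalization, `normalize_to_pinyin("是")` and `normalize_to_pinyin("事")` produce the same string, so a query for one character also matches documents containing the other.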

Change Proposal

Implement a new normalizer that reverses the current behavior: it should detect whether a Latin token matches a Pinyin sequence and, if so, convert it into Chinese characters. This way, a query written in Chinese characters would no longer match other characters that share the same Pinyin, while a user could still retrieve Chinese words by typing their Pinyin form in the search bar.
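One possible shape for the detection step, as a rough sketch only: split the Latin token into known Pinyin syllables and, if the whole token can be split, return the candidate characters for each syllable. The function names and the three-entry table are hypothetical, tones are ignored, and a real implementation would need a complete syllable table and a strategy for choosing among candidates.

```rust
use std::collections::HashMap;

/// Hypothetical sample table: toneless Pinyin syllable -> candidate characters.
fn sample_table() -> HashMap<&'static str, Vec<char>> {
    HashMap::from([
        ("zhong", vec!['中', '重']),
        ("guo", vec!['国', '果']),
        ("ren", vec!['人']),
    ])
}

/// Splits `token` into syllables found in `table`, trying longer prefixes
/// first and backtracking when the remainder cannot be split.
/// Returns None when the token is not entirely made of known syllables.
fn split_syllables<'a>(
    token: &'a str,
    table: &HashMap<&'static str, Vec<char>>,
) -> Option<Vec<&'a str>> {
    if token.is_empty() {
        return Some(Vec::new());
    }
    // A toneless Pinyin syllable is at most 6 ASCII letters (e.g. "zhuang").
    let max_len = token.len().min(6);
    for end in (1..=max_len).rev() {
        if !token.is_char_boundary(end) {
            continue;
        }
        let (head, tail) = token.split_at(end);
        if table.contains_key(head) {
            if let Some(mut rest) = split_syllables(tail, table) {
                rest.insert(0, head);
                return Some(rest);
            }
        }
    }
    None
}

/// Returns the candidate characters for each syllable of a Latin token,
/// or None when the token is empty or is not a Pinyin sequence.
fn pinyin_to_candidates(
    token: &str,
    table: &HashMap<&'static str, Vec<char>>,
) -> Option<Vec<Vec<char>>> {
    if token.is_empty() {
        return None;
    }
    let lowered = token.to_ascii_lowercase();
    let syllables = split_syllables(&lowered, table)?;
    Some(syllables.iter().map(|s| table[*s].clone()).collect())
}
```

With the sample table, `pinyin_to_candidates("zhongguo", &sample_table())` yields the candidates for `zhong` followed by those for `guo`, while a token such as `hello` yields `None` and would be left unchanged.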

Dependencies

This implementation depends on another improvement to Charabia: Charabia should be able to hold alternative versions of the same token.
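As a purely illustrative sketch of what "alternative versions of the same token" could look like, the type below keeps the original lemma and a list of alternative forms. `TokenWithAlternatives` and its methods are hypothetical and do not describe Charabia's existing `Token` type or any planned API.

```rust
/// Hypothetical token that carries alternative forms next to its lemma.
#[derive(Debug, Clone, PartialEq)]
struct TokenWithAlternatives {
    /// The token as typed, e.g. a Latin Pinyin sequence.
    lemma: String,
    /// Other forms that should match too, e.g. Chinese characters.
    alternatives: Vec<String>,
}

impl TokenWithAlternatives {
    fn new(lemma: &str) -> Self {
        Self {
            lemma: lemma.to_string(),
            alternatives: Vec::new(),
        }
    }

    /// Adds an alternative form, skipping duplicates and the lemma itself.
    fn with_alternative(mut self, alt: &str) -> Self {
        if alt != self.lemma && !self.alternatives.iter().any(|a| a == alt) {
            self.alternatives.push(alt.to_string());
        }
        self
    }

    /// Iterates over the lemma first, then over every alternative.
    fn forms(&self) -> impl Iterator<Item = &str> {
        std::iter::once(self.lemma.as_str())
            .chain(self.alternatives.iter().map(String::as_str))
    }

    /// True if any form of this token equals the indexed term.
    fn matches(&self, indexed: &str) -> bool {
        self.forms().any(|form| form == indexed)
    }
}
```

Under this sketch, a query token built with `TokenWithAlternatives::new("zhongguo").with_alternative("中国")` would match both the Latin term and the Chinese word, while Chinese text in documents would stay unromanized.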

meilisearch / charabia