meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License

Ð vs Đ differentiate #245

Closed ngdbao closed 9 months ago

ngdbao commented 1 year ago

Hi!

Everything works in v0.30.0: searching for "Đà Lạt" returns results.

Since I upgraded to any version higher than v0.30.0, typing "Đà Lạt" on the keyboard returns no matches.

Then I tried copying the text from the source and pasting it into the input, "Ðà Lạt", and it works! Strange.

It took me a while before I asked ChatGPT, and this was the answer:

Ð is Unicode U+00D0
Đ is Unicode U+0110

Even though they look exactly the same.

My expectation: Ð and Đ (in both upper and lower case) should be treated as the same, hopefully.
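To make the expectation concrete, here is a minimal Rust sketch of the difference and of one possible fold. It is not Charabia code: `fold_d_lookalike` and `fold_str` are hypothetical helpers, and mapping all four letters to a plain `d` is an assumption chosen for illustration. Note that lowercasing alone does not help, since Ð lowercases to ð (U+00F0) and Đ lowercases to đ (U+0111).

```rust
/// Hypothetical helper (not part of Charabia): folds the look-alike letters
/// Ð (U+00D0), ð (U+00F0), Đ (U+0110) and đ (U+0111) to an ASCII 'd'.
/// Folding to 'd' is an assumption made for this sketch.
fn fold_d_lookalike(c: char) -> char {
    match c {
        '\u{00D0}' | '\u{00F0}' | '\u{0110}' | '\u{0111}' => 'd',
        other => other,
    }
}

/// Applies the fold to every character of a string; all other characters
/// (including the Vietnamese tone marks) are left untouched.
fn fold_str(s: &str) -> String {
    s.chars().map(fold_d_lookalike).collect()
}
```

With such a fold, the typed query (U+0110) and the pasted one (U+00D0) become the same string. U+00D0 is also the Icelandic letter eth, so applying this fold everywhere would affect other languages too.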

ManyTheFish commented 1 year ago

Hello @ngdbao, thank you for reporting these spoofing variants. It's a straightforward issue to solve by adding a new normalizer in the pipeline, so I put it as a good first issue for any external contributor. Don't hesitate to contribute by yourself; you don't need good knowledge of Rust or the repository to do so.

There is a small tutorial on how to implement a Normalizer in Charabia directly in the CONTRIBUTING.md file. However, if you struggle to implement the Normalizer or have a question, don't hesitate to open a draft PR and ping me in the comments.

Thanks again for your issue.

jzabroski commented 11 months ago

I'd be interested in trying this but would like a sample test of a similar issue. I'm off from work this week so can probably contribute a lot.

ManyTheFish commented 11 months ago

Hello @jzabroski, don't hesitate to tackle this issue, it's an easy one 😄

jzabroski commented 11 months ago

I'm new to the code, so I don't know how easy it is, but... I can actually see a couple ways this can be done

  1. Needs a Vietnamese Normalizer?

Or

  1. Needs a Unicode Visual Spoofing Normalizer? https://websec.github.io/unicode-security-guide/visual-spoofing/
  2. If yes, does the API support "did you mean" hints to record the search redirect?
  3. Where would I plug in such a generic Normalizer? Looking at the code, I don't think there is a normalization pass over the text where I can plug in additional passes. It's also not clear whether the ideal architecture in that case is a super-fold operation, so that visual spoofing comes first and immediately passes character chunks to the next Normalizer as they're ready.

ManyTheFish commented 11 months ago

Good question. The issue with globally normalizing the visual spoofing variants is that we could go too far, because we are not parsing URLs. However, we could rely on a spoofing variant normalizer and activate it language by language when we are confident about doing so. As a quick start, I encourage you to read the CONTRIBUTING.md file; it will guide you through creating a normalizer. If you have any doubts or questions, don't hesitate to create a draft PR and ping me on it. 😄
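The language-by-language idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `Lang`, `spoofing_enabled` and `normalize` are hypothetical names, not Charabia's API, and the set of enabled languages and the fold to `d` are made up for the example. The actual trait to implement is the one described in CONTRIBUTING.md.

```rust
/// Hypothetical language tag; Charabia has its own language detection types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Lang {
    Vietnamese,
    Other,
}

/// Assumption for this sketch: the spoofing fold is enabled only for
/// languages where it is considered safe.
fn spoofing_enabled(lang: Lang) -> bool {
    matches!(lang, Lang::Vietnamese)
}

/// Folds Ð/ð (U+00D0/U+00F0) and Đ/đ (U+0110/U+0111) to 'd', but only when
/// the fold is enabled for the given language; otherwise returns the text as is.
fn normalize(text: &str, lang: Lang) -> String {
    if !spoofing_enabled(lang) {
        return text.to_string();
    }
    text.chars()
        .map(|c| match c {
            '\u{00D0}' | '\u{00F0}' | '\u{0110}' | '\u{0111}' => 'd',
            other => other,
        })
        .collect()
}
```

Gating the fold this way keeps text in other languages untouched, which is the "not going too far" concern raised above.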

ngdbao commented 10 months ago

@ManyTheFish I created a PR with zero knowledge of Rust. Please review it and suggest fixes to make it work as expected. https://github.com/meilisearch/charabia/pull/257

timvisee commented 10 months ago

I think this can be closed now since https://github.com/meilisearch/charabia/pull/257 is merged and released.