ngdbao closed this issue 9 months ago
Hello @ngdbao,
Thank you for reporting these spoofing variants. This should be straightforward to solve by adding a new normalizer to the pipeline, so I've labeled it as a good first issue for any external contributor.
Don't hesitate to contribute yourself; you don't need deep knowledge of Rust or the repository to do so.
There is a small tutorial on how to implement a Normalizer in Charabia directly in the CONTRIBUTING.md file. However, if you struggle to implement the Normalizer or have a question, don't hesitate to open a draft PR and ping me in the comments.
Thanks again for your issue.
I'd be interested in trying this but would like a sample test of a similar issue. I'm off from work this week so can probably contribute a lot.
Hello @jzabroski, don't hesitate to tackle this issue, it's an easy one 😄
I'm new to the code, so I don't know how easy it is, but... I can actually see a couple ways this can be done
Or
Good question. The issue with globally normalizing the visual spoofing variants is that we could go too far, because we are not parsing URLs. However, we could rely on a spoofing-variant normalizer and activate it language by language when we are confident in doing so. As a quick start, I encourage you to read the CONTRIBUTING.md file; it will guide you through creating a normalizer. If you have any doubts or questions, don't hesitate to create a draft PR and ping me on it. 😄
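To make the "language by language" idea concrete, here is a minimal standalone sketch. It assumes nothing about Charabia's actual Normalizer trait or its language type: `Lang`, `fold_lookalike`, and `normalize` are hypothetical names used only for illustration, and the mapping is limited to the two eth characters mentioned in this issue.

```rust
// Hypothetical language tag for this sketch; NOT Charabia's own language type.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Lang {
    Vietnamese,
    Icelandic,
}

/// Fold the eth look-alikes onto d-with-stroke, but only for a language
/// where the folding has been explicitly enabled (here: Vietnamese).
fn fold_lookalike(c: char, lang: Lang) -> char {
    if lang != Lang::Vietnamese {
        // Eth is a real letter in Icelandic, so leave it untouched there.
        return c;
    }
    match c {
        '\u{00D0}' => '\u{0110}', // Ð (capital eth) -> Đ (capital D with stroke)
        '\u{00F0}' => '\u{0111}', // ð (small eth)   -> đ (small d with stroke)
        other => other,
    }
}

/// Apply the per-character folding to a whole string.
fn normalize(text: &str, lang: Lang) -> String {
    text.chars().map(|c| fold_lookalike(c, lang)).collect()
}

fn main() {
    let pasted = "\u{00D0}à Lạt"; // starts with eth, U+00D0
    let typed = "\u{0110}à Lạt"; // starts with D with stroke, U+0110

    // The raw strings differ, which is why an exact match fails.
    assert_ne!(pasted, typed);
    // With the folding enabled for Vietnamese, both spellings agree.
    assert_eq!(normalize(pasted, Lang::Vietnamese), typed);
    // For a language where eth is legitimate, nothing is changed.
    assert_eq!(normalize(pasted, Lang::Icelandic), pasted);
    println!("ok");
}
```

Gating the mapping on the language keeps the folding from going "too far": text in a language that genuinely uses eth is passed through unchanged.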
@ManyTheFish I created a PR with zero knowledge of Rust. Please review it and suggest fixes to make it work as expected. https://github.com/meilisearch/charabia/pull/257
I think this can be closed now since https://github.com/meilisearch/charabia/pull/257 is merged and released.
Hi!
Everything works in v0.30.0 when searching for "Đà Lạt".
Since upgrading to any version higher than v0.30.0, typing "Đà Lạt" on the keyboard returns no results.
Then I tried copying the text from the source and pasting it into the input, "Ðà Lạt", and it works! Strange.
It took a while before I asked ChatGPT, and this is the answer: Ð is Unicode U+00D0 and Đ is Unicode U+0110,
even though they look exactly the same.
My expectation: Ð and Đ (and their case variants) should be treated as the same, hopefully.
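The difference can be checked without any search engine involved. The short sketch below uses only the Rust standard library; `code_points` is a hypothetical helper written for this illustration. It prints the code point of each character and shows that lowercasing alone does not merge the two letters.

```rust
/// List the Unicode code points of a string in "U+XXXX" form.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let eth = "\u{00D0}"; // Ð, LATIN CAPITAL LETTER ETH
    let d_stroke = "\u{0110}"; // Đ, LATIN CAPITAL LETTER D WITH STROKE

    println!("{} -> {:?}", eth, code_points(eth));
    println!("{} -> {:?}", d_stroke, code_points(d_stroke));

    // Visually alike, but different code points, so the strings are not equal.
    assert_ne!(eth, d_stroke);

    // Lowercasing keeps them apart: Ð becomes ð (U+00F0), Đ becomes đ (U+0111).
    assert_eq!(eth.to_lowercase(), "\u{00F0}");
    assert_eq!(d_stroke.to_lowercase(), "\u{0111}");
}
```

Because case folding keeps the two letters distinct, treating them as equal requires an explicit mapping in a normalizer.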