meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License

normalize Ð and Đ into d #257

Closed: ngdbao closed this 10 months ago

ngdbao commented 10 months ago

Pull Request

Related issue

Fixes issue #245

What does this PR do?


curquiza commented 10 months ago

@ngdbao thank you for the PR. Can you fix the Rustfmt CI before we review it, please? 😊

jzabroski commented 10 months ago

Isn't D with stroke (Đ, đ) a different letter from plain D? I think folding it into d may negatively affect downstream tokenization in an n-gram language model.
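
To make the concern concrete, here is a minimal sketch of the kind of folding the PR title describes. It is not charabia's actual normalizer code: `fold_d_variants` and `normalize` are hypothetical names, and including lowercase đ (U+0111) is an assumption beyond the two characters named in the title. Ð (U+00D0, eth) and Đ (U+0110, D with stroke) are separate code points with no canonical decomposition, so Unicode decomposition alone would not reduce them to d; an explicit mapping is needed.

```rust
// Hypothetical helper for illustration only; not part of charabia's API.
fn fold_d_variants(c: char) -> char {
    match c {
        // U+00D0 LATIN CAPITAL LETTER ETH
        // U+0110 LATIN CAPITAL LETTER D WITH STROKE
        // U+0111 LATIN SMALL LETTER D WITH STROKE (assumed, not in the PR title)
        '\u{00D0}' | '\u{0110}' | '\u{0111}' => 'd',
        other => other,
    }
}

// Applies the folding to every character; all other characters pass through.
fn normalize(text: &str) -> String {
    text.chars().map(fold_d_variants).collect()
}

fn main() {
    // After folding, "Đ" and "d" can no longer be told apart.
    println!("{}", normalize("\u{0110}a"));
    println!("{}", normalize("da"));
}
```

Both calls in `main` produce the same string, which is the trade-off raised above: search becomes more forgiving, but the distinction between the two letters is lost for anything downstream.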

curquiza commented 10 months ago

(@jzabroski, a detail, there is still the issue with Rustfmt CI 😇)

ngdbao commented 10 months ago

(@jzabroski, a detail, there is still the issue with Rustfmt CI 😇)

Sorry, my bad. I'm starting with zero knowledge of Rust and am still setting up my local Rust environment.

meili-bors[bot] commented 10 months ago

Build succeeded.