Open kornelski opened 1 year ago
@kornelski Thank you! That is a valid point!
At the moment, the algorithm to detect a script is based on counting the characters that belong to each script. The script with the highest count wins.
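To make the counting approach concrete, here is a minimal std-only sketch. It is not whatlang's actual code: the three-variant `Script` enum, the hand-picked Unicode ranges in `script_of`, and the function names are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Coarse script buckets, for illustration only (hypothetical, not whatlang's enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Script {
    Latin,
    Cyrillic,
    Cjk,
}

/// Map a character to a bucket using a few approximate Unicode ranges.
/// Digits, whitespace, and punctuation fall through to `None`.
fn script_of(c: char) -> Option<Script> {
    match c {
        'a'..='z' | 'A'..='Z' | '\u{00C0}'..='\u{024F}' => Some(Script::Latin),
        '\u{0400}'..='\u{04FF}' => Some(Script::Cyrillic),
        // CJK Unified Ideographs plus Hiragana and Katakana.
        '\u{4E00}'..='\u{9FFF}' | '\u{3040}'..='\u{30FF}' => Some(Script::Cjk),
        _ => None,
    }
}

/// Count how many characters of the text fall into each bucket.
fn count_scripts(text: &str) -> HashMap<Script, usize> {
    let mut counts = HashMap::new();
    for script in text.chars().filter_map(script_of) {
        *counts.entry(script).or_insert(0) += 1;
    }
    counts
}

/// Pick the bucket with the highest count; `None` if nothing was classified.
/// Ties are broken arbitrarily because `HashMap` iteration order is unspecified.
fn detect_script_naive(text: &str) -> Option<Script> {
    count_scripts(text)
        .into_iter()
        .max_by_key(|&(_, n)| n)
        .map(|(script, _)| script)
}
```

With this sketch, a short Chinese sentence followed by a longer run of English identifiers is classified as Latin, simply because the Latin letters outnumber the ideographs.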
Btw, you can play with the input on https://whatlang.org/
Here you can get a bit of insight into how the library works:
On the other hand, the problem is that you're feeding it mixed text. Whatlang is not designed to work with this type of input.
Mixing of languages/scripts makes it more difficult indeed, but that is unfortunately a real-world situation I wanted to solve.
Could you add weights to the scores? It could be as simple as a 3x boost for CJK scripts.
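A rough sketch of what such a weight could look like, assuming a simplified two-bucket counter (ASCII Latin letters versus CJK). The function names, the Unicode ranges, and the `cjk_weight` parameter are hypothetical and not part of whatlang's API:

```rust
/// Approximate CJK check: CJK Unified Ideographs plus Hiragana and Katakana.
fn is_cjk(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3040}'..='\u{30FF}')
}

/// Returns `(latin_score, cjk_score)`.
/// Each ASCII letter adds 1.0 to the Latin score; each CJK character adds
/// `cjk_weight` to the CJK score. All other characters are ignored.
fn weighted_scores(text: &str, cjk_weight: f64) -> (f64, f64) {
    let mut latin = 0.0;
    let mut cjk = 0.0;
    for c in text.chars() {
        if is_cjk(c) {
            cjk += cjk_weight;
        } else if c.is_ascii_alphabetic() {
            latin += 1.0;
        }
    }
    (latin, cjk)
}

/// True when the weighted CJK score strictly exceeds the Latin score.
fn cjk_wins(text: &str, cjk_weight: f64) -> bool {
    let (latin, cjk) = weighted_scores(text, cjk_weight);
    cjk > latin
}
```

For the input `"命令行工具 cargo run"` there are 5 CJK characters and 8 Latin letters, so a weight of 1.0 lets Latin win, while a weight of 3.0 gives CJK a score of 15 against 8.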
The readme of this crate, https://lib.rs/crates/dcli, contains Simplified Chinese text with code examples in English. If I feed the markdown of this file to whatlang, I get
Lang::Fra
with 0.52 confidence. I think the language detection could be strongly biased towards the presence of CJK characters, because speakers of these languages are much more likely to use some Latin letters than speakers of European languages are to use a substantial number of CJK characters.