Mixed Chinese-English detected as French

kornelski commented 1 year ago

The README of this crate, https://lib.rs/crates/dcli, contains Simplified Chinese text with code examples in English. If I feed the markdown of this file to whatlang, I get Lang::Fra with 0.52 confidence.

"# dcli\n数据库连接工具\n\n## 概述\n\ndcli 是一个简单的数据库管理工具。因为个人习惯喜欢用命令行，在平时工作中经常需要通过 mysql-client 连接到多个 mysql 数据库，每次连接都需要敲一长串参数或在历史记录中查找之前输入参数。我希望有一个可以替我保管多个 mysql 连接信息，在需要时指定连接名称就能连上数据库的工具，dcli 由此而来。\n\n注意: dcli 目前还使用明文保存密码!!!\n\n## 特性\n\n### 无 mysql-client 和 openssl 依赖\n\n不喜欢在换了一台机器后需要安装额外的 mysql-client 依赖, 特别是 SSL 连接使用的 openssl, 有时候安装 openssl 本身就是一个大麻烦。所以 dcli 使用了纯 rust 实现的 mysql 连接工具 sqlx, 而且最近版本的 sqlx 可以通过 rustls 特性使用 rustls 替换 native-tls, 所以无需担心 openssl 的依赖问题🎉。\n\n### 可调整表格样式\n\n### 支持 i18n\n\n通过条件编译和

I think the language detection could be strongly biased towards the presence of CJK characters, because speakers of these languages are much more likely to use some Latin letters than speakers of European languages are to use a substantial amount of CJK characters.

greyblake commented 12 months ago

@kornelski Thank you! That is a valid point!

At the moment, the algorithm for detecting a script is based on counting the chars that belong to each script. The winner is the script with the highest count.
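To illustrate why the sample above ends up on the Latin side, here is a minimal sketch of count-based script detection. It is not whatlang's actual code: the `Script` enum, `classify`, and `detect_script` are hypothetical names, and the character ranges are deliberately reduced to ASCII letters and the main CJK Unified Ideographs block.

```rust
/// Hypothetical, simplified script categories (not whatlang's real `Script` enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Script {
    Latin,
    Cjk,
}

/// Rough classification: ASCII letters count as Latin, the main
/// CJK Unified Ideographs block counts as CJK, everything else is ignored.
fn classify(c: char) -> Option<Script> {
    match c {
        'a'..='z' | 'A'..='Z' => Some(Script::Latin),
        '\u{4E00}'..='\u{9FFF}' => Some(Script::Cjk),
        _ => None,
    }
}

/// Count the chars belonging to each script and return the one with the
/// highest count. Ties go to Latin in this sketch; `None` means no
/// classifiable chars were found.
fn detect_script(text: &str) -> Option<Script> {
    let (mut latin, mut cjk) = (0usize, 0usize);
    for c in text.chars() {
        match classify(c) {
            Some(Script::Latin) => latin += 1,
            Some(Script::Cjk) => cjk += 1,
            None => {}
        }
    }
    match (latin, cjk) {
        (0, 0) => None,
        (l, c) if c > l => Some(Script::Cjk),
        _ => Some(Script::Latin),
    }
}
```

For a line such as `dcli 使用了纯 rust 实现的 mysql 连接工具 sqlx`, this counts 17 Latin letters against 11 CJK characters, so `detect_script` returns `Some(Script::Latin)` even though the sentence is Chinese.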

Btw, you can play with the input on https://whatlang.org/

Here you can get some insight into how the library works:

On the other hand, the problem is that you're feeding it mixed text. Whatlang is not designed to work with that type of input.

kornelski commented 12 months ago

Mixing of languages/scripts makes it more difficult indeed, but that is unfortunately a real-world situation I wanted to solve.

Could you add weights to the scores? It could be as simple as a 3x boost for CJK scripts.
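A rough sketch of what such a weighting might look like, assuming a simple per-character count. The `Script` enum, `detect_script_weighted`, and the `cjk_weight` parameter are made up for illustration and are not part of whatlang's API. Only ASCII letters and the main CJK Unified Ideographs block are counted.

```rust
/// Hypothetical, simplified script categories (not whatlang's real `Script` enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Script {
    Latin,
    Cjk,
}

/// Count-based script detection where every CJK character adds `cjk_weight`
/// to the CJK score instead of 1. A weight of 1 is the unweighted count.
/// Ties go to Latin; `None` means no classifiable chars were found.
fn detect_script_weighted(text: &str, cjk_weight: usize) -> Option<Script> {
    let (mut latin, mut cjk) = (0usize, 0usize);
    for c in text.chars() {
        match c {
            'a'..='z' | 'A'..='Z' => latin += 1,
            '\u{4E00}'..='\u{9FFF}' => cjk += cjk_weight,
            _ => {}
        }
    }
    match (latin, cjk) {
        (0, 0) => None,
        (l, c) if c > l => Some(Script::Cjk),
        _ => Some(Script::Latin),
    }
}
```

With the line `dcli 使用了纯 rust 实现的 mysql 连接工具 sqlx` (17 Latin letters, 11 CJK characters), a weight of 1 gives Latin, while a weight of 3 gives a CJK score of 33 and the result flips to `Some(Script::Cjk)`. Pure English text is unaffected, since it has no CJK characters to boost.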

greyblake / whatlang-rs

Mixed Chinese-English detected as French #136