greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/
MIT License

[Question] Confidence in short text #43

Closed GrayJack closed 5 years ago

GrayJack commented 5 years ago

I'm working on a research project that requires me to detect the language of articles based solely on their titles, which in some cases have only 3 to 7 words.

In the live demo I noticed that English and German require more words to reach good confidence than Portuguese and Spanish do.

I used random article titles to test it.

Is there a way to optimize it, like a configuration in Options to use user-specified n-grams of some kind? If not, is there another lib that you're aware of that might satisfy my needs?

greyblake commented 5 years ago

Hi! This option is not available in the demo, but you can specify a whitelist or a blacklist. Please see the documentation: https://docs.rs/whatlang/0.7.1/whatlang/

Let's say you know ahead of time that the given text must be either Russian or English. You can use a whitelist, like:

use whatlang::{Detector, Lang};

fn main() {
    // Restrict detection to the languages you expect.
    let whitelist = vec![Lang::Eng, Lang::Rus];

    // You can also create a detector using the with_blacklist function.
    let detector = Detector::with_whitelist(whitelist);
    let lang = detector.detect_lang("There is no reason not to learn Esperanto.");
    assert_eq!(lang, Some(Lang::Eng));
}

If whitelisting/blacklisting does not help with your task, then this library is not the right choice. You may need to try something else that is dictionary-based. Unfortunately, I don't know of any Rust library for this.

Please let me know if this solves the issue (so I can close it).
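
For illustration only, here is a rough sketch of what a dictionary-based approach might look like for short titles. It is not part of whatlang and not an existing library: the function name `detect_by_dictionary`, the `DICTIONARIES` table, and the tiny stop-word lists are all hypothetical, and a usable detector would need far larger word lists. The sketch counts how many words of the title appear in each language's list and returns nothing on a tie or when no word matches.

```rust
// Hypothetical, deliberately tiny stop-word lists (assumptions for this sketch).
const ENGLISH: &[&str] = &["the", "of", "and", "in", "to", "is", "for", "with", "on", "a"];
const GERMAN: &[&str] = &["der", "die", "das", "und", "ist", "von", "mit", "ein", "nicht"];
const PORTUGUESE: &[&str] = &["o", "a", "de", "da", "do", "e", "em", "para", "com", "uma", "os"];
const SPANISH: &[&str] = &["el", "la", "de", "del", "y", "en", "para", "con", "no", "una", "los"];

// Hypothetical lookup table pairing a language name with its word list.
const DICTIONARIES: &[(&str, &[&str])] = &[
    ("English", ENGLISH),
    ("German", GERMAN),
    ("Portuguese", PORTUGUESE),
    ("Spanish", SPANISH),
];

/// Returns the language whose word list matches the most words in `title`,
/// together with the number of matching words. Returns `None` when no word
/// matches any list or when two languages tie for the best score.
fn detect_by_dictionary(title: &str) -> Option<(&'static str, usize)> {
    // Split on anything that is not a letter and lowercase each word.
    let words: Vec<String> = title
        .split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect();

    let mut best: Option<(&'static str, usize)> = None;
    let mut tied = false;

    for &(name, dict) in DICTIONARIES {
        // Count title words that appear in this language's list.
        let hits = words
            .iter()
            .filter(|w| dict.iter().any(|d| *d == w.as_str()))
            .count();
        if hits == 0 {
            continue;
        }
        let best_hits = best.map_or(0, |(_, h)| h);
        if hits > best_hits {
            best = Some((name, hits));
            tied = false;
        } else if hits == best_hits {
            tied = true;
        }
    }

    if tied { None } else { best }
}
```

With these lists, `detect_by_dictionary("Die Geschichte der Stadt")` picks German from two matching words, while `detect_by_dictionary("Historia de Roma")` returns `None` because "de" appears in both the Portuguese and the Spanish list. The hit count could serve as a crude confidence signal for very short titles.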

GrayJack commented 5 years ago

Hi, I'll do some tests this week and let you know, thanks!!

greyblake commented 5 years ago

@GrayJack Hi! Any updates?

greyblake commented 5 years ago

Please reopen if you find the issue still relevant.