Why not quadgrams?

bee-san commented 1 year ago

Hey! Just wondering why you chose trigrams over quadgrams? Thanks!

greyblake commented 1 year ago

Hey, that's a good and reasonable question.

Before I started this library I was trying out different n-gram approaches while implementing a library in Crystal (it's not available anywhere). At that time I was building n-gram models myself from whatever text corpora I could find online. Then I found https://github.com/wooorm/franc and tried its trigram models. They were simpler, yet produced better-quality results and allowed better performance. So I decided to use its trigram models.

To be honest, I never gave quadgrams a proper try. From one of the papers I read, I recall that quadgrams would be more suitable for classifying documents written in the same language.

Another practical problem: in order to build n-gram models you need a good, well-balanced, representative corpus. For major languages such corpora can be found; for minor languages it becomes a challenge.
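To make the trigram/quadgram distinction concrete, here is a minimal sketch of counting character n-grams in a piece of corpus text. It is not how franc or whatlang actually build their models; the function names (`ngram_counts`, `top_ngrams`) and the normalization (lowercasing, collapsing non-letters into a single space) are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Count character n-grams of length `n` in `text`.
/// Assumed normalization: letters are lowercased, and every run of
/// non-letters becomes one space, so word boundaries are part of the n-grams.
fn ngram_counts(text: &str, n: usize) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        // `windows(0)` would panic, so bail out early.
        return counts;
    }

    // Pad with a leading space so the first word boundary is captured.
    let mut chars: Vec<char> = vec![' '];
    for c in text.chars() {
        if c.is_alphabetic() {
            chars.extend(c.to_lowercase());
        } else if chars.last() != Some(&' ') {
            chars.push(' ');
        }
    }
    // Pad with a trailing space unless one is already there.
    if chars.last() != Some(&' ') {
        chars.push(' ');
    }

    // Slide a window of `n` characters over the normalized text.
    for window in chars.windows(n) {
        let gram: String = window.iter().collect();
        *counts.entry(gram).or_insert(0) += 1;
    }
    counts
}

/// Return the `k` most frequent n-grams, most frequent first.
/// Ties are broken alphabetically so the result is deterministic.
fn top_ngrams(text: &str, n: usize, k: usize) -> Vec<String> {
    let mut entries: Vec<(String, usize)> = ngram_counts(text, n).into_iter().collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    entries.into_iter().take(k).map(|(gram, _)| gram).collect()
}
```

Calling `top_ngrams(corpus, 3, 300)` versus `top_ngrams(corpus, 4, 300)` on the same text gives a trigram and a quadgram profile. The number of possible distinct n-grams grows with `n`, which is one reason a larger, more representative corpus is needed as `n` increases.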

Another concern: I'd like to keep the library minimalistic and fast, or otherwise modular, so users can choose how they want to balance quality, performance and memory footprint.

For example, lingua performs better; it applies several different n-gram models. But it's much slower and takes much more memory. I did a comparison in this blog article: https://www.greyblake.com/blog/whatlang-strikes-back/

I hope I was able to answer your question.

bee-san commented 1 year ago

Hey @greyblake! Thanks so much for this. We have some weird requirements and I'd like your opinion:

Ideally we'd like to focus on English text. Multi-language support is cool, but not 100% needed.
It has to be fast.
The accuracy has to be reasonably good.
We have no guarantee the input text is in any language at all; it could be gibberish. Traditionally, Python natural language libraries have not handled this requirement well.

I think WhatLang might work better for us. Specifically, I think having an if statement to check whether the confidence is, say, over 80% would be great for us (or we could use info.is_reliable()).

We also deal with smaller texts, which WhatLang does not like so much. I am thinking of using both libraries: if the text is short (20 characters), use Lingua; otherwise rely on Whatlang.

What do you think?
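The routing idea described above could be sketched roughly as follows. This is only an illustration: `Detection`, `detect_routed`, `is_probably_english` and the two constants are hypothetical names, and the detectors are passed in as closures rather than calling either library directly. In practice the fast closure might wrap `whatlang::detect` and read `Info::confidence()`, while the slow one would wrap whatever Lingua binding is in use.

```rust
/// Result of a language detector: a language code plus a confidence in [0.0, 1.0].
/// Hypothetical type, not part of whatlang or Lingua.
#[derive(Debug, Clone, PartialEq)]
struct Detection {
    lang: String,
    confidence: f64,
}

/// Texts of at most this many characters go to the slower, more accurate detector.
const SHORT_TEXT_CHARS: usize = 20;
/// Detections below this confidence are treated as "unknown / possibly gibberish".
const MIN_CONFIDENCE: f64 = 0.8;

/// Route short texts to `slow` and everything else to `fast`,
/// then drop any result whose confidence is below the threshold.
fn detect_routed<F, S>(text: &str, fast: F, slow: S) -> Option<Detection>
where
    F: Fn(&str) -> Option<Detection>,
    S: Fn(&str) -> Option<Detection>,
{
    // Count characters, not bytes, so non-ASCII text is measured fairly.
    let is_short = text.chars().count() <= SHORT_TEXT_CHARS;
    let detection = if is_short { slow(text) } else { fast(text) };
    detection.filter(|d| d.confidence >= MIN_CONFIDENCE)
}

/// Convenience check for the "mostly English" use case.
/// Assumes the detectors report English as the code "eng".
fn is_probably_english<F, S>(text: &str, fast: F, slow: S) -> bool
where
    F: Fn(&str) -> Option<Detection>,
    S: Fn(&str) -> Option<Detection>,
{
    matches!(detect_routed(text, fast, slow), Some(d) if d.lang == "eng")
}
```

Returning `None` for low-confidence results keeps the gibberish case explicit: the caller has to decide what to do when no language could be established, instead of receiving a best guess.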

greyblake commented 1 year ago

The last time I ran a benchmark, Whatlang was 10-15 times faster than Lingua. But I think both are reasonably fast. If you're fine with allocating a few hundred MB of RAM for the Lingua models, I'd say Lingua is good to go. If you're working with very large texts, it may also make sense to truncate them to ~1000 chars.
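Truncating by characters rather than bytes matters in Rust, because slicing a `&str` in the middle of a multi-byte character panics. A minimal sketch, where `truncate_chars` is a hypothetical helper name and 1000 is just the rough figure mentioned above:

```rust
/// Return at most the first `max_chars` characters of `text`,
/// always cutting on a character boundary.
fn truncate_chars(text: &str, max_chars: usize) -> &str {
    // `char_indices` yields the byte offset of each character; the offset of
    // the character at position `max_chars` is where the slice must end.
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        // Fewer than `max_chars + 1` characters: nothing to cut.
        None => text,
    }
}
```

The helper borrows from the input and allocates nothing, so calling `truncate_chars(input, 1000)` before detection adds very little overhead.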

> Specifically, I think having an if statement to check whether the confidence is, say, over 80% would be great for us (or we could use info.is_reliable()).

Keep in mind that, unfortunately, Whatlang may still produce a small fraction of false positives (cases where it reports 100% confidence but the detected language is wrong). This can happen for about 0.5% of small texts.

You can find quality benchmarks of Whatlang here: https://github.com/whatlang/whatlang-accuracy-benchmark/blob/master/reports/2022-08-30.md

greyblake / whatlang-rs

Why not quadgrams? #132