pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Add fastspell to the comparison #188

Closed marco-c closed 7 months ago

marco-c commented 8 months ago

fastspell uses a combination of fastText and dictionaries to identify the language. It would be interesting to see how it compares to lingua-py.

pemistahl commented 8 months ago

Thank you for the suggestion. I will include it in the comparison.

marco-c commented 8 months ago

I have a local WIP patch and could submit a PR in a few days.

pemistahl commented 8 months ago

Sounds good, PRs are always welcome. :)

pemistahl commented 7 months ago

@marco-c I've now updated the accuracy reports and plots to include FastSpell. It's more accurate than pure FastText, especially in aggressive mode. It also beats Lingua in low accuracy mode most of the time, even though the difference is not that big. Lingua in high accuracy mode is still unchallenged. Phew, lucky me. ;-)

In terms of runtime performance, FastSpell is significantly slower than FastText but on par with single-threaded Lingua in high accuracy mode.

Thanks again for your contribution. Let me know if you plan to use Lingua in one of your projects. I'm always curious.


[Figure: Average Detection Performance]
marco-c commented 7 months ago

@pemistahl The accuracy gets much higher as you add more dictionaries. For Italian, for example, you can look at the results in my comment here: https://github.com/mozilla/firefox-translations-training/issues/248#issuecomment-1807787199. The difference is huge, especially for single words and word pairs. Most of the errors I saw were actually labelling errors in the Wortschatz corpora.

Italian is not yet in the default fastspell configuration, though, so the analysis in the lingua-py repo does not show these improvements yet.

I'm not sure yet which language identification library we will use for Firefox Translations. Lingua seems to be very good and high-performance with the Rust backend, while fastspell seems to be very good for very short sentences. We might consider a mix of them. (Actually, you could implement dictionary lookup as a feature in Lingua, just as fastspell does on top of fastText.)
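To make the dictionary-lookup idea concrete, here is a minimal sketch of refining a base detector's guess with word lists. It is not FastSpell's or Lingua's actual implementation: the function name `refine_with_dictionaries`, the toy `DICTIONARIES`, and the "strictly more hits wins" rule are all assumptions made for illustration, and a real setup would load full dictionaries instead of a handful of words.

```python
from typing import Dict, Set

# Hypothetical toy word lists for two easily confused languages.
# A real system would load complete spell-checking dictionaries.
DICTIONARIES: Dict[str, Set[str]] = {
    "it": {"ciao", "grazie", "casa", "buongiorno", "bella"},
    "es": {"hola", "gracias", "casa", "buenos", "bonita"},
}


def refine_with_dictionaries(
    text: str,
    base_guess: str,
    dictionaries: Dict[str, Set[str]],
) -> str:
    """Return base_guess unless another dictionary matches strictly more tokens.

    base_guess is the language code produced by any base identifier
    (for example a fastText model or Lingua); it is passed in so that
    this sketch stays self-contained.
    """
    # Only refine guesses for languages we actually have a dictionary for.
    if base_guess not in dictionaries:
        return base_guess

    # Very rough tokenisation: split on whitespace, strip punctuation.
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return base_guess

    # Count how many tokens each dictionary recognises.
    hits = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in dictionaries.items()
    }

    best_lang = max(hits, key=lambda lang: hits[lang])
    # Override only on a strict improvement, so ties keep the base guess.
    if hits[best_lang] > hits[base_guess]:
        return best_lang
    return base_guess
```

For instance, `refine_with_dictionaries("grazie, bella casa", "es", DICTIONARIES)` switches the guess to `"it"`, because three tokens appear in the Italian list and only one in the Spanish list. The ambiguous single word `"casa"` keeps whatever the base identifier said.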

A couple of links that might be interesting for you:
https://github.com/Helsinki-NLP/OpusFilter/pull/65
https://github.com/mbanon/fastspell/pull/17

pemistahl commented 7 months ago

Thank you for considering Lingua for Firefox. That would be amazing. :)

If dictionaries add that much accuracy, I will think about adding them to Lingua, but without the performance penalty seen in FastSpell. With Rust, that should be doable.