pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0
1.08k stars, 44 forks

Please provide performance metrics in the benchmarks #122

Closed (nickchomey closed 8 months ago)

nickchomey commented 1 year ago

I'm impressed by the accuracy of Lingua compared to even fastText, but it would be very useful to also see performance metrics (speed) in the benchmarks to determine whether that accuracy comes at a cost. Likewise, such metrics would be useful for comparing Lingua's low and high accuracy modes.
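To make the low vs. high accuracy comparison concrete, here is a rough sketch of how the two modes could be timed side by side. It assumes the `lingua-language-detector` package is installed and uses `LanguageDetectorBuilder` with `with_low_accuracy_mode()`; the `time_detector` helper, the four chosen languages, and the sample sentences are my own illustrative choices, not something from the repo.

```python
import time


def time_detector(detect, texts, repeats=3):
    """Return the best-of-`repeats` mean time in seconds per text.

    `detect` is any callable taking a string (hypothetical helper,
    not part of Lingua).
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            detect(text)
        best = min(best, time.perf_counter() - start)
    return best / len(texts)


if __name__ == "__main__":
    try:
        from lingua import Language, LanguageDetectorBuilder
    except ImportError:
        print("lingua-language-detector is not installed; skipping the demo")
    else:
        # Assumption: a small language set keeps model loading quick.
        languages = [Language.ENGLISH, Language.FRENCH,
                     Language.GERMAN, Language.SPANISH]
        high = LanguageDetectorBuilder.from_languages(*languages).build()
        low = (LanguageDetectorBuilder.from_languages(*languages)
               .with_low_accuracy_mode()
               .build())

        # Made-up sample sentences, purely for illustration.
        texts = [
            "languages are awesome",
            "les langues sont formidables",
            "Sprachen sind großartig",
            "los idiomas son geniales",
        ]

        for name, detector in (("high accuracy", high), ("low accuracy", low)):
            # Warm-up call so lazily loaded models do not skew the timing.
            detector.detect_language_of(texts[0])
            per_text = time_detector(detector.detect_language_of, texts)
            print(f"{name}: {per_text * 1000:.3f} ms per text")
```

The numbers will depend heavily on text length, the set of languages, and the machine, so they are only meaningful relative to each other.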

pemistahl commented 1 year ago

Section 9.5 of the README says: "Lingua's high detection accuracy comes at the cost of being noticeably slower than other language detectors."

The statistical models in Lingua are larger than those of similar libraries, so querying them takes more time.

There is a benchmark script in this repo that gives you an idea of how the library performs. You can run it locally with poetry:

poetry run python3 scripts/benchmark.py

nickchomey commented 1 year ago

Thanks, I'll have to give that a try and share some rough results here. I do think it would be useful to present such stats in the official benchmark comparisons, since otherwise there's no way to know what "noticeably slower" means. I know that fastText and CLD2 tend to be exceptionally fast, so perhaps noticeably slower is still quite acceptable. But if it's a difference of 0.001s vs 1s per detection, then obviously that's a problem.
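As a sketch of what such stats might look like, the snippet below turns per-call timings into a median, an approximate 95th percentile, a throughput figure, and a slowdown ratio between two detectors. All function names here are hypothetical, and `detect` stands for any callable that takes a string (for example a detector's `detect_language_of` method); nothing in it is taken from the repo's benchmark script.

```python
import statistics
import time


def measure_latencies(detect, texts):
    """Time each call to `detect` and return the latencies in seconds."""
    latencies = []
    for text in texts:
        start = time.perf_counter()
        detect(text)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Reduce raw latencies to a few comparable numbers."""
    ordered = sorted(latencies)
    # Simple index-based approximation of the 95th percentile.
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    total = sum(ordered)
    return {
        "median_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
        "texts_per_s": len(ordered) / total if total > 0 else float("inf"),
    }


def slowdown(baseline, candidate):
    """How many times slower `candidate` is than `baseline` (by median)."""
    return candidate["median_s"] / baseline["median_s"]
```

With summaries like these, "0.001s vs 1s" becomes a slowdown factor of roughly 1000, which is much easier to judge than "noticeably slower".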

datatalking commented 1 year ago

@nickchomey I'm relatively new to this repo, but it has more languages than the translation repo I have been using. I could help test and produce an "output chart", or help craft and then submit a PR for this, so I'm willing to collaborate with you to look at a few options for generating the stats.

nickchomey commented 1 year ago

@datatalking this isn't a focus for me at the moment and probably won't be for at least a few months, so I'm not able to collaborate on anything. But if you have the time and desire to do so, that would be great!

pemistahl commented 8 months ago

Performance metrics are now provided in the README.