Proposition: Using prior language probability to increase likelihood

slavaGanzin commented 1 year ago

@pemistahl Peter, I think it would be beneficial for this library to have a separate method that adds a prior probability (in a Bayesian way) to the mix.

Let's look into statistics: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

So if 57% of the texts you see on the internet are in English, then predicting "English" for any input would be wrong only 43% of the time. It's like a stopped clock, except that it is right on roughly every second try.

For example: https://github.com/pemistahl/lingua-py/issues/100

Based on that premise, if we use only plain character statistics, "как дела" is more Macedonian than Russian. But overall, if we add language statistics to the mix, lingua-py would be "wrong" less often.

There are more Russian-speaking users of this library than Macedonian-speaking ones, simply because there are more Russian-speaking people overall. So when a random user writes "как дела", it's "more accurate" to predict Russian than Macedonian, because in general that is what these users expect.

So my proposition is to add a detector.detect_language_with_prior function that weights each result by a prior: likelihood = probability × prior_probability

For example: https://github.com/pemistahl/lingua-py/issues/97

detector.detect_language_of("Hello")

"ITALIAN": 0.9900000000000001,
"SPANISH": 0.8457074930316446,
"ENGLISH": 0.6405700388041755,
"FRENCH": 0.260556921899765,
"GERMAN": 0.01,
"CHINESE": 0,
"RUSSIAN": 0

detector.detect_language_with_prior("Hello")

# Of course constants are for illustrative purposes only.
# Results should be normalized afterwards
"ENGLISH": 0.6405700388041755 * 0.577,
"SPANISH": 0.8457074930316446 * 0.045,
"ITALIAN": 0.9900000000000001 * 0.017,
"FRENCH": 0.260556921899765 * 0.039,
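
A minimal sketch of the weighting and normalization step, assuming the per-language scores are already available as a plain dictionary. The helper name `apply_prior` and the `default_prior` fallback for languages without a listed prior are hypothetical and not part of lingua-py; the scores and priors are the illustrative constants from above.

```python
from typing import Dict


def apply_prior(
    scores: Dict[str, float],
    priors: Dict[str, float],
    default_prior: float = 0.001,  # assumption: small prior for unlisted languages
) -> Dict[str, float]:
    """Multiply each score by its language prior, then normalize to sum to 1."""
    weighted = {
        lang: score * priors.get(lang, default_prior)
        for lang, score in scores.items()
    }
    total = sum(weighted.values())
    if total == 0.0:
        # Nothing survived the weighting; fall back to the unweighted scores.
        return dict(scores)
    return {lang: value / total for lang, value in weighted.items()}


# Scores copied from the "Hello" example; not produced by calling lingua here.
scores = {
    "ITALIAN": 0.9900000000000001,
    "SPANISH": 0.8457074930316446,
    "ENGLISH": 0.6405700388041755,
    "FRENCH": 0.260556921899765,
    "GERMAN": 0.01,
    "CHINESE": 0.0,
    "RUSSIAN": 0.0,
}
# Illustrative priors only, as in the example above.
priors = {"ENGLISH": 0.577, "SPANISH": 0.045, "ITALIAN": 0.017, "FRENCH": 0.039}

posterior = apply_prior(scores, priors)
best = max(posterior, key=posterior.get)
print(best)  # ENGLISH
```

With these constants English wins because its weighted score (about 0.37) is roughly ten times the next one (Spanish, about 0.038), even though its raw score was only third.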


pemistahl commented 1 year ago

Hi @slavaGanzin, thank you for this very interesting idea. :) I will evaluate whether the overall accuracy improves when applying prior probabilities.

duboff commented 1 year ago

I agree this should dramatically increase quality. After using lingua-py in production at scale, we've noticed quite a few instances of small languages (e.g. Bulgarian, Macedonian) predicted over much more likely ones.

nickchomey commented 1 year ago

Another related suggestion - allow us to pass in a dictionary with language:probability pairs to suggest what the language is expected to be, and either use this to break ties or even build it into the model's probability calculation somehow. Beyond just the possibility that such a mechanism might improve results generally, it could give us significantly more control over our specific domains and use cases.

Let's say we're using social media data and we know (or have concluded) the primary language for each user. It would be useful to be able to tell lingua (perhaps even with some sort of probability, calculated from the language breakdown of the user's prior posts) what the expected language might be.

E.g. I post in English 99% of the time, but sometimes I write in Spanish. So, in an ambiguous situation, it would be better to conclude that it is English. But if I had other contextual metadata available (e.g. knowing that the post is from a Spanish-centric group/page/hashtag etc.), the pre-provided probability could be different.

If no argument is passed in, it could use some sort of global default, perhaps the one suggested by OP, which we could override for our own domains with a .env file. This .env file could also make it easier to filter the permissible languages that are normally passed in as an argument: if nothing is passed, use the languages set in env. If nothing is set in env, use all languages.

slavaGanzin commented 1 year ago

@nickchomey The .env approach sounds scary. This can be a second parameter to the function, with default values equal to the general language distribution, which you can override by providing your own.
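
Roughly what I have in mind, as a sketch only: the function name `detect_with_prior`, the `DEFAULT_PRIORS` table and the two-language scores are hypothetical and just illustrate an optional second parameter. Nothing here is existing lingua-py API.

```python
from typing import Dict, Optional

# Assumption: illustrative global defaults, reusing the constants from my first comment.
DEFAULT_PRIORS: Dict[str, float] = {
    "ENGLISH": 0.577,
    "SPANISH": 0.045,
    "FRENCH": 0.039,
    "ITALIAN": 0.017,
}


def detect_with_prior(
    scores: Dict[str, float],
    priors: Optional[Dict[str, float]] = None,
) -> Optional[str]:
    """Return the language with the highest score * prior, or None if all are zero."""
    effective = DEFAULT_PRIORS if priors is None else priors
    # Languages missing from the prior table get weight 0 in this sketch.
    weighted = {lang: s * effective.get(lang, 0.0) for lang, s in scores.items()}
    best = max(weighted, key=weighted.get, default=None)
    if best is None or weighted[best] == 0.0:
        return None
    return best


# Made-up scores for an ambiguous short text.
ambiguous = {"ENGLISH": 0.5, "SPANISH": 0.6}

# No priors passed: the global defaults favour English.
print(detect_with_prior(ambiguous))  # ENGLISH

# Caller knows the post comes from a Spanish-centric group and overrides the defaults.
print(detect_with_prior(ambiguous, {"ENGLISH": 0.1, "SPANISH": 0.9}))  # SPANISH
```

The same call site covers both cases: omit the argument to get the general distribution, or pass a per-user or per-context dictionary when you know more about your domain.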

bhaveshkr commented 1 year ago

> I agree this should dramatically increase quality. After using lingua-py in production at scale, we've noticed quite a few instances of small languages (e.g. Bulgarian, Macedonian) predicted over much more likely ones.

Hi duboff, I find lingua to be extremely slow, around 10-20 strings/sec on a MacBook Pro. Can you suggest an approach to make it usable in a production environment?

pemistahl commented 1 year ago

@bhaveshkr I've just written down some performance tips in the README. You probably want to read them.

duboff commented 1 year ago

@pemistahl It's great to see a new version! I was a bit afraid. Without putting undue pressure on you, do you think you are likely to consider the idea in this issue or something similar any time soon?

> Hi duboff, I find lingua to be extremely slow, around 10-20 strings/sec on a MacBook Pro. Can you suggest an approach to make it usable in a production environment?

I just did exactly what the README told me, but our use case is typically short-ish strings. We run it on AWS Lambda, where it works fine with an increased timeout.

pemistahl commented 12 months ago

> Without putting undue pressure on you, do you think you are likely to consider the idea in this Issue or something similar any time soon?

@duboff Half a year ago or so, I did a quick evaluation of applying hard-coded prior probabilities. But the overall detection accuracy decreased significantly. So the proposed approach in this issue is not as promising as you may expect. I've kept this issue open so far as I think that it's worth doing more experiments in this direction. Not having enough free time is the limiting factor. This is an open source project, however, so feel free to fork and implement improvements yourself. I'm always happy about pull requests.

nickchomey commented 12 months ago

I'm just going to reiterate that I think the approach I suggested is clearly the right one - allow us to pass in our own probabilities rather than have them hardcoded.

https://github.com/pemistahl/lingua-py/issues/101#issuecomment-1421255807
