When I run detection on text that contains nearly equal amounts of French and English, I consistently get a confidence value of 1.0 for French and 0.0 for English.
Here’s a minimal example:
from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
text = "In today’s interconnected world, learning multiple languages has become more important than ever. With globalization, the ability to communicate across cultures is a significant advantage. Dans le monde interconnecté d'aujourd'hui, apprendre plusieurs langues est plus important que jamais. Avec la mondialisation, la capacité à communiquer à travers les cultures est un avantage considérable."
confidence_values = detector.compute_language_confidence_values(text)
for confidence in confidence_values:
    print(f"{confidence.language.name}: {confidence.value:.2f}")
Output:
FRENCH: 1.00
ENGLISH: 0.00
I attempted to reduce the influence of words shared between French and English by adding with_minimum_relative_distance to the detector builder. However, the output remains the same.
When I replace one of the French sentences with an additional English sentence, the confidence shifts entirely to English:
text = "In today’s interconnected world, learning multiple languages has become more important than ever. With globalization, the ability to communicate across cultures is a significant advantage. Being multilingual not only enhances communication but also opens up opportunities for personal and professional growth. Dans le monde interconnecté d'aujourd'hui, apprendre plusieurs langues est plus important que jamais."
Output:
ENGLISH: 1.00
FRENCH: 0.00
Is there a way to adjust the confidence values so that they better reflect the actual balance of languages in the text?
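To make the goal concrete, here is a minimal sketch of the kind of result I'm after: a per-language share computed from word counts. It assumes the text has already been split into language-labelled segments (possibly something lingua's multi-language detection could provide, though I have not verified that for this case). The function name language_shares and the hand-labelled sample segments are hypothetical and not part of lingua.

```python
from collections import Counter


def language_shares(segments):
    """Return each language's share of the total word count.

    `segments` is a list of (language_name, text) pairs. How the
    segments are produced is outside the scope of this sketch.
    """
    counts = Counter()
    for language, text in segments:
        # Whitespace splitting is a rough word count, good enough for a ratio.
        counts[language] += len(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {language: count / total for language, count in counts.items()}


# Hypothetical, hand-labelled segments taken from the example text.
segments = [
    ("ENGLISH", "With globalization, the ability to communicate across cultures is a significant advantage."),
    ("FRENCH", "Avec la mondialisation, la capacité à communiquer à travers les cultures est un avantage considérable."),
]

shares = language_shares(segments)
for language, share in shares.items():
    print(f"{language}: {share:.2f}")
```

For text like the examples above, I would expect values near an even split rather than 1.00 and 0.00.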