pemistahl / lingua

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Apache License 2.0

Bad results with Java version #180

Closed aamirbutt closed 1 year ago

aamirbutt commented 1 year ago

I noticed that for a particular string (used in the code below), lingua-py detects the language correctly, but lingua (the Java version) gives bad results.

Here is the Python version:

    detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
    detector.compute_language_confidence_values("Fast shipment. easy order. excellent costumer care!AAA+++")

Output:

    [ConfidenceValue(language=Language.ENGLISH, value=0.4054392221734394),
     ConfidenceValue(language=Language.TAGALOG, value=0.17739618771502366),
     ConfidenceValue(language=Language.FRENCH, value=0.05126428674609979),
     ConfidenceValue(language=Language.DANISH, value=0.045731114924862044),
     ConfidenceValue(language=Language.LATIN, value=0.02444471663598676),
     ConfidenceValue(language=Language.DUTCH, value=0.02432267933976313),
     ConfidenceValue(language=Language.ITALIAN, value=0.020561633912509245)

You can see that the detected language is English, with Tagalog a distant second.

But running the same string through the Java version always gives me Tagalog first. Here is the code:

    // `text` was not defined in the snippet as posted; assumed to be the same test string.
    String text = "Fast shipment. easy order. excellent costumer care!AAA+++";

    // Sets.newHashSet() comes from Guava (com.google.common.collect.Sets).
    Set<Language> languages = Sets.newHashSet();
    languages.addAll(Language.all());

    com.github.pemistahl.lingua.api.LanguageDetector LANGUAGE_DETECTOR =
        LanguageDetectorBuilder.fromLanguages(languages.toArray(new Language[0]))
            .build();

    System.out.printf("Lingua detected language: %s \n", LANGUAGE_DETECTOR.detectLanguageOf(text));
    SortedMap<Language, Double> confidenceValues =
        LANGUAGE_DETECTOR.computeLanguageConfidenceValues(text);
    System.out.println("Confidence Values: " + confidenceValues);

Output:

    Lingua detected language: TAGALOG
    Confidence Values: {TAGALOG=1.0, ENGLISH=0.9775208830833435, DUTCH=0.91419917345047, DANISH=0.9110179543495178, AFRIKAANS=0.894393265247345, LATIN=0.8940380811691284, FRENCH=0.8823490738868713, YORUBA=0.8764988780021667, MAORI=0.8754469156265259, ITALIAN=0.8731745481491089, NYNORSK=0.8615179061889648, XHOSA=0.8594332337379456, SWEDISH=0.8565084338188171, FINNISH=0.8524059057235718, INDONESIAN=0.8513398766517639, TURKISH=0.8496785759925842, BOKMAL=0.8471575379371643, ESPERANTO=0.8467212915420532, WELSH=0.8453010320663452, GERMAN=0.8415506482124329, SOTHO=0.8362245559692383, SWAHILI=0.8316057324409485, PORTUGUESE=0.830149233341217, MALAY=0.8224523663520813, ICELANDIC=0.8180419206619263, ROMANIAN=0.8172013163566589, SPANISH=0.8142555356025696, BASQUE=0.8130630850791931, ALBANIAN=0.8069103360176086, TSWANA=0.7953811883926392, ZULU=0.79490727186203, ESTONIAN=0.7938821315765381, SLOVAK=0.7922097444534302, GANDA=0.785712480545044, TSONGA=0.7855940461158752, CZECH=0.783637285232544, POLISH=0.7821556329727173, SLOVENE=0.7810076475143433, HUNGARIAN=0.7796627283096313, LITHUANIAN=0.770524799823761, IRISH=0.765510082244873, SHONA=0.76349276304245, AZERBAIJANI=0.7579408884048462, CROATIAN=0.7558900117874146, CATALAN=0.7528392672538757, BOSNIAN=0.7519121766090393, VIETNAMESE=0.750431478023529, SOMALI=0.7487541437149048, LATVIAN=0.7310714721679688}
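The two outputs also appear to be on different scales: the Python values are small, while the Java top value is exactly 1.0. The sketch below assumes the Java values are expressed relative to the best candidate (an assumption based only on the output above, not on the library's documentation) and rescales the Python values the same way so the two rankings can be compared side by side. The helpers `relative_to_top` and `ranking` are hypothetical names made up for this illustration, and only the first four entries of each output are copied in.

```python
# Top four confidence values copied from the two outputs in this report.
python_values = {
    "ENGLISH": 0.4054392221734394,
    "TAGALOG": 0.17739618771502366,
    "FRENCH": 0.05126428674609979,
    "DANISH": 0.045731114924862044,
}
java_values = {
    "TAGALOG": 1.0,
    "ENGLISH": 0.9775208830833435,
    "DUTCH": 0.91419917345047,
    "DANISH": 0.9110179543495178,
}


def relative_to_top(values):
    """Rescale so the best candidate gets 1.0 (hypothetical helper)."""
    top = max(values.values())
    return {lang: value / top for lang, value in values.items()}


def ranking(values):
    """Return language names ordered from most to least confident."""
    return sorted(values, key=values.get, reverse=True)


py_rel = relative_to_top(python_values)
java_rel = relative_to_top(java_values)

# Print the two leading candidates and the runner-up's relative score.
print("lingua-py:", ranking(py_rel)[:2], py_rel["TAGALOG"])
print("lingua   :", ranking(java_rel)[:2], java_rel["ENGLISH"])
```

Even after rescaling, English and Tagalog come out in opposite order in the two versions, so the difference in scale alone does not explain the result.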

I also noticed that lingua-py is at version 1.3.2, whereas the latest version available for Java is 1.2.2.

This probably means that the Java version needs to be updated to pick up the new language models. Any plans on doing so?