I noticed that for a particular string (used in the code below), lingua-py detects the language correctly, but lingua (used from Java) gives bad results.
Here is the Python version:
from lingua import LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
detector.compute_language_confidence_values("Fast shipment. easy order. excellent costumer care!AAA+++")

[ConfidenceValue(language=Language.ENGLISH, value=0.4054392221734394),
 ConfidenceValue(language=Language.TAGALOG, value=0.17739618771502366),
 ConfidenceValue(language=Language.FRENCH, value=0.05126428674609979),
 ConfidenceValue(language=Language.DANISH, value=0.045731114924862044),
 ConfidenceValue(language=Language.LATIN, value=0.02444471663598676),
 ConfidenceValue(language=Language.DUTCH, value=0.02432267933976313),
 ConfidenceValue(language=Language.ITALIAN, value=0.020561633912509245)
You can see that the detected language is English, with TAGALOG a distant second.
But running the same text through the Java version always ranks TAGALOG first.
Here is the code:
// Text under test (the same string as in the Python example).
String text = "Fast shipment. easy order. excellent costumer care!AAA+++";

Set<Language> languages = Sets.newHashSet();
languages.addAll(Language.all());
com.github.pemistahl.lingua.api.LanguageDetector LANGUAGE_DETECTOR =
    LanguageDetectorBuilder.fromLanguages(languages.toArray(new Language[0]))
        .build();

System.out.printf("Lingua detected language: %s \n", LANGUAGE_DETECTOR.detectLanguageOf(text));
SortedMap<Language, Double> confidenceValues = LANGUAGE_DETECTOR.computeLanguageConfidenceValues(text);
System.out.println("Confidence Values: " + confidenceValues);
Output:
Lingua detected language: TAGALOG
Confidence Values: {TAGALOG=1.0, ENGLISH=0.9775208830833435, DUTCH=0.91419917345047, DANISH=0.9110179543495178, AFRIKAANS=0.894393265247345, LATIN=0.8940380811691284, FRENCH=0.8823490738868713, YORUBA=0.8764988780021667, MAORI=0.8754469156265259, ITALIAN=0.8731745481491089, NYNORSK=0.8615179061889648, XHOSA=0.8594332337379456, SWEDISH=0.8565084338188171, FINNISH=0.8524059057235718, INDONESIAN=0.8513398766517639, TURKISH=0.8496785759925842, BOKMAL=0.8471575379371643, ESPERANTO=0.8467212915420532, WELSH=0.8453010320663452, GERMAN=0.8415506482124329, SOTHO=0.8362245559692383, SWAHILI=0.8316057324409485, PORTUGUESE=0.830149233341217, MALAY=0.8224523663520813, ICELANDIC=0.8180419206619263, ROMANIAN=0.8172013163566589, SPANISH=0.8142555356025696, BASQUE=0.8130630850791931, ALBANIAN=0.8069103360176086, TSWANA=0.7953811883926392, ZULU=0.79490727186203, ESTONIAN=0.7938821315765381, SLOVAK=0.7922097444534302, GANDA=0.785712480545044, TSONGA=0.7855940461158752, CZECH=0.783637285232544, POLISH=0.7821556329727173, SLOVENE=0.7810076475143433, HUNGARIAN=0.7796627283096313, LITHUANIAN=0.770524799823761, IRISH=0.765510082244873, SHONA=0.76349276304245, AZERBAIJANI=0.7579408884048462, CROATIAN=0.7558900117874146, CATALAN=0.7528392672538757, BOSNIAN=0.7519121766090393, VIETNAMESE=0.750431478023529, SOMALI=0.7487541437149048, LATVIAN=0.7310714721679688}
I also noticed that lingua-py is at version 1.3.2, whereas the latest version available for Java is 1.2.2.
This probably means that the Java version needs to be updated to pick up the new language models. Any plans on doing so?