optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

Short text recognition #65

Open D063520 opened 7 years ago

D063520 commented 7 years ago

Hi,

thank you for providing this library! I am interested in very short texts such as "capital Italy". With the other version of this library, https://github.com/shuyo/language-detection, I got quite good results, but with this version the results are different. Is it a matter of configuration? Do you have any idea what could cause this? I use: TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingShortCleanText();

Here are some examples that worked in the "previous version":

eclectice commented 7 years ago

You can check my fork, to which I've added a build.gradle for building a pure Java library with Android Studio 2.2: https://github.com/eclectice/language-detector

In my version, I have added more shorttext language resources and more shorttext data to DataLanguageDetectorImplTest.java, which needs the TestNG framework to run (you need to enable the test option useTestNG() and disable useJUnit() in build.gradle):

    @DataProvider
    protected Object[][] shortCleanTexts() {
        return new Object[][] {
                {"en", shortCleanText("This is some English text.")},
                {"fr", shortCleanText("Ceci est un texte français.")},
                {"nl", shortCleanText("Dit is een Nederlandse tekst.")},
                {"de", shortCleanText("Dies ist eine deutsche Text")},
                {"km", shortCleanText("សព្វវចនាធិប្បាយសេរីសម្រាប់អ្នកទាំងអស់គ្នា។" + "នៅក្នុងវិគីភីឌាភាសាខ្មែរឥឡូវនេះមាន ១១៩៨រូបភាព សមាជិក១៥៣៣៣នាក់ និងមាន៤៥៨៣អត្ថបទ។")},
                {"bg", shortCleanText("Европа не трябва да стартира нов конкурентен маратон и изход с приватизация")},
                {"it", shortCleanText("Persone nate a padova")},
                {"it", shortCleanText("attori canada")},
                {"de", shortCleanText("Was ist die hauptstadt von kanada")},
                {"pl", shortCleanText("I Kanadyjczycy")},
                {"en", shortCleanText("actors from Canada")},
        };
    }
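If you want to try rows like these without setting up TestNG, something along these lines might work as a rough sketch. It only reuses the {expected language, text} pairing from the data provider above; the class name `ShortTextCheck`, the `mismatches` helper, and the always-"en" placeholder detector are all made up for illustration and are not part of the library. In real use you would pass a function that wraps the actual detector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ShortTextCheck {

    /** Rows shaped like the data provider above: {expected language code, text}. */
    static final String[][] SAMPLES = {
            {"it", "Persone nate a padova"},
            {"it", "attori canada"},
            {"de", "Was ist die hauptstadt von kanada"},
            {"pl", "I Kanadyjczycy"},
            {"en", "actors from Canada"},
    };

    /**
     * Runs every sample through the given detector and returns the rows that
     * did not match, each as {expected, detected, text}.
     */
    static List<String[]> mismatches(String[][] samples, Function<String, String> detector) {
        List<String[]> failed = new ArrayList<>();
        for (String[] row : samples) {
            String detected = detector.apply(row[1]);
            if (!row[0].equals(detected)) {
                failed.add(new String[] {row[0], String.valueOf(detected), row[1]});
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Placeholder detector (hypothetical): always answers "en".
        // Swap in a function that calls the real detector here.
        Function<String, String> alwaysEnglish = text -> "en";
        for (String[] miss : mismatches(SAMPLES, alwaysEnglish)) {
            System.out.printf("expected=%s detected=%s text=\"%s\"%n",
                    miss[0], miss[1], miss[2]);
        }
    }
}
```

Listing the mismatches rather than failing on the first one makes it easier to see which short inputs are affected when comparing configurations.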