pemistahl / lingua

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Apache License 2.0

feat: add swiss german as a language #164

Open bweben opened 1 year ago

bweben commented 1 year ago

Hello

I added Swiss German as another language. To do that, I had to move the training files into subfolders named after the ISO 639-3 code, because the ISO 639-1 code is not unique between German and Swiss German. For the same reason, I also had to rename the test files. If this change is not OK, I am open to suggestions on how to "fix" this problem :)
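To illustrate why the folder name had to change, here is a minimal sketch of the collision. It is not lingua's actual code: the `Lang` class, the `language-models/` prefix, and the method names are hypothetical, and assigning `de` to Swiss German is an assumption that mirrors the description above. The ISO 639-3 codes `deu` and `gsw` are the standard ones.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TrainingDataLayout {

    // Hypothetical stand-in for a language entry; not lingua's Language enum.
    static final class Lang {
        final String name;
        final String iso6391;
        final String iso6393;

        Lang(String name, String iso6391, String iso6393) {
            this.name = name;
            this.iso6391 = iso6391;
            this.iso6393 = iso6393;
        }
    }

    static final List<Lang> LANGS = List.of(
        new Lang("GERMAN", "de", "deu"),
        // Assumption: Swiss German shares "de" as its ISO 639-1 code here.
        new Lang("SWISS_GERMAN", "de", "gsw")
    );

    // Folder names when keyed by ISO 639-1: both languages map to the same folder.
    static Set<String> directoriesByIso6391() {
        Set<String> dirs = new LinkedHashSet<>();
        for (Lang lang : LANGS) {
            dirs.add("language-models/" + lang.iso6391);
        }
        return dirs;
    }

    // Folder names when keyed by ISO 639-3: every language gets its own folder.
    static Set<String> directoriesByIso6393() {
        Set<String> dirs = new LinkedHashSet<>();
        for (Lang lang : LANGS) {
            dirs.add("language-models/" + lang.iso6393);
        }
        return dirs;
    }

    public static void main(String[] args) {
        System.out.println(directoriesByIso6391()); // one folder for two languages
        System.out.println(directoriesByIso6393()); // two distinct folders
    }
}
```

With two languages sharing one ISO 639-1 code, only the ISO 639-3 keyed layout keeps their training files apart.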

The accuracy is not that great, but this was kind of expected, as Swiss German is pretty similar to German. Maybe better training data could fix this. However, because of the "grouping" by ISO 639-1 code, it is probably possible to get a prediction for Swiss German and German simultaneously, thus "improving" the accuracy, as far as I understand.
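One way to read the "grouping" idea is to count a prediction as correct whenever it lands in the same ISO 639-1 group as the expected language. The sketch below shows that reading only. The class, the code mapping, and the sample pairs are all hypothetical; the pairs are made-up illustrations, not measured results, and this is not how lingua computes accuracy.

```java
import java.util.List;
import java.util.Map;

public class GroupedAccuracy {

    // Assumed mapping from ISO 639-3 code to the shared ISO 639-1 code.
    static final Map<String, String> ISO_639_1 =
        Map.of("deu", "de", "gsw", "de", "eng", "en");

    // Falls back to the code itself when no ISO 639-1 mapping is known.
    static String groupOf(String iso6393) {
        return ISO_639_1.getOrDefault(iso6393, iso6393);
    }

    // Share of pairs {expected, predicted} that count as correct.
    // grouped == false: codes must be identical.
    // grouped == true:  codes only need to share an ISO 639-1 group.
    static double accuracy(List<String[]> pairs, boolean grouped) {
        int hits = 0;
        for (String[] pair : pairs) {
            boolean match = grouped
                ? groupOf(pair[0]).equals(groupOf(pair[1]))
                : pair[0].equals(pair[1]);
            if (match) {
                hits++;
            }
        }
        return (double) hits / pairs.size();
    }

    public static void main(String[] args) {
        // Made-up {expected, predicted} pairs, for illustration only.
        List<String[]> pairs = List.of(
            new String[] {"gsw", "gsw"},
            new String[] {"gsw", "deu"},
            new String[] {"deu", "deu"},
            new String[] {"deu", "eng"});
        System.out.println("strict:  " + accuracy(pairs, false));
        System.out.println("grouped: " + accuracy(pairs, true));
    }
}
```

Under this reading, a Swiss German sentence detected as German still counts as a hit, which is why the grouped number can only be equal to or higher than the strict one.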

I got all the data from here. I used the 2021 Wikipedia 100k for training and the 2017 Web 100k for testing.

Thanks for your feedback :)