pemistahl / lingua

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Apache License 2.0

IsoCode639_1 is ambiguous #163

Closed: bweben closed this issue 1 year ago

bweben commented 1 year ago

Hi

I'm trying to add a new language, following your contribution guide. I ran into a problem: Swiss German does not have a unique IsoCode639_1 value and would probably share the one used for German. Its IsoCode639_3 value is GSW.

What is your preferred way to go forward? Add an imaginary IsoCode639_1 code? Or maybe add subfolders with the IsoCode639_3 code?

Thanks for your response and this awesome project.

pemistahl commented 1 year ago

Hi Nathanael, thank you for reaching out to me.

It's great that you want to contribute a Swiss German language model to Lingua. I think it's perfectly valid to use the existing IsoCode639_1.DE for Swiss German. You then just need to add IsoCode639_3.GSW, as you have already found out. I haven't tried it yet, but there should not be any problems if you assign the same ISO code to more than one language. It is the correct way to do it. If you encounter any problems with this approach, please let me know.
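To illustrate the idea, here is a minimal sketch of two languages sharing one two-letter code while keeping distinct three-letter codes. It is not Lingua's actual source: the class `SharedIsoCodeSketch`, the reduced enums, and the lookup helpers `byIsoCode639_1` and `byIsoCode639_3` are hypothetical names chosen for this example. The one consequence worth noting is that a lookup by the two-letter code can no longer assume a single result.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedIsoCodeSketch {

    // Hypothetical, reduced set of two-letter codes. Swiss German gets no entry of its own.
    enum IsoCode639_1 { DE, EN }

    // Hypothetical, reduced set of three-letter codes. Every language has a distinct one.
    enum IsoCode639_3 { DEU, ENG, GSW }

    // Hypothetical language enum: GERMAN and SWISS_GERMAN both map to IsoCode639_1.DE.
    enum Language {
        ENGLISH(IsoCode639_1.EN, IsoCode639_3.ENG),
        GERMAN(IsoCode639_1.DE, IsoCode639_3.DEU),
        SWISS_GERMAN(IsoCode639_1.DE, IsoCode639_3.GSW);

        final IsoCode639_1 isoCode639_1;
        final IsoCode639_3 isoCode639_3;

        Language(IsoCode639_1 isoCode639_1, IsoCode639_3 isoCode639_3) {
            this.isoCode639_1 = isoCode639_1;
            this.isoCode639_3 = isoCode639_3;
        }
    }

    // A two-letter code may match several languages, so return all of them.
    static List<Language> byIsoCode639_1(IsoCode639_1 code) {
        List<Language> matches = new ArrayList<>();
        for (Language language : Language.values()) {
            if (language.isoCode639_1 == code) {
                matches.add(language);
            }
        }
        return matches;
    }

    // A three-letter code is unique in this sketch, so a single result is enough.
    static Language byIsoCode639_3(IsoCode639_3 code) {
        for (Language language : Language.values()) {
            if (language.isoCode639_3 == code) {
                return language;
            }
        }
        throw new IllegalArgumentException("No language for code " + code);
    }

    public static void main(String[] args) {
        System.out.println(byIsoCode639_1(IsoCode639_1.DE));  // prints [GERMAN, SWISS_GERMAN]
        System.out.println(byIsoCode639_3(IsoCode639_3.GSW)); // prints SWISS_GERMAN
    }
}
```

In this sketch the three-letter code stays the unambiguous key, which is why it is the natural choice wherever exactly one language must be identified.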

Looking forward to your PR. Thanks a lot. And best regards to Switzerland. :)

bweben commented 1 year ago

Hi Peter

OK, I created a PR now. I had to change some things, see https://github.com/pemistahl/lingua/pull/164

Many thanks, and best regards back to Germany :D