optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

Adding recognition of Walloon (wa) language #44

Open srtxg opened 8 years ago

srtxg commented 8 years ago

Hello, I'm working on adding the Walloon language to LanguageTool, which in turn requires proper language detection from language-detector. I don't see any clear instructions on how to generate a profile, so, as suggested, I'm attaching some text files: http://chanae.walon.org/walon/wa.zip It's a small zip file with some random pages from Wikipedia and rifondou.walon.org (from the latter, I only took texts more than 70 years old); it's about 2MB of text. The zip includes plain-text dumps as well as the HTML pages (which most often include lang=... tags, in case that is useful for you).

Another thing to know about Walloon is that there are actually two ways of writing it: a "unified orthography" called "rifondou" (the one used in those texts), and a traditional "Feller" one, which places a lot of emphasis on local accent and phonetics, with the consequence that it is actually not one orthography but a group of orthographies (at the very least there are four main groups: western, central, eastern and southern).

What would be the best thing to do:

Thanks.

Attachment: wa.zip

srtxg commented 8 years ago

OK, I managed to create the profile with help from rmtheis, and I opened a pull request (#50) with it.

fabiankessler commented 8 years ago

Thank you! Walloon is in now. Can you tell us which way you went? Is the language profile only rifondou, or more?

srtxg commented 8 years ago

Thanks. The pull request I made covers only the normalized orthography ("rifondou").

Currently all the Walloon language tools (such as the spell checker and the early grammar-checking work in LanguageTool) use the normalized orthography. However, a tool that can easily and automatically tell which variant/dialect a text is written in could be handy. I have a meeting this month and will bring up the topic to see what other people think about it.

james-s-w-clark commented 4 years ago

Re: creating language profiles, the instructions are at https://github.com/optimaize/language-detector/wiki/Creating-Language-Profiles
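For anyone wondering what a profile roughly contains: it is essentially a table of character n-gram frequencies gathered from sample text in one language. Below is a minimal, standard-library-only sketch of that idea. It is not language-detector's actual code or API; the class name `NgramProfileSketch`, the methods `countNgrams` and `prune`, the normalization (lowercasing, collapsing whitespace) and the sample sentence are all assumptions made for illustration. Follow the wiki page for the real procedure.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy character n-gram counter; an illustration only, not the library's implementation. */
public class NgramProfileSketch {

    /**
     * Counts every character n-gram of length 1..maxN in the text.
     * Lowercasing and whitespace collapsing are assumptions of this sketch.
     */
    public static Map<String, Integer> countNgrams(String text, int maxN) {
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            // Slide a window of width n over the normalized text.
            for (int i = 0; i + n <= normalized.length(); i++) {
                counts.merge(normalized.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Keeps only n-grams seen at least minFrequency times, to reduce noise. */
    public static Map<String, Integer> prune(Map<String, Integer> counts, int minFrequency) {
        Map<String, Integer> kept = new HashMap<>();
        counts.forEach((gram, count) -> {
            if (count >= minFrequency) {
                kept.put(gram, count);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample sentence; a real profile needs far more text than this.
        Map<String, Integer> profile = prune(countNgrams("Li walon est on lingaedje", 3), 2);
        System.out.println(profile);
    }
}
```

With a corpus like the roughly 2MB of rifondou text mentioned earlier in this thread, the frequent n-grams that survive pruning are what distinguish one language (or one orthography) from another. A separate profile built from Feller-orthography text would be one conceivable way to tell the variants apart, though nothing in this thread confirms that was done.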