Open BLKSerene opened 2 years ago
Hi @BLKSerene, thank you for your request.
The library already distinguishes between Bokmål and Nynorsk. As for Simplified and Traditional Chinese, I have not yet found suitable training corpora that consist solely of either Simplified or Traditional Chinese. Do you perhaps know a good source for those?
There are two UD Chinese corpora.
Simplified Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSDSimp
Traditional Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSD
What are the requirements for the training data? And the license?
Ah, those look suitable, thank you.
For LanguageModelFilesWriter to be able to create the language models, it needs training data in plain text without any annotations. So I would first need a custom parser for the UD files. The license must permit using the language models created from the training data.
The conllu package should suffice for parsing UD corpora: https://github.com/EmilStenstrom/conllu
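For anyone who wants to try the plain-text extraction discussed above, here is a minimal sketch that relies only on the Python standard library rather than the conllu package. It assumes the corpus files carry the usual `# text = ...` sentence comment of the CoNLL-U format; the function names and the sample sentences are made up for illustration and are not part of lingua or of the UD corpora.

```python
import re
from typing import Iterable, Iterator

# CoNLL-U stores the raw sentence in a comment line such as "# text = ...".
# The pattern deliberately does not match related keys like "# text_en = ...".
TEXT_COMMENT = re.compile(r"#\s*text\s*=\s*(.*)")


def iter_sentences(conllu_lines: Iterable[str]) -> Iterator[str]:
    """Yield the raw sentence text of every sentence block (hypothetical helper)."""
    for line in conllu_lines:
        match = TEXT_COMMENT.match(line.rstrip("\n"))
        if match:
            yield match.group(1).strip()


def conllu_to_plain_text(conllu_lines: Iterable[str]) -> str:
    """Join all sentences, one per line, dropping every annotation column."""
    return "\n".join(iter_sentences(conllu_lines))


# Tiny made-up sample in CoNLL-U layout (10 tab-separated columns per token).
sample = [
    "# sent_id = demo-1",
    "# text = 我们学习汉语。",
    "1\t我们\t我们\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\t学习\t学习\tVERB\t_\t_\t0\troot\t_\t_",
    "",
    "# sent_id = demo-2",
    "# text = 你好。",
    "1\t你好\t你好\tINTJ\t_\t_\t0\troot\t_\t_",
]

plain = conllu_to_plain_text(sample)
print(plain)
```

With a real corpus, the same function could be fed an open file object, since iterating over a file yields its lines. Whether the resulting text is clean enough for model training would still need to be checked.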
+1 on the feature request 🙏
If it helps anyone: in the meantime, I've had some success identifying Traditional and Simplified Chinese with hanzidentifier, which is based on zhon.
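To illustrate the general idea behind this kind of character-based check, here is a rough sketch. It is not hanzidentifier's code and does not use its API. It assumes that a text can be classified by looking for characters that occur in only one of the two scripts, and the two character tables below are a tiny hand-picked sample chosen for this example, far from complete.

```python
# Hypothetical, deliberately tiny tables of characters that differ between
# the two scripts (们/們, 这/這, 国/國, 语/語, 汉/漢, 书/書, 学/學).
SIMPLIFIED_ONLY = set("们这国语汉书学")
TRADITIONAL_ONLY = set("們這國語漢書學")


def classify_script(text: str) -> str:
    """Return 'simplified', 'traditional', 'mixed' or 'undetermined'."""
    chars = set(text)
    has_simplified = bool(chars & SIMPLIFIED_ONLY)
    has_traditional = bool(chars & TRADITIONAL_ONLY)
    if has_simplified and has_traditional:
        return "mixed"
    if has_simplified:
        return "simplified"
    if has_traditional:
        return "traditional"
    # No distinctive character found: the text may be valid in both scripts.
    return "undetermined"


print(classify_script("我们学习汉语"))  # contains 们, 学, 汉, 语
print(classify_script("我們學習漢語"))  # contains 們, 學, 漢, 語
print(classify_script("你好"))          # no character from either table
```

A real tool would need full character inventories for both scripts, which is the part a dedicated library takes care of.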
Hi, I'm wondering whether it is possible for lingua to distinguish between variants of the same language, for example Simplified Chinese and Traditional Chinese, or Norwegian Bokmål and Norwegian Nynorsk. AFAIK, langdetect can distinguish between Simplified and Traditional Chinese, while other alternatives can't.