pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Distinguish between different variations of the same language #46

Open BLKSerene opened 2 years ago

BLKSerene commented 2 years ago

Hi, I'm wondering whether lingua can distinguish between variations of the same language, for example Simplified Chinese and Traditional Chinese, or Norwegian Bokmål and Norwegian Nynorsk. AFAIK, langdetect can distinguish between Simplified and Traditional Chinese, while the other alternatives can't.

pemistahl commented 2 years ago

Hi @BLKSerene, thank you for your request.

The library already distinguishes between Bokmål and Nynorsk. As for Simplified and Traditional Chinese, I have not yet found suitable training corpora that consist solely of either Simplified or Traditional Chinese. Do you perhaps know a good source for those?

BLKSerene commented 2 years ago

There are two UD Chinese corpora:

Simplified Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSDSimp

Traditional Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSD

What are the requirements for the training data, and for its license?

pemistahl commented 2 years ago

Ah, those look suitable, thank you.

For LanguageModelFilesWriter to be able to create the language models, it needs training data in plain text without any annotations. So I would first need to run the UD files through a custom parser. The license should allow the use of the language models created from the training data.
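As a rough illustration of what such a parser could look like: UD corpora are distributed in CoNLL-U format, where each sentence block carries its raw sentence in a `# text = ...` comment line. The sketch below uses only the standard library to pull those sentences out as plain text. The function name `sentences_from_conllu` and the sample data are made up for this example, and writing one sentence per line is an assumption about what the training input should look like, not something confirmed in this thread.

```python
import re
from typing import Iterable, Iterator

# Matches the sentence-level comment that CoNLL-U uses for the raw text.
TEXT_COMMENT = re.compile(r"^#\s*text\s*=\s*(.*)$")


def sentences_from_conllu(lines: Iterable[str]) -> Iterator[str]:
    """Yield the raw sentence from each `# text = ...` comment line."""
    for line in lines:
        match = TEXT_COMMENT.match(line.rstrip("\n"))
        if match:
            sentence = match.group(1).strip()
            if sentence:
                yield sentence


# Made-up sample in CoNLL-U layout (not taken from the UD corpora).
sample = (
    "# sent_id = demo-1\n"
    "# text = 我愛學習。\n"
    "1\t我\t我\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\t愛\t愛\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "# text = 今天天氣很好。\n"
    "1\t今天\t今天\tNOUN\t_\t_\t3\tnmod\t_\t_\n"
)

# Annotation columns and other comments are dropped; only sentences remain.
plain_text = "\n".join(sentences_from_conllu(sample.splitlines()))
print(plain_text)
```

For a real corpus, the same function could be fed an open file object instead of `sample.splitlines()`, with the result written to a plain-text file.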

BLKSerene commented 2 years ago

The conllu package should suffice for parsing UD corpora: https://github.com/EmilStenstrom/conllu

yanqianglu commented 1 year ago

+1 on the feature request 🙏

yudelevi commented 4 months ago

If it helps anyone: in the meantime, I've had some success identifying Traditional and Simplified Chinese with hanzidentifier, which is based on zhon.
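For anyone curious about the general idea behind character-based identification, here is a toy sketch using only the standard library. It is not hanzidentifier's actual code or API. The two character tables are tiny, hand-picked examples, and the function name `classify_script` and its return labels are hypothetical. A real tool needs complete character lists such as those shipped with zhon.

```python
# Tiny illustrative tables: characters that differ between the two scripts.
# Each simplified character corresponds to the traditional one at the same
# position. These are NOT complete lists.
SIMPLIFIED_ONLY = set("们国学语汉简")
TRADITIONAL_ONLY = set("們國學語漢簡")


def classify_script(text: str) -> str:
    """Label text by which script-specific characters it contains."""
    chars = set(text)
    has_simplified = bool(chars & SIMPLIFIED_ONLY)
    has_traditional = bool(chars & TRADITIONAL_ONLY)
    if has_simplified and has_traditional:
        return "mixed"
    if has_simplified:
        return "simplified"
    if has_traditional:
        return "traditional"
    # Only shared or unknown characters: the tables cannot decide.
    return "undetermined"


for example in ("我们学习汉语", "我們學習漢語", "你好"):
    print(example, classify_script(example))
```

Text made up only of characters shared by both scripts cannot be assigned to either one, which is why the sketch returns a separate label for that case.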