pemistahl / lingua-rs

The most accurate natural language detection library for Rust, suitable for short text and mixed-language text
Apache License 2.0

Proposal: New Language Models and Discussion on Norwegian Variants #324

Open kareglazie opened 6 months ago

kareglazie commented 6 months ago

Hello,

Thank you for your great Lingua crate!

While adapting Lingua to our production environment and requirements, we have been extending its language support. We believe these additions would also benefit the wider Lingua community, and we would like to contribute them upstream.

Added Language Models

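We have introduced models for the following languages. The columns give detection accuracy in percent for low-accuracy mode (low-ac) and high-accuracy mode (high-ac): the average, then single words, word pairs, and sentences.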

Language    avg-low-ac  single-low-ac  pairs-low-ac  sent-low-ac  avg-high-ac  single-high-ac  pairs-high-ac  sent-high-ac
Amharic     100         100            100           100          100          100             100            100
Burmese     99          100            100           99           100          100             100            100
Chechen     83          77             85            86           86           86              88             86
Kyrgyz      54          37             37            89           58           45              41             89
Malayalam   100         100            100           100          100          100             100            100
Nepali      35          13             26            66           41           21              29             72
Pashto      79          63             76            97           89           7               92             99
Sanskrit    40          19             34            67           56           37              49             82
Sinhala     100         100            100           100          100          100             100            100
Sindhi      66          49             60            89           87           73              89             98
Tatar       43          21             29            80           47           26              34             80
Tajik       79          65             73            98           89           81              85             99
Turkmen     28          44             16            23           30           48              17             23
Uzbek       90          82             88            99           96           92              97             99
Lao         99          100            100           99           99           99              100            99
Khmer       100         100            100           100          100          100             100            100
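
For readers who want to sanity-check the figures, here is a minimal sketch in Rust. It assumes that each avg column is the arithmetic mean of the single, pairs, and sent columns for the same mode, and it allows one point of slack because the posted values are rounded. That relationship is an assumption, not something stated in this thread, and the names `Row`, `is_consistent`, `inconsistent_rows`, and `posted_table` are hypothetical helpers, not part of the Lingua crate.

```rust
/// One row of the posted table. Each array holds
/// [avg, single words, word pairs, sentences] for one accuracy mode.
struct Row {
    name: &'static str,
    low: [f64; 4],
    high: [f64; 4],
}

/// True when the avg value is within `tol` points of the mean of the
/// other three values. Assumes avg is meant to be their arithmetic mean.
fn is_consistent(v: &[f64; 4], tol: f64) -> bool {
    let mean = (v[1] + v[2] + v[3]) / 3.0;
    (v[0] - mean).abs() <= tol
}

/// Lists (language, mode) pairs whose avg column fails the check.
fn inconsistent_rows(rows: &[Row], tol: f64) -> Vec<(&'static str, &'static str)> {
    let mut flagged = Vec::new();
    for row in rows {
        if !is_consistent(&row.low, tol) {
            flagged.push((row.name, "low"));
        }
        if !is_consistent(&row.high, tol) {
            flagged.push((row.name, "high"));
        }
    }
    flagged
}

/// Values copied verbatim from the table above; nothing is corrected here.
fn posted_table() -> Vec<Row> {
    vec![
        Row { name: "Amharic", low: [100.0, 100.0, 100.0, 100.0], high: [100.0, 100.0, 100.0, 100.0] },
        Row { name: "Burmese", low: [99.0, 100.0, 100.0, 99.0], high: [100.0, 100.0, 100.0, 100.0] },
        Row { name: "Chechen", low: [83.0, 77.0, 85.0, 86.0], high: [86.0, 86.0, 88.0, 86.0] },
        Row { name: "Kyrgyz", low: [54.0, 37.0, 37.0, 89.0], high: [58.0, 45.0, 41.0, 89.0] },
        Row { name: "Malayalam", low: [100.0, 100.0, 100.0, 100.0], high: [100.0, 100.0, 100.0, 100.0] },
        Row { name: "Nepali", low: [35.0, 13.0, 26.0, 66.0], high: [41.0, 21.0, 29.0, 72.0] },
        Row { name: "Pashto", low: [79.0, 63.0, 76.0, 97.0], high: [89.0, 7.0, 92.0, 99.0] },
        Row { name: "Sanskrit", low: [40.0, 19.0, 34.0, 67.0], high: [56.0, 37.0, 49.0, 82.0] },
        Row { name: "Sinhala", low: [100.0, 100.0, 100.0, 100.0], high: [100.0, 100.0, 100.0, 100.0] },
        Row { name: "Sindhi", low: [66.0, 49.0, 60.0, 89.0], high: [87.0, 73.0, 89.0, 98.0] },
        Row { name: "Tatar", low: [43.0, 21.0, 29.0, 80.0], high: [47.0, 26.0, 34.0, 80.0] },
        Row { name: "Tajik", low: [79.0, 65.0, 73.0, 98.0], high: [89.0, 81.0, 85.0, 99.0] },
        Row { name: "Turkmen", low: [28.0, 44.0, 16.0, 23.0], high: [30.0, 48.0, 17.0, 23.0] },
        Row { name: "Uzbek", low: [90.0, 82.0, 88.0, 99.0], high: [96.0, 92.0, 97.0, 99.0] },
        Row { name: "Lao", low: [99.0, 100.0, 100.0, 99.0], high: [99.0, 99.0, 100.0, 99.0] },
        Row { name: "Khmer", low: [100.0, 100.0, 100.0, 100.0], high: [100.0, 100.0, 100.0, 100.0] },
    ]
}
```

Calling `inconsistent_rows(&posted_table(), 1.0)` on the values as posted flags only the Pashto high-accuracy columns. If the mean assumption holds, the single-high-ac value of 7 for Pashto may be a transcription slip; the sketch only reports the mismatch and does not guess a corrected value.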

Norwegian Language Model Consideration

During development, we also found that we needed to consolidate the Norwegian language models. Lingua currently supports both Bokmål and Nynorsk. For our specific use case, however, a single Norwegian model proved more effective, so we replaced Bokmål with a more general Norwegian model in our branch.

This change raises a question for the Lingua project: would there be interest in adding a unified Norwegian model alongside the existing Bokmål and Nynorsk models, or would you prefer to keep only the two distinct written forms, as Lingua does now? We are happy to revert our Norwegian model to separate Bokmål and Nynorsk models to match your preference.

Here's the link to our branch: https://github.com/kareglazie/lingua-rs/tree/new-langs

pemistahl commented 6 months ago

Hi Svetlana,

Thank you for your effort to enhance my library with more languages. This is great. :) Can you please open a pull request? That makes it easier to review your changes and additions and to comment on them.

As for Norwegian, I prefer to treat Bokmål and Nynorsk separately because they are basically two different variants of written Norwegian. I want my library to be able to differentiate between them.
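
For use cases that only need a single Norwegian label, one option that keeps both models in the library is to merge the two labels after detection. The following is a minimal sketch of that idea, not the crate's API: `Label`, `Coarse`, `coarsen`, and `merge_confidences` are hypothetical names, and the sketch assumes the detector returns relative confidence scores that sum to 1, so that adding the Bokmål and Nynorsk scores is meaningful.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Hypothetical stand-in for the fine-grained labels a detector returns.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Label {
    Bokmal,
    Nynorsk,
    Danish,
    Swedish,
}

/// Hypothetical coarse labels used by the downstream application.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Coarse {
    Norwegian,
    Danish,
    Swedish,
}

/// Maps both written forms of Norwegian to one coarse label.
fn coarsen(label: Label) -> Coarse {
    match label {
        Label::Bokmal | Label::Nynorsk => Coarse::Norwegian,
        Label::Danish => Coarse::Danish,
        Label::Swedish => Coarse::Swedish,
    }
}

/// Sums the scores of labels that share a coarse label and returns the
/// result sorted by descending score. Assumes the scores sum to 1.
fn merge_confidences(scores: &[(Label, f64)]) -> Vec<(Coarse, f64)> {
    let mut merged: HashMap<Coarse, f64> = HashMap::new();
    for &(label, score) in scores {
        *merged.entry(coarsen(label)).or_insert(0.0) += score;
    }
    let mut out: Vec<(Coarse, f64)> = merged.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    out
}
```

With this approach the library can still differentiate between the two variants, while an application that does not care about the distinction sees one combined Norwegian score.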

kareglazie commented 6 months ago

Hello! Thanks for your reply. I've opened the PR and removed the general Norwegian model, so there are now two separate variants again, as in your original crate.