ivanakcheurov / ntextcat

MIT License

How to use your library? #4

Open it19862 opened 5 years ago

it19862 commented 5 years ago

Could you give a small example of using your library?

Environment: Windows 7 x64, Visual Studio 2017.

I installed "ntextcat" through NuGet. I need to determine the language of the text entered in "textBox2.Text" and write the result to "textBox1.Text". The input may be in European languages, in languages written with hieroglyphs (Chinese, Japanese), and in others.

I found some sample code, but I get an error on this line: var identifier = factory.Load("NTextCat 0.2.1.1\\LanguageModels\\Core14.profile.xml");

Code:

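using System;
using System.Linq;
using System.Windows.Forms;
using NTextCat;

namespace rsh
{
    public partial class Form2 : Form
    {
        public Form2()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            var factory = new RankedLanguageIdentifierFactory();
            // The path is relative to the process's working directory.
            var identifier = factory.Load("NTextCat 0.2.1.1\\LanguageModels\\Core14.profile.xml");
            // Candidates come back ranked, most likely language first.
            var languages = identifier.Identify(textBox2.Text);
            var mostCertainLanguage = languages.FirstOrDefault();

            // FirstOrDefault returns null when no candidate is returned.
            textBox1.Text = mostCertainLanguage != null
                ? mostCertainLanguage.Item1.Iso639_3
                : "unknown";
        }
    }
}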

How to solve the problem?

(Screenshot attached: 2018-10-14_18-48-10)

mohammad-khoddami commented 4 years ago

How can I have text in an unsupported language detected as unknown rather than as another language? For example, "Aţi văzut ce moacă a făcut?" is Romanian, but NTextCat detects it as English.

ivanakcheurov commented 4 years ago

I don't understand the problem from the description. If your code works correctly, then textBox1.Text would contain the language code (for example, eng for English). Perhaps you get an error and could post a screenshot of it?

ivanakcheurov commented 4 years ago

@mohammad-khoddami, you can assess how confident NTextCat is in the language tag by reading the score returned alongside it:

var factory = new RankedLanguageIdentifierFactory();
var identifier = factory.Load("Core14.profile.xml");
var languages = identifier.Identify("some text");
var mostCertainLanguage = languages.FirstOrDefault();

var languageCode = mostCertainLanguage.Item1.Iso639_3;
var confidenceLevel = mostCertainLanguage.Item2;
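
One way to turn that score into an "unknown" answer is sketched below. It is only a sketch under stated assumptions: the score is treated as a distance (lower means a closer match), which fits the values reported in this thread but is not confirmed here; PickOrUnknown, maxScore and minMargin are hypothetical names that are not part of NTextCat; and both cutoffs would have to be tuned on your own texts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LanguageGuess
{
    // Hypothetical helper, not part of NTextCat.
    // Assumes each score is a distance: lower means a closer match.
    public static string PickOrUnknown(
        IEnumerable<Tuple<string, double>> ranked,
        double maxScore,   // hypothetical cutoff; tune on your own data
        double minMargin)  // hypothetical minimum gap to the runner-up
    {
        var ordered = ranked.OrderBy(r => r.Item2).ToList();
        if (ordered.Count == 0) return "unknown";

        var best = ordered[0];
        // Reject a best candidate that is itself a poor match.
        if (best.Item2 > maxScore) return "unknown";

        // Reject a best candidate that barely beats the runner-up.
        // With a single candidate there is nothing to compare against.
        if (ordered.Count > 1 && ordered[1].Item2 - best.Item2 < minMargin)
            return "unknown";

        return best.Item1;
    }
}
```

To feed it from the snippet above, the result of identifier.Identify(...) can be projected with Select(t => Tuple.Create(t.Item1.Iso639_3, t.Item2)). Checking the margin to the runner-up as well as the absolute score is a design choice: a language missing from the profile tends to score about as badly against several candidates at once.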

diegosasw commented 2 years ago

How is the confidence level measured? I get values like 3495.569 for a long Spanish text that is detected correctly.

But I get values like 3924.144 for text in Czech, which is incorrectly detected as English:

Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu.

or 3928.28 for text in Bulgarian, which is incorrectly detected as Russian:

Ах чудна българска земьо, полюшвай цъфтящи жита.

I suppose the models are not too accurate?


I've tried with Wiki82.profile.xml and Wiki280.profile.xml and I get better results with Wiki82.profile.xml because with Wiki280.profile.xml the texts are often detected as aa.

One thing I've noticed is that the detected language ISO code is not always correct. With Core14.profile.xml I get a proper 3-letter code in mostCertainLanguage.Item1.Iso639_3, but with Wiki82.profile.xml or Wiki280.profile.xml the same property holds a 2-letter code, which is not a valid ISO 639-3 value.

andreyka26-git commented 10 months ago

@ivanakcheurov

Hello, thank you very much for your work.

May I ask about the profiles as well?

  1. As was asked above, what do the weight numbers mean? As I understand it, the closer they are to 4000, the less accurate the result is, but where is the cutoff for treating a result as accurate: 3700, 3500?

  2. I'm using Wiki82.profile.xml, and sometimes I get "simple" or "new" as the detected language for pure English text. What do they mean?

diegosasw commented 9 months ago

I suppose this library is abandoned. Any luck, @andreyka26-git?