sergey-tihon / Stanford.NLP.NET

Stanford NLP for .NET
http://sergey-tihon.github.io/Stanford.NLP.NET/
MIT License

Sample code for using ArabicTokenizer #18

Closed: alismart closed this issue 3 years ago

alismart commented 9 years ago

Sergey, I did my best to understand how to use the ArabicTokenizer; you can see my attempt in the following code. I hope you can check it and tell me whether this is the best way to use it. I am also trying to set the parameters through the main method, but that doesn't seem to work at all: for example, it removes neither the diacritics nor the tatweel.

        ArabicTokenizer.main(new string[] { "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker", "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping" });
        string s = textBox2.Text;
        java.io.StringReader sr = new StringReader(s);
        ArabicTokenizer tokenizer = new ArabicTokenizer(sr, new edu.stanford.nlp.process.WordTokenFactory(), new java.util.Properties());

        java.util.List al = tokenizer.tokenize();
        int size = al.size();
        string container = "";
        for (int i = 0; i < size; i++)
        {
            Word w = (Word)al.get(i);
            container = container + " ^ " + w.word();
        }
        textBox1.Text = container;

sergey-tihon commented 9 years ago

As I understand it, you need the Stanford Word Segmenter, which is designed for tokenizing Arabic and Chinese. C#/F# samples are available for Stanford Word Segmenter for .NET. Is that what you are trying to do?

alismart commented 9 years ago

I tried the Stanford Word Segmenter. As its name implies, it divides raw text into segments (sentences) using a trained model, which, as I noticed, needs a lot of RAM and CPU because it relies on machine learning.

Later on, I noticed that Stanford includes another tool called the Arabic Tokenizer, which is a straightforward algorithm for dividing raw Arabic text into tokens (words). So I think what I need is the tokenizer (to words) rather than the segmenter (to sentences).

My question is: do you have any idea how to use the tokenizer, and especially how to set its parameters to vary its behavior?

sergey-tihon commented 9 years ago

Please try the following sample:

using System;
using edu.stanford.nlp.international.arabic.process;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.util;
using java.io;

namespace CoreNLPArabic
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "جامعة الدول العربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا";

            var parameters =
                new[]
                {
                    "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker",
                    "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping"
                };
            var tokenizerOptions = StringUtils.argsToProperties(parameters);
            var tf = tokenizerOptions.containsKey("atb")
                ? ArabicTokenizer.atbFactory()
                : ArabicTokenizer.factory();

            foreach (String option in tokenizerOptions.stringPropertyNames().toArray())
            {
                tf.setOptions(option);
            }
            tf.setOptions("tokenizeNLs");

            int nLines = 0;
            int nTokens = 0;
            var tokenizer = tf.getTokenizer(new StringReader(s));
            var printSpace = false;
            const string NEWLINE_TOKEN = "*NL*";
            while (tokenizer.hasNext()) {
              ++nTokens;
              var next = tokenizer.next() as CoreLabel;
              String word = next.word();
              if (word.Equals(NEWLINE_TOKEN)) {
                ++nLines;
                printSpace = false;
                System.Console.WriteLine();
              } else {
                if (printSpace) System.Console.Write(" ");
                System.Console.Write(word);
                printSpace = true;
              }
            }
            // C# uses {0}-style placeholders; the Java-style %d and %n would be printed literally.
            System.Console.WriteLine("\nDone! Tokenized {0} lines ({1} tokens)", nLines, nTokens);

        }
    }
}

alismart commented 9 years ago

Unfortunately, it didn't work. The tokenizer evidently does not recognize any of the options provided in the code. In the attached screenshot of my attempt, the words in red are what is expected once the options are taken into account.

Maybe something is still required to get everything working properly.

Given the following input:
جامعةُ الدُوَلِ العـــــــــربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا
the expected output is:
جامعة الدول العربية هي منظمة تضم دولا في الشرق الاوسط وافريقيا

I hope you can fix the code as soon as possible, because I need it for my graduation project. Thanks in advance.
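
For checking tokenizer results against the expected output quoted above, the normalization can be approximated in plain C# without the library. This is only a minimal sketch, not Stanford's implementation: ArabicNormalizationSketch and Normalize are hypothetical names, and the character ranges are an assumption about what removeDiacritics, removeTatweel and normAlif are meant to cover.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for comparison purposes; not part of Stanford.NLP.NET.
static class ArabicNormalizationSketch
{
    public static string Normalize(string text)
    {
        // Assumed diacritic range: tanween, short vowels, shadda, sukun (U+064B..U+0652).
        text = Regex.Replace(text, "[\u064B-\u0652]", "");
        // Tatweel (kashida) is the elongation character U+0640.
        text = text.Replace("\u0640", "");
        // Fold alif with madda or hamza (U+0622, U+0623, U+0625) to bare alif (U+0627).
        text = Regex.Replace(text, "[\u0622\u0623\u0625]", "\u0627");
        return text;
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Normalize("جامعةُ الدُوَلِ العـــــــــربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا"));
    }
}
```

Comparing this string with the tokens joined by spaces shows quickly whether the tokenizer options took effect. The real tokenizer may normalize more characters than this sketch does.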

sergey-tihon commented 9 years ago

@alismart Sorry, I have no ideas. Could you try the Java version? It would be good to know whether it works as you expect.

saidMoulay commented 9 years ago

In your program, replace the line "foreach (String option in tokenizerOptions.stringPropertyNames().toArray())" with "foreach (String option in parameters)". The output text will then be as you hoped, without tatweel.

saidMoulay commented 7 years ago

Fixed code; try it:

using System;
using edu.stanford.nlp.international.arabic.process;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.util;
using java.io;

namespace CoreNLPArabic
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "جامعة الدول العربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا";

            var parameters =
                new[]
                {
                    "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker",
                    "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping"
                };
            var tokenizerOptions = StringUtils.argsToProperties(parameters);
            var tf = tokenizerOptions.containsKey("atb")
                ? ArabicTokenizer.atbFactory()
                : ArabicTokenizer.factory();

            // Pass each option name from the parameters array directly to the factory.
            foreach (String option in parameters)
            {
                tf.setOptions(option);
            }
            tf.setOptions("tokenizeNLs");

            int nLines = 0;
            int nTokens = 0;
            var tokenizer = tf.getTokenizer(new StringReader(s));
            var printSpace = false;
            const string NEWLINE_TOKEN = "*NL*";
            while (tokenizer.hasNext()) {
              ++nTokens;
              var next = tokenizer.next() as CoreLabel;
              String word = next.word();
              if (word.Equals(NEWLINE_TOKEN)) {
                ++nLines;
                printSpace = false;
                System.Console.WriteLine();
              } else {
                if (printSpace) System.Console.Write(" ");
                System.Console.Write(word);
                printSpace = true;
              }
            }
            // C# uses {0}-style placeholders; the Java-style %d and %n would be printed literally.
            System.Console.WriteLine("\nDone! Tokenized {0} lines ({1} tokens)", nLines, nTokens);

        }
    }
}