sergey-tihon / Stanford.NLP.NET

Stanford NLP for .NET
http://sergey-tihon.github.io/Stanford.NLP.NET/
MIT License

POS-Tagging in other languages #106

Closed Rand0m-Guy closed 4 years ago

Rand0m-Guy commented 4 years ago

While checking the documentation, I noticed that the .NET port of the POS tagger only supports a couple of languages. I need it for an unsupported language (Spanish, specifically). For some weeks now I have been trying to use a DLL built from the language jar file supplied by Stanford CoreNLP, but it does not seem to work. When I call var tagger = new MaxentTagger(@"C:/Users/myUser/Desktop/UnityProyect/Assets/Plugins/spanish-ud.tagger"); it throws

IOException: Unable to open "C:/Users/myUser/Desktop/UnityProyect/Assets/Plugins/spanish-ud.tagger" as class path, filename or URL
edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem (System.String textFileOrUrl) (at <3882360035dc4dd4a6bff58e8ebe2d3f>:0)
edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit (java.util.Properties config, System.String modelFileOrUrl, System.Boolean printLoading) (at <3882360035dc4dd4a6bff58e8ebe2d3f>:0)
Rethrow as RuntimeIOException: Error while loading a tagger model (probably missing model file)
edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit (java.util.Properties config, System.String modelFileOrUrl, System.Boolean printLoading) (at <3882360035dc4dd4a6bff58e8ebe2d3f>:0)
edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor (System.String modelFile, java.util.Properties config, System.Boolean printLoading) (at <3882360035dc4dd4a6bff58e8ebe2d3f>:0)
edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor (System.String modelFile) (at <3882360035dc4dd4a6bff58e8ebe2d3f>:0)
Test.POS (System.String text) (at Assets/Scripts/Test.cs:69)
Test.Start () (at Assets/Scripts/Test.cs:34)

As an experiment, I also pointed the path at the original jar file, but it threw the same IOException. I also tried "props.setProperty(...)", with the same error.

How should I use a Stanford CoreNLP language model with the .NET version?

sergey-tihon commented 4 years ago
  1. Which NuGet package do you use, Stanford.NLP.CoreNLP or Stanford.NLP.POSTagger? (And which version?)
  2. Do you have the file C:/Users/myUser/Desktop/UnityProyect/Assets/Plugins/spanish-ud.tagger on disk?
  3. Where did you download this file from?
  4. Please share your code snippet.

Rand0m-Guy commented 4 years ago

1.- I'm using Stanford.NLP.CoreNLP (3.9.2) downloaded from NuGet, which means that in the Unity Assets folder I'm referencing these Stanford DLLs: (screenshot)

And these IKVM DLLs: (screenshot)

2.- Technically speaking... no. (Let me explain.) That was the path I used with the original Stanford CoreNLP in Java. What I have is the Spanish model provided by Stanford. I tried referencing it directly (maybe I did it the wrong way), and I also converted it to a DLL via IKVM.

3.- NuGet

4.- Sure:

```csharp
using System.Collections;
using System.Collections.Generic;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.pipeline;
using edu.stanford.nlp.tagger.maxent;
using java.io;
using java.util;
using UnityEngine;
using System;
using System.IO;

public class Test : MonoBehaviour
{

    StanfordCoreNLP scnlp;
    Properties props;
    string propname = "tokenize, ssplit, pos, lemma, ner";

    void Start()
    {
        props = new Properties();
        props.setProperty("annotators", propname);

        scnlp = new StanfordCoreNLP(props);

        string text = "Hola, esto es un texto en español.";    //Spanish text
        POS(text);
    }

    void Update()
    {

    }

    public void POS(string text) {
        var tagger = new MaxentTagger(@"C:/Users/myUser/Desktop/UnityProyect/Assets/Plugins/spanish-ud.tagger");
        var sentences = MaxentTagger.tokenizeText(new java.io.StringReader(text)).toArray();
        foreach (java.util.ArrayList sentence in sentences)
        {
            var taggedSentence = tagger.tagSentence(sentence);
            Debug.Log(SentenceUtils.listToString(taggedSentence, false));
        }
    }
}
```

sergey-tihon commented 4 years ago

Option 1

If you want to use var tagger = new MaxentTagger(@"C:/Users/myUser/Desktop/UnityProyect/Assets/Plugins/spanish-ud.tagger"); then the file C:/Users/myUser/Desktop/UnityProyect/Assets/Plugins/spanish-ud.tagger must physically exist at that path.

Just unzip the Java *.jar that contains the Spanish model and pass the full path to the spanish-ud.tagger file. In this flow you also do not need the lines

        props = new Properties();
        props.setProperty("annotators", propname);

        scnlp = new StanfordCoreNLP(props);

because you do not use scnlp.
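
If you would rather do the unzip step from code, here is a minimal sketch using only the standard System.IO.Compression API. The class and method names (ModelExtractor, ExtractModel) are made up for illustration and are not part of Stanford.NLP.NET. It searches the jar by file name instead of assuming where the model sits inside it, and on .NET Framework or Unity it may need a reference to the System.IO.Compression.FileSystem assembly.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper, not part of Stanford.NLP.NET.
public static class ModelExtractor
{
    // Copies the first jar entry whose file name equals modelFileName
    // into targetDir and returns the full path of the extracted file.
    public static string ExtractModel(string jarPath, string modelFileName, string targetDir)
    {
        Directory.CreateDirectory(targetDir);

        // A jar is a zip archive, so ZipFile can read it directly.
        using (ZipArchive jar = ZipFile.OpenRead(jarPath))
        {
            // Entry.Name is the file name without the directory part.
            ZipArchiveEntry entry = jar.Entries.FirstOrDefault(e => e.Name == modelFileName);
            if (entry == null)
            {
                throw new FileNotFoundException(
                    "No entry named " + modelFileName + " in " + jarPath);
            }

            string targetPath = Path.Combine(targetDir, modelFileName);
            entry.ExtractToFile(targetPath, true); // true = overwrite an existing file
            return Path.GetFullPath(targetPath);
        }
    }
}
```

The returned path is what you would then hand to the tagger, e.g. new MaxentTagger(ModelExtractor.ExtractModel(jarPath, "spanish-ud.tagger", pluginsDir)), where jarPath and pluginsDir are whatever locations you use in your own project.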

Option 2

You can go with StanfordCoreNLP and build an annotator pipeline.

https://github.com/sergey-tihon/Stanford.NLP.NET/blob/v4/tests/Stanford.NLP.CoreNLP.Tests/CoreNlpTests.cs#L73-L93

Then process your text and receive an Annotation with all the metadata extracted by the annotators in your pipeline.

https://github.com/sergey-tihon/Stanford.NLP.NET/blob/v4/tests/Stanford.NLP.CoreNLP.Tests/CoreNlpTests.cs#L95-L97

Here is an example of how you can manually extract data from different kinds of annotations:

https://github.com/sergey-tihon/Stanford.NLP.NET/blob/v4/tests/Stanford.NLP.CoreNLP.Tests/CoreNlpTests.cs#L106-L151
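
For the Spanish case, the three steps above might look roughly like the sketch below. It is untested and rests on a few assumptions: that modelPath points to a spanish-ud.tagger file already extracted to disk, that the standard CoreNLP properties pos.model and tokenize.language behave in the .NET build as they do in Java, and that the annotation lists come back as java.util.ArrayList. The class name SpanishPipelineSketch is made up for illustration.

```csharp
using System;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.pipeline;
using edu.stanford.nlp.util;
using java.util;

// Hypothetical example class, not part of Stanford.NLP.NET.
public static class SpanishPipelineSketch
{
    public static void PrintTags(string text, string modelPath)
    {
        // 1. Build the pipeline. Only the annotators needed for POS tags are listed.
        var props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        props.setProperty("tokenize.language", "es"); // assumption: Spanish tokenizer rules
        props.setProperty("pos.model", modelPath);    // assumption: extracted model on disk
        var pipeline = new StanfordCoreNLP(props);

        // 2. Process the text.
        var annotation = new Annotation(text);
        pipeline.annotate(annotation);

        // 3. Walk sentences and tokens, reading the word and its POS tag.
        var sentences = annotation.get(
            new CoreAnnotations.SentencesAnnotation().getClass()) as ArrayList;
        foreach (CoreMap sentence in sentences)
        {
            var tokens = sentence.get(
                new CoreAnnotations.TokensAnnotation().getClass()) as ArrayList;
            foreach (CoreLabel token in tokens)
            {
                var word = token.get(new CoreAnnotations.TextAnnotation().getClass());
                var pos = token.get(new CoreAnnotations.PartOfSpeechAnnotation().getClass());
                Console.WriteLine(word + "/" + pos);
            }
        }
    }
}
```

In Unity you would likely swap Console.WriteLine for Debug.Log. Compared with Option 1, this keeps tokenizing and tagging inside one pipeline, so adding lemma or ner later is mostly a matter of extending the annotators property and supplying the models those annotators need.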

Rand0m-Guy commented 4 years ago

You are honestly the best. Thank you so so much!