sergey-tihon / Stanford.NLP.NET

Stanford NLP for .NET
http://sergey-tihon.github.io/Stanford.NLP.NET/
MIT License

Tests for other Languages #126

Closed by GeorgeS2019 6 months ago

GeorgeS2019 commented 1 year ago

I am currently going through the parser using the version 4.5.1 model provided for German.

//ParserTests.cs
[Test]
public void ParseEasySentence()
{
     //All steps prior to this work!

     var gs = gsf.newGrammaticalStructure(parse);
}

This throws:

java.lang.IllegalArgumentException: 'No head rule defined for NUR using class edu.stanford.nlp.trees.UniversalSemanticHeadFinder in (NUR
  (S (PROPN Christian) (AUX ist)
    (NP (PRON mein) (NOUN Freund)))
  (PUNCT .))

Potentially relevant issue: No head rule defined for IP using class edu.stanford.nlp.trees.SemanticHeadFinder
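
One plausible reading of the exception above is that the head finder keeps a table of rules keyed by parent category and fails when it meets a category it has no entry for (here `NUR`, the root label in the German tree). The sketch below is a toy model of that behaviour, not CoreNLP code: the class `HeadRuleSketch`, its rule table, and the leftmost-child fallback are all hypothetical and only illustrate the failure mode.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeadRuleSketch {
    // Hypothetical rule table: parent category -> child categories to try, in priority order.
    private final Map<String, List<String>> rules = new HashMap<>();

    public HeadRuleSketch() {
        rules.put("S", List.of("AUX", "VERB", "NP"));
        rules.put("NP", List.of("NOUN", "PROPN", "PRON"));
        // Deliberately no entry for "NUR", mirroring the missing rule in the report.
    }

    /** Returns the category of the head child, or throws if the parent has no rule. */
    public String findHead(String parent, List<String> children) {
        List<String> priorities = rules.get(parent);
        if (priorities == null) {
            throw new IllegalArgumentException("No head rule defined for " + parent);
        }
        for (String candidate : priorities) {
            if (children.contains(candidate)) {
                return candidate;
            }
        }
        return children.get(0); // assumed fallback: leftmost child
    }

    public static void main(String[] args) {
        HeadRuleSketch finder = new HeadRuleSketch();
        System.out.println(finder.findHead("NP", List.of("PRON", "NOUN"))); // prints "NOUN"
        try {
            finder.findHead("NUR", List.of("S", "PUNCT"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "No head rule defined for NUR"
        }
    }
}
```

If this reading is right, the fix is either a head finder whose rule table covers the German tag set or a default rule for unknown categories. Which of the two applies here is not stated in this thread.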

GeorgeS2019 commented 1 year ago

https://github.com/stanfordnlp/CoreNLP/issues/1227

I know this is not in scope, but it would be great if you could get German working, e.g. using the following example.

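import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;

public class TestSatzErkennung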
{

    public static String text = "Marie was born in Paris. Marie wurde in Paris geboren.";

    public static void main(String[] args) 
    {
        // set up pipeline properties
        Properties props = new Properties();
        // set the list of annotators to run
//      props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");//"tokenize,ssplit,pos,lemma");
//      props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/german-ud.tagger");
//      props.setProperty("tokenize.language", "German");
//      props.setProperty("ner.model", "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz");

        props.setProperty("annotators", "tokenize, ssplit, mwt, pos, ner, depparse");
        props.setProperty("tokenize.language" , "de");
        props.setProperty("tokenize.postProcessor" , "edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor");

        props.setProperty("mwt.mappingFile" , "edu/stanford/nlp/models/mwt/german/german-mwt.tsv");

        props.setProperty("pos.model" , "edu/stanford/nlp/models/pos-tagger/german-ud.tagger");

        props.setProperty("ner.model" , "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz");
        props.setProperty("ner.applyNumericClassifiers" , "false");
        props.setProperty("ner.applyFineGrained" , "false");
        props.setProperty("ner.useSUTime" , "false");

        props.setProperty("parse.model" , "edu/stanford/nlp/models/srparser/germanSR.beam.ser.gz");
        props.setProperty("depparse.model" , "edu/stanford/nlp/models/parser/nndep/UD_German.gz");
        // build pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // create a document object
        CoreDocument document = pipeline.processToCoreDocument(text);

        for(CoreSentence sentence : document.sentences())
        {
            System.out.println(sentence);

            // display tokens
            for (CoreLabel tok : sentence.tokens()) 
            {
                System.out.println(String.format("%s\t%s\t%s\t%s\t%b", tok.word(), tok.lemma(), tok.tag(), tok.ner(), tok.isMWT()));
            }

            for(SemanticGraphEdge s : sentence.dependencyParse().edgeIterable())
            {
                System.out.println(s);
            }
        }
    }
}
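
The example sets `parse.model` although `parse` is not in its annotator list, so that setting is probably never read. The sketch below uses only `java.util.Properties` (no CoreNLP classes) to list such keys. It assumes a naming convention in which a `<name>.model` key belongs to the annotator called `<name>`. The class `GermanPropsSketch` and its helper methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class GermanPropsSketch {
    /** A subset of the settings from the example above. */
    public static Properties germanProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, mwt, pos, ner, depparse");
        props.setProperty("tokenize.language", "de");
        props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/german-ud.tagger");
        props.setProperty("ner.model", "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz");
        props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/germanSR.beam.ser.gz");
        props.setProperty("depparse.model", "edu/stanford/nlp/models/parser/nndep/UD_German.gz");
        return props;
    }

    /** Splits the comma-separated annotator list and trims each name. */
    public static List<String> annotators(Properties props) {
        return Arrays.stream(props.getProperty("annotators", "").split(","))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toList());
    }

    /** Returns "<name>.model" keys whose <name> is not an active annotator (assumed convention). */
    public static List<String> modelKeysWithoutAnnotator(Properties props) {
        List<String> active = annotators(props);
        return new TreeSet<>(props.stringPropertyNames()).stream()
                .filter(key -> key.endsWith(".model"))
                .filter(key -> !active.contains(key.substring(0, key.length() - ".model".length())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(modelKeysWithoutAnnotator(germanProps())); // prints "[parse.model]"
    }
}
```

A check like this could run in a test before the pipeline is built, so that a misconfigured properties set is caught without loading any models.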

sergey-tihon commented 1 year ago

I am happy to merge tests that check that German works as expected, especially if you have a working sample.

GeorgeS2019 commented 1 year ago

@sergey-tihon Good to hear that. From searching the internet, most user complaints concern German, especially dependency parsing, which is the most critical feature because OpenNLP has nothing comparable. German is most likely the least tested language, so it is good to have you take a second look :-)

GeorgeS2019 commented 6 months ago

This is solved now