sergey-tihon / Stanford.NLP.NET

Stanford NLP for .NET
http://sergey-tihon.github.io/Stanford.NLP.NET/
MIT License
595 stars, 123 forks

Stanford NLP stuck when using large text #90

Closed Anupam750 closed 2 years ago

Anupam750 commented 5 years ago

Stanford NLP gets stuck when I pass a large text to the segmenter. Please check it.

sergey-tihon commented 5 years ago

Do you mean Stanford.NLP.Segmenter?

Update: https://stackoverflow.com/questions/9492707/how-can-i-split-a-text-into-sentences-using-the-stanford-parser

Anupam750 commented 5 years ago

Yes, Stanford.NLP.Segmenter. I will check it out and let you know.

Anupam750 commented 5 years ago

The code above gives a compile-time error because it is written in Java. Is there a C# version?

sergey-tihon commented 5 years ago

@Anupam750 not yet, sorry. But it should not be hard to convert the Java sample to C#.

Anupam750 commented 5 years ago

I wrote the code in C#, but it gives an error that is related to the NLP DLL.

sergey-tihon commented 5 years ago

I cannot help you without sample code and a stack trace. You can try the C# sample from the same SO answer:

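using System.IO;
using edu.stanford.nlp.process;   // PTBTokenizer, DocumentPreprocessor, TokenizerFactory, CoreLabelTokenFactory
using ikvm.io;                    // InputStreamWrapper
// Assumption: UTF8Reader is the Xerces reader shipped with IKVM's OpenJDK XML assemblies; adjust if yours lives elsewhere.
using com.sun.org.apache.xerces.@internal.impl.io;

public class NlpDemo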
{
    public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
                "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true");

    public void ParseFile(string fileName)
    {
        using (var stream = File.OpenRead(fileName))
        {
            SplitSentences(stream);
        }
    }

    public void SplitSentences(Stream stream)
    {            
        var preProcessor = new DocumentPreprocessor(new UTF8Reader(new InputStreamWrapper(stream)));
        preProcessor.setTokenizerFactory(TokenizerFactory);

        foreach (java.util.List sentence in preProcessor)
        {
            ProcessSentence(sentence);
        }            
    }

    // print the sentence with original spaces and punctuation.
    public void ProcessSentence(java.util.List sentence)
    {
        System.Console.WriteLine(edu.stanford.nlp.util.StringUtils.joinWithOriginalWhiteSpace(sentence));
    }
}

Alternatively, use the CoreNLP package and extract the list of sentences from the annotation: https://github.com/sergey-tihon/Stanford.NLP.NET/blob/master/samples/Stanford.NLP.CoreNLP.CSharp/StanfordCoreNlpClient.cs#L35-L36

I have no idea how large your text is, though.
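
For reference, the CoreNLP route mentioned above could look roughly like the sketch below. It is an untested outline, not the linked sample itself: the class name `SentenceSplitDemo` and the plain-text input path taken from `args[0]` are made up for illustration, and it assumes the Stanford.NLP.CoreNLP package is referenced. Only the `tokenize` and `ssplit` annotators are enabled, which should keep the pipeline from loading the heavier parser or NER models.

```csharp
using System;
using System.IO;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.pipeline;
using edu.stanford.nlp.util;

public static class SentenceSplitDemo   // hypothetical name, for illustration only
{
    public static void Main(string[] args)
    {
        // Assumption: args[0] points to text already extracted from the PDF.
        var text = File.ReadAllText(args[0]);

        // Restrict the pipeline to tokenization and sentence splitting.
        var props = new java.util.Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        var pipeline = new StanfordCoreNLP(props);

        var annotation = new Annotation(text);
        pipeline.annotate(annotation);

        // SentencesAnnotation maps to a java.util.List with one CoreMap per sentence.
        var sentences = (java.util.List)annotation.get(
            new CoreAnnotations.SentencesAnnotation().getClass());

        foreach (CoreMap sentence in sentences)
        {
            // CoreMap.ToString() prints the sentence text.
            Console.WriteLine(sentence);
        }
    }
}
```

Note that this variant holds the whole text and its annotation in memory at once, so for a very large input the stream-based `DocumentPreprocessor` sample may be the gentler option.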

Anupam750 commented 5 years ago

OK, I will try it out and let you know.

FYI, my text is a PDF file of around 300 pages.

abhi18av commented 5 years ago

Hello everyone,

If this issue is resolved, could you please post the solution here and explain how you resolved it?

sergey-tihon commented 2 years ago

Closing as an old issue.