stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0
9.68k stars 2.7k forks source link

boundaryMultiTokenRegex: regex recognises a wrong group #1078

Open rssdev10 opened 4 years ago

rssdev10 commented 4 years ago

Hello, I'm trying to split a list of sentences without a proper punctuation into separate sentences. And looks, an expression written in boundaryMultiTokenRegex is not working as expected. The key aspect here, I cannot split all the text by new line in all the cases because whole text might be with multi line sentences. But list items are with a key indicator of the start.

    @Test
    public void testTokenizeNLsInList() {
        String text = "This is a list:\n" +
                "- String one\n" +
                "- String two;\n" +
                "- String three";

        Properties props = PropertiesUtils.asProperties(
                "annotators", "tokenize, ssplit",
                "tokenize.options", "tokenizeNLs",
                "ssplit.boundaryMultiTokenRegex", "(/\\n|\\*NL\\*/) /[^[\\p{Alnum}'\"`!?.,]]/ /\\p{Lu}\\p{L}+/"
        );

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document1 = new Annotation(text);
        pipeline.annotate(document1);
        List<CoreMap> sentences = document1.get(CoreAnnotations.SentencesAnnotation.class);
        assertEquals(4, sentences.size());

        // make sure that there are the correct # of tokens
        // (does NOT contain NL tokens)
        List<CoreLabel> tokens = document1.get(CoreAnnotations.TokensAnnotation.class);
        assertEquals(4, tokens.size());
    }

Expected behavior is to get 4 sentences:

This is a list:
- String one
- String two;
- String three

Real behavior:

This is a list:\n- String,
one\n- String
two;\n- String
three

Additional findings: I checked what is happening inside https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/process/WordToSentenceProcessor.java#L282

As a result here, I'm getting 3 tokens result instead of the first one mentioned in the pattern.

        List<? super IN> nodes = matcher.groupNodes();
        if (nodes != null && ! nodes.isEmpty()) {
          if (DEBUG) { log.info("    found match at: " + nodes); }
          isSentenceBoundary.put(nodes.get(nodes.size() - 1), true);
        }

Also, in the debugger, I see matcher with two groups found with all the tokens from the pattern, and with the correct first token only. But as the correct group is the second one, the final result is wrong. Screenshot 2020-08-01 at 17 04 03

Screenshot 2020-08-01 at 16 52 42

AngledLuffa commented 4 years ago

As far as I can tell, what you are trying to do is not the intended use of boundaryMultiTokenRegex. The documentation says

"The matched tokens will be treated as not part of the following sentence."

My understanding is, this means the matched tokens will be part of the first sentence. In other words, the behavior you are seeing is exactly the expected behavior.

What you could possibly do is add functionality to the WordToSentenceProcessor which includes a lookahead tokenregex. Alternatively, you could add a new annotator between ssplit and the other annotators which rearranges the sentences as needed. Although I might be missing something, I don't see any functionality that does exactly what you need right now.

rssdev10 commented 4 years ago

Ok, thanks, in that case definitely, I will think how to do a workaround.

I thought about the lookahead functionality. But my concern here is the mentioned fragment:

        List<? super IN> nodes = matcher.groupNodes();
        if (nodes != null && ! nodes.isEmpty()) {
          if (DEBUG) { log.info("    found match at: " + nodes); }
          isSentenceBoundary.put(nodes.get(nodes.size() - 1), true);
        }

Not sure that use of matcher.groupNodes() is correct. https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#cg says: Group zero always stands for the entire expression.

A common understanding of groups in regex is like

java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("(b) c d");
java.util.regex.Matcher matcher = pattern.matcher("a b c d e");
java.util.regex.MatchResult result = matcher.toMatchResult();
if (matcher.find()) {
    System.out.println(matcher.group(1));
}

with b as the only result because (b) only was mentioned as a group.

See also: https://rubular.com/r/PGDnmY3IcAua0j

So, in that understanding of the group for matching, the lookahead tokenregex operation is not required.

rssdev10 commented 4 years ago

One of simple ways how to keep full compatibility with previous expressions - add some specific name of the group to do a line break. E.g. "ssplit.boundaryMultiTokenRegex", "(?$LN_BREAK /\\n|\\*NL\\*/) /[^[\\p{Alnum}'\"!?.,]]/ /\\p{Lu}\\p{L}+/", where $LN_BREAK is a the last removing group after which a break line happening. If there is no a group with that name, keep logic as now.

Any suggestions about it?

AngledLuffa commented 4 years ago

I was thinking that adding a new field would be a lot easier than trying to add more functionality to the existing boundaryMultiTokenRegex. Possibly two new fields, so we can specify which pieces go to the previous sentence and which go to the next sentence.