Open rssdev10 opened 4 years ago
As far as I can tell, what you are trying to do is not the intended use of boundaryMultiTokenRegex. The documentation says
"The matched tokens will be treated as not part of the following sentence."
My understanding is, this means the matched tokens will be part of the first sentence. In other words, the behavior you are seeing is exactly the expected behavior.
What you could possibly do is add functionality to the WordToSentenceProcessor which includes a lookahead tokenregex. Alternatively, you could add a new annotator between ssplit and the other annotators which rearranges the sentences as needed. Although I might be missing something, I don't see any functionality that does exactly what you need right now.
Ok, thanks, in that case definitely, I will think how to do a workaround.
I thought about the lookahead functionality. But my concern here is the mentioned fragment:
List<? super IN> nodes = matcher.groupNodes();
if (nodes != null && ! nodes.isEmpty()) {
if (DEBUG) { log.info(" found match at: " + nodes); }
isSentenceBoundary.put(nodes.get(nodes.size() - 1), true);
}
Not sure that use of matcher.groupNodes()
is correct. https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#cg says: Group zero always stands for the entire expression
.
A common understanding of groups in regex is like
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("(b) c d");
java.util.regex.Matcher matcher = pattern.matcher("a b c d e");
java.util.regex.MatchResult result = matcher.toMatchResult();
if (matcher.find()) {
System.out.println(matcher.group(1));
}
with b
as the only result because (b)
only was mentioned as a group.
See also: https://rubular.com/r/PGDnmY3IcAua0j
So, in that understanding of the group for matching, the lookahead tokenregex operation is not required.
One of simple ways how to keep full compatibility with previous expressions - add some specific name of the group to do a line break. E.g. "ssplit.boundaryMultiTokenRegex", "(?$LN_BREAK /\\n|\\*NL\\*/) /[^[\\p{Alnum}'\"!?.,]]/ /\\p{Lu}\\p{L}+/"
, where $LN_BREAK
is a the last removing group after which a break line happening. If there is no a group with that name, keep logic as now.
Any suggestions about it?
I was thinking that adding a new field would be a lot easier than trying to add more functionality to the existing boundaryMultiTokenRegex. Possibly two new fields, so we can specify which pieces go to the previous sentence and which go to the next sentence.
Hello, I'm trying to split a list of sentences without a proper punctuation into separate sentences. And looks, an expression written in
boundaryMultiTokenRegex
is not working as expected. The key aspect here, I cannot split all the text by new line in all the cases because whole text might be with multi line sentences. But list items are with a key indicator of the start.Expected behavior is to get 4 sentences:
Real behavior:
Additional findings: I checked what is happening inside https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/process/WordToSentenceProcessor.java#L282
As a result here, I'm getting 3 tokens result instead of the first one mentioned in the pattern.
Also, in the debugger, I see
matcher
with two groups found with all the tokens from the pattern, and with the correct first token only. But as the correct group is the second one, the final result is wrong.