This is an error case worth keeping track of, because it breaks the sentence segmentation.
The document is not CC-BY, so it is only referenced here: https://dx.doi.org/10.1063/1.1874292
Here is the offending paragraph:
With version 0.8.0 and the current master, the process fails:
ERROR [2024-06-11 06:22:00,602] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! java.lang.StringIndexOutOfBoundsException: begin 592, end 595, length 594
! at java.base/java.lang.String.checkBoundsBeginEnd(String.java:4606)
! at java.base/java.lang.String.substring(String.java:2709)
! at org.grobid.core.document.TEIFormatter.segmentIntoSentences(TEIFormatter.java:1900)
! at org.grobid.core.document.TEIFormatter.toTEITextPiece(TEIFormatter.java:1468)
! at org.grobid.core.document.TEIFormatter.toTEIBody(TEIFormatter.java:1015)
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2648)
! ... 83 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.
First, the following call may crash when the sentence end offset goes past the text length (in the trace above: begin 592, end 595, length 594):
String local_text_chunk = text.substring(pos+posInSentence, theSentences.get(i).end);
Second, the if is skipped entirely in certain cases, so all the accumulated nodes are dropped. See the output below:
<div
xmlns="http://www.tei-c.org/ns/1.0">
<head>C. dc field dependence of R "T , B rf , B dc , f…</head>
<p>
<s>As mentioned in Ref.</s>
<s>31, properly annealed, bulk Nb TM-TE-mode cavities show large additional rf losses by frozen-in flux with, e.g., at 4.2 K and 2 GHz, R H Ӎ 2 ⍀ H dc / mT for RRRӍ 30, which is described in Eq. ͑3.9͒ by  Ӎ 1 and  Ͻ 10 for RRRտ 200.</s>
<s>Those large rf losses by the normal conducting cores of slow AF do not increase with rf field level.</s>
<s>,
<ref type="bibr" target="#b30">31</ref>
</s>
</p>
</div>
Both problems are in this code: https://github.com/kermitt2/grobid/blob/694f0ed055e8c9a5d5efdc314ebef78e5e2640cf/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java#L2028
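For the out-of-bounds substring, one possible guard is to clamp both offsets to the text length before slicing. The sketch below is only an illustration: the class and method names (SafeSubstring, clampedSubstring) are hypothetical and not part of GROBID, and clamping only avoids the exception. It does not explain why the sentence offsets and the text disagree in the first place.

```java
// Hypothetical helper, not GROBID code: shows a bounds-safe substring.
final class SafeSubstring {
    private SafeSubstring() {}

    /**
     * Returns text[start, end) after clamping both offsets into
     * [0, text.length()], so an end offset past the text length
     * no longer throws StringIndexOutOfBoundsException.
     */
    static String clampedSubstring(String text, int start, int end) {
        // Clamp the end first, then make sure start never exceeds it.
        int safeEnd = Math.max(0, Math.min(end, text.length()));
        int safeStart = Math.max(0, Math.min(start, safeEnd));
        return text.substring(safeStart, safeEnd);
    }
}
```

With the offsets from the reported failure (begin 592, end 595) and a text of length 594, clampedSubstring returns the last two characters instead of throwing.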