kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Sentence segmentation error case #1130

Open lfoppiano opened 3 months ago

lfoppiano commented 3 months ago

Recording this error case so it is not forgotten: it breaks the sentence segmentation. The document is not CC-BY, so it is only referenced here: https://dx.doi.org/10.1063/1.1874292

Here is the offending paragraph:

[image]

With version 0.8.0 and the current master, the process fails:

ERROR [2024-06-11 06:22:00,602] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! java.lang.StringIndexOutOfBoundsException: begin 592, end 595, length 594
! at java.base/java.lang.String.checkBoundsBeginEnd(String.java:4606)
! at java.base/java.lang.String.substring(String.java:2709)
! at org.grobid.core.document.TEIFormatter.segmentIntoSentences(TEIFormatter.java:1900)
! at org.grobid.core.document.TEIFormatter.toTEITextPiece(TEIFormatter.java:1468)
! at org.grobid.core.document.TEIFormatter.toTEIBody(TEIFormatter.java:1015)
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2648)
! ... 83 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.
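
For readers unfamiliar with the message format, the numbers in the exception can be reproduced in isolation. The following is a minimal sketch, not GROBID code: the class and method names are hypothetical, and the string is a dummy of the same length as the one in the log. It shows that `substring` accepts an end offset equal to the length but rejects anything beyond it.

```java
final class SubstringBounds {
    /** Hypothetical helper: reports whether text.substring(begin, end) throws. */
    static boolean substringThrows(String text, int begin, int end) {
        try {
            text.substring(begin, end);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Dummy text with the same length as reported in the log (594).
        String text = "x".repeat(594);
        // An end offset equal to the length is valid.
        System.out.println(substringThrows(text, 592, 594));
        // An end offset one past the length throws, as in the stack trace.
        System.out.println(substringThrows(text, 592, 595));
    }
}
```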

There are two problems (code https://github.com/kermitt2/grobid/blob/694f0ed055e8c9a5d5efdc314ebef78e5e2640cf/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java#L2028):

  1. String local_text_chunk = text.substring(pos+posInSentence, theSentences.get(i).end); may throw when the sentence end offset is beyond the text length (in the log above, end 595 for a text of length 594)
  2. The if is skipped entirely in certain cases, so all the accumulated nodes are dropped. See the output below:
<div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head>C. dc field dependence of R "T , B rf , B dc , f…</head>
                <p>
                    <s>As mentioned in Ref.</s>
                    <s>31, properly annealed, bulk Nb TM-TE-mode cavities show large additional rf losses by frozen-in flux with, e.g., at 4.2 K and 2 GHz, R H Ӎ 2 ⍀ H dc / mT for RRRӍ 30, which is described in Eq. ͑3.9͒ by ␤ Ӎ 1 and ␤ Ͻ 10 for RRRտ 200.</s>
                    <s>Those large rf losses by the normal conducting cores of slow AF do not increase with rf field level.</s>
                    <s>,
                        <ref type="bibr" target="#b30">31</ref>
                    </s>
                </p>
            </div>
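
One defensive way to address the first problem is to clamp both offsets before calling `substring`. The sketch below is an illustration only and is not taken from the GROBID code base or from any fix: `SentenceChunks` and `safeChunk` are hypothetical names, and the real code would pass `pos+posInSentence` and `theSentences.get(i).end` as the two offsets.

```java
final class SentenceChunks {
    /**
     * Hypothetical helper: returns text[begin, end) with both offsets
     * clamped to the valid range, so an end offset past the text length
     * yields a shorter chunk instead of an exception.
     */
    static String safeChunk(String text, int begin, int end) {
        int safeEnd = Math.max(0, Math.min(end, text.length()));
        int safeBegin = Math.max(0, Math.min(begin, safeEnd));
        return text.substring(safeBegin, safeEnd);
    }

    public static void main(String[] args) {
        String text = "abcdef";
        // End offset past the text length: the chunk is cut at the text end.
        System.out.println(safeChunk(text, 4, 9));
        // Both offsets past the text length: the chunk is empty.
        System.out.println(safeChunk(text, 7, 9));
    }
}
```

Clamping avoids the crash but hides the underlying mismatch between sentence offsets and text length, so logging the clamped cases would help to track down where the offsets diverge.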
lfoppiano commented 3 months ago

This should be fixed by #1131.