kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Create training - missing figure/table training data as compared with the upstream model (fulltext) #693

Closed lfoppiano closed 3 years ago

lfoppiano commented 3 years ago

I'm working on adding more training data. At the moment I'm focusing on fulltext and segmentation, but I'm targeting figure and table segmentation as well.

PR05514733-CC.pdf

In this case, although the fulltext model correctly recognises the figures, there is no training data for the figure model in the output.

I've started digging, and the problem seems to come from FulltextParser.processTrainingDataFigures: if there is only one figure (one I-<figure> label), the content seems to be simply ignored. I was going to try to fix it, but I'm afraid of breaking something else. I've created a draft PR #694.
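To make the labeling convention concrete: a row labeled I-<figure> opens a new figure block and rows labeled <figure> continue it. Below is a minimal sketch of grouping rows into blocks under that convention. `FigureBlockSplitter` and its methods are hypothetical names used for illustration only; this is not the code of processTrainingDataFigures, and I have not confirmed that a missing final flush is what happens there. It only shows one way a single block that runs to the end of the input could be lost.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not GROBID code): groups labeled rows into figure blocks. */
class FigureBlockSplitter {

    /** The label is the last tab-separated column of a row. */
    static String labelOf(String row) {
        int lastTab = row.lastIndexOf('\t');
        return lastTab < 0 ? row : row.substring(lastTab + 1);
    }

    /** The token is the first tab-separated column of a row. */
    static String tokenOf(String row) {
        int firstTab = row.indexOf('\t');
        return firstTab < 0 ? row : row.substring(0, firstTab);
    }

    /** Returns the tokens of each figure block, in document order. */
    static List<List<String>> split(String labeled) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = null;
        for (String row : labeled.split("\n")) {
            if (row.trim().isEmpty()) {
                continue;
            }
            String label = labelOf(row);
            if (label.equals("I-<figure>")) {
                // A new block starts: close the previous one, if any.
                if (current != null) {
                    blocks.add(current);
                }
                current = new ArrayList<>();
                current.add(tokenOf(row));
            } else if (label.equals("<figure>")) {
                // Continuation; tolerate a block that lacks the I- prefix.
                if (current == null) {
                    current = new ArrayList<>();
                }
                current.add(tokenOf(row));
            } else if (current != null) {
                // Any other label ends the open block.
                blocks.add(current);
                current = null;
            }
        }
        // Final flush: without it, a block that runs to the end of the input is lost.
        if (current != null) {
            blocks.add(current);
        }
        return blocks;
    }

    public static void main(String[] args) {
        String labeled = "The\tI-<paragraph>\nFIG\tI-<figure>\n.\t<figure>\n1\t<figure>";
        System.out.println(split(labeled));
    }
}
```

With a single I-<figure> row followed only by <figure> rows, the loop never meets a closing label, so the block is only emitted by the flush after the loop.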

Here is a test that tries to reproduce the problem:

        String text = "The mechanism for superconductivity FIG. 1. λ(T) vs . T for YBCO";
        List<LayoutToken> tokens = GrobidAnalyzer.getInstance().tokenizeWithLayoutToken(text);
        String rese = "The\tthe\tT\tTh\tThe\tThe\te\the\tThe\tThe\tBLOCKSTART\tLINESTART\tALIGNEDLEFT\tNEWFONT\tHIGHERFONT\t0\t0\tINITCAP\tNODIGIT\t0\tNOPUNCT\t0\t4\t0\tNUMBER\t0\t0\tI-<paragraph>\n" +
            "mechanism\tmechanism\tm\tme\tmec\tmech\tm\tsm\tism\tnism\tBLOCKIN\tLINEIN\tALIGNEDLEFT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tNOCAPS\tNODIGIT\t0\tNOPUNCT\t0\t4\t0\tNUMBER\t0\t0\t<paragraph>\n" +
            "for\tfor\tf\tfo\tfor\tfor\tr\tor\tfor\tfor\tBLOCKIN\tLINEIN\tALIGNEDLEFT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tNOCAPS\tNODIGIT\t0\tNOPUNCT\t0\t4\t0\tNUMBER\t0\t0\t<paragraph>\n" +
            "superconductivity\tsuperconductivity\ts\tsu\tsup\tsupe\ty\tty\tity\tvity\tBLOCKIN\tLINEIN\tALIGNEDLEFT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tNOCAPS\tNODIGIT\t0\tNOPUNCT\t0\t4\t0\tNUMBER\t0\t0\t<paragraph>" +
            "FIG\tfig\tF\tFI\tFIG\tFIG\tG\tIG\tFIG\tFIG\tBLOCKSTART\tLINESTART\tLINEINDENT\tNEWFONT\tHIGHERFONT\t0\t0\tALLCAP\tNODIGIT\t0\tNOPUNCT\t10\t3\t0\tNUMBER\t0\t0\tI-<figure>\n" +
            ".\t.\t.\t.\t.\t.\t.\t.\t.\t.\tBLOCKIN\tLINEIN\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tALLCAP\tNODIGIT\t1\tDOT\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
            "1\t1\t1\t1\t1\t1\t1\t1\t1\t1\tBLOCKIN\tLINEIN\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tNOCAPS\tALLDIGIT\t1\tNOPUNCT\t10\t3\t0\tNUMBER\t1\t0\t<figure>\n" +
            ".\t.\t.\t.\t.\t.\t.\t.\t.\t.\tBLOCKIN\tLINEIN\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tALLCAP\tNODIGIT\t1\tDOT\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
            "λ\tλ\tλ\tλ\tλ\tλ\tλ\tλ\tλ\tλ\tBLOCKIN\tLINEIN\tLINEINDENT\tNEWFONT\tSAMEFONTSIZE\t0\t0\tNOCAPS\tNODIGIT\t1\tNOPUNCT\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
            "(\t(\t(\t(\t(\t(\t(\t(\t(\t(\tBLOCKIN\tLINEIN\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tALLCAP\tNODIGIT\t1\tOPENBRACKET\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
            "T\tt\tT\tT\tT\tT\tT\tT\tT\tT\tBLOCKIN\tLINEIN\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tALLCAP\tNODIGIT\t1\tNOPUNCT\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
            ")\t)\t)\t)\t)\t)\t)\t)\t)\t)\tBLOCKIN\tLINEIN\tLINEINDENT\tNEWFONT\tSAMEFONTSIZE\t0\t0\tALLCAP\tNODIGIT\t1\tENDBRACKET\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
            "vs\tvs\tv\tvs\tvs\tvs\ts\tvs\tvs\tvs\tBLOCKIN\tLINEIN\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tNOCAPS\tNODIGIT\t0\tNOPUNCT\t10\t3\t0\tNUMBER\t0\t0\tI-<figure>\n" +
            ".\t.\t.\t.\t.\t.\t.\t.\t.\t.\tBLOCKIN\tLINEEND\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tALLCAP\tNODIGIT\t1\tDOT\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
            "T\tt\tT\tT\tT\tT\tT\tT\tT\tT\tBLOCKIN\tLINESTART\tLINEINDENT\tNEWFONT\tSAMEFONTSIZE\t0\t0\tALLCAP\tNODIGIT\t1\tNOPUNCT\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
            "for\tfor\tf\tfo\tfor\tfor\tr\tor\tfor\tfor\tBLOCKIN\tLINEIN\tLINEINDENT\tNEWFONT\tSAMEFONTSIZE\t0\t0\tNOCAPS\tNODIGIT\t0\tNOPUNCT\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
            "YBCO\tybco\tY\tYB\tYBC\tYBCO\tO\tCO\tBCO\tYBCO\tBLOCKIN\tLINEIN\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tALLCAP\tNODIGIT\t0\tNOPUNCT\t10\t3\t0\tNUMBER\t0\t0\t<figure>";

        Pair<String, String> stringStringPair = target.processTrainingDataFigures(rese, tokens, "123");

        System.out.println(stringStringPair.getLeft());
        System.out.println(stringStringPair.getRight());

And the result is quite strange:

  1. if there is only one <figure> block (e.g. when the line "vs\tvs\tv\tvs\tvs\tvs\ts\tvs\tvs\tvs\tBLOCKIN\tLINEIN\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tNOCAPS\tNODIGIT\t0\tNOPUNCT\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" does not have the I- prefix in its label), the output is empty,
  2. if there are two <figure> blocks, then the output is:
<tei>
    <teiHeader>
        <fileDesc xml:id="_123"/>
    </teiHeader>
    <text xml:lang="en">

        <figure>

            <figDesc>superconductivity . 1. λ(T)</figDesc>
        </figure>

    </text>
</tei>

superconductivity   superconductivity   s   su  sup supe    y   ty  ity vity    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   NOPUNCT 0   4   0   NUMBER  0   0   <paragraph>FIG  fig F   FI  FIG FIG G   IG  FIG FIG BLOCKSTART  LINESTART   LINEINDENT  NEWFONT HIGHERFONT  0   0   ALLCAP  NODIGIT 0   NOPUNCT 10  3   0   NUMBER  0   0
.   .   .   .   .   .   .   .   .   .   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   DOT 10  3   0   NUMBER  0   0
1   1   1   1   1   1   1   1   1   1   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  ALLDIGIT    1   NOPUNCT 10  3   0   NUMBER  1   0
.   .   .   .   .   .   .   .   .   .   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   DOT 10  3   0   NUMBER  0   0
λ   λ   λ   λ   λ   λ   λ   λ   λ   λ   BLOCKIN LINEIN  LINEINDENT  NEWFONT SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 1   NOPUNCT 10  3   0   NUMBER  0   0
(   (   (   (   (   (   (   (   (   (   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   OPENBRACKET 10  3   0   NUMBER  0   0
T   t   T   T   T   T   T   T   T   T   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   NOPUNCT 10  3   0   NUMBER  0   0
)   )   )   )   )   )   )   )   )   )   BLOCKIN LINEIN  LINEINDENT  NEWFONT SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   ENDBRACKET  10  3   0   NUMBER  0   0

which does not look correct. It could be that I did not re-create a real case (I took an excerpt from the figure in the PDF I'm attaching).
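One thing worth ruling out in the fixture itself: the "superconductivity" row in the test string ends with <paragraph>" and has no trailing \n, so it is concatenated with the "FIG" row. The second output shows "<paragraph>FIG" on a single line, and this may be why the figDesc starts with "superconductivity" instead of "FIG". Below is a small sketch for catching this kind of merged row by comparing column counts. `FixtureColumnCheck` is a hypothetical helper, not part of GROBID, and it assumes every well-formed row has the same number of tab-separated columns as the first row.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not GROBID code): flags rows with an unexpected column count. */
class FixtureColumnCheck {

    /** Returns the 1-based indices of rows whose column count differs from the first row's. */
    static List<Integer> irregularRows(String labeled) {
        List<Integer> bad = new ArrayList<>();
        String[] rows = labeled.split("\n");
        if (rows.length == 0) {
            return bad;
        }
        // Limit -1 keeps trailing empty columns, so they are counted too.
        int expected = rows[0].split("\t", -1).length;
        for (int i = 0; i < rows.length; i++) {
            if (rows[i].split("\t", -1).length != expected) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // The second row is two rows glued together because a "\n" is missing.
        String labeled = "a\tb\tI-<paragraph>\nc\td\t<paragraph>e\tf\tI-<figure>\ng\th\t<figure>";
        System.out.println(irregularRows(labeled));
    }
}
```

Calling `irregularRows(rese)` on the test string above before passing it to processTrainingDataFigures would show whether any row is merged with its neighbour.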

lfoppiano commented 3 years ago

Here is another case. Four figures are correctly identified by the fulltext parser, but only one appears in the figure training data:

PR05713422-CC.pdf fulltext and figure training data.zip

lfoppiano commented 3 years ago

One thing I forgot to mention: to reproduce this with the data I provided, use the branch features/add-training-data, because otherwise the segmentation and fulltext models on master do not extract the figures correctly.

lfoppiano commented 3 years ago

FYI I just merged the branch features/add-training-data into bugfix/fix-figure-table-training-data-generation so that the issue can be reproduced.

lfoppiano commented 3 years ago

The same issue occurs for tables; the PR attempts to fix both cases.