kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Line number error cases - first line number not removed #638

Closed by de-code 4 years ago

de-code commented 4 years ago

Hi @kermitt2

I have now merged with upstream master and during evaluation I found some error cases where the line numbers are not filtered out.

I can confirm that the line numbers are removed for the example that @lfoppiano was using: https://doi.org/10.1101/2020.04.21.054221 (i.e. it looks like I am doing at least something right).

Here are some examples where it doesn't seem to work. It appears that the first line number (1) is not removed, while subsequent line numbers are removed (I currently don't have a way to visualise the lxml to confirm that more easily). As a result, the title is usually the part most affected.
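One way to inspect the lxml without a viewer is to list every `String` whose `CONTENT` consists only of digits, together with its position. The following is a minimal sketch, not part of Grobid or pdfalto: the helper names `local_name` and `numeric_strings` are hypothetical, and the sample is a trimmed, closed-off version of the Example 3 excerpt. Matching on the local tag name is an assumption made so that the sketch works whether or not the file declares an ALTO namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def numeric_strings(alto_xml):
    """Yield (id, content, hpos, vpos) for each String whose CONTENT is all digits."""
    root = ET.fromstring(alto_xml)
    for elem in root.iter():
        if local_name(elem.tag) != "String":
            continue
        content = elem.get("CONTENT", "")
        if content.isdigit():
            yield (
                elem.get("ID"),
                content,
                float(elem.get("HPOS", "nan")),
                float(elem.get("VPOS", "nan")),
            )


# Trimmed from the Example 3 excerpt (SP elements and most words omitted).
sample = """<Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
  <PrintSpace>
    <TextBlock ID="p1_b1" HPOS="516.000" VPOS="745.510" HEIGHT="10.5360" WIDTH="6.0000">
      <TextLine WIDTH="6.0000" HEIGHT="10.5360" ID="p1_t1" HPOS="516.000" VPOS="745.510">
        <String ID="p1_w1" CONTENT="1" HPOS="516.000" VPOS="745.510" WIDTH="6.0000" HEIGHT="10.5360" STYLEREFS="font0"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="p1_b3" HPOS="89.9960" VPOS="80.4420" HEIGHT="10.2080" WIDTH="231.088">
      <TextLine WIDTH="231.088" HEIGHT="10.2080" ID="p1_t3" HPOS="89.9960" VPOS="80.4420">
        <String ID="p1_w3" CONTENT="Schlafen" HPOS="89.9960" VPOS="80.4420" WIDTH="45.8480" HEIGHT="10.2080" STYLEREFS="font2"/>
        <String ID="p1_w4" CONTENT="11" HPOS="138.902" VPOS="80.4420" WIDTH="12.2320" HEIGHT="10.2080" STYLEREFS="font2"/>
      </TextLine>
    </TextBlock>
  </PrintSpace>
</Page>"""

for row in numeric_strings(sample):
    print(row)
```

On this sample the sketch reports both `p1_w1` ("1") and `p1_w4` ("11"). The second is a legitimate word of the title, so a digits-only test cannot by itself identify a line number; the printed `HPOS`/`VPOS` values are what let a reader tell the two apart.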

Example 1

https://www.biorxiv.org/content/10.1101/210401v1?versioned=true

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
      <PrintSpace>
        <TextBlock ID="p1_b1" HPOS="302.930" VPOS="732.329" HEIGHT="10.2120" WIDTH="6.1382">
          <TextLine WIDTH="6.1382" HEIGHT="10.2120" ID="p1_t1" HPOS="302.930" VPOS="732.329">
            <String ID="p1_w1" CONTENT="1" HPOS="302.930" VPOS="732.329" WIDTH="6.1382" HEIGHT="10.2120" STYLEREFS="font0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b2" HPOS="72.0240" VPOS="74.8640" HEIGHT="10.8000" WIDTH="468.196">
          <TextLine WIDTH="468.196" HEIGHT="10.8000" ID="p1_t2" HPOS="72.0240" VPOS="74.8640">
            <String ID="p1_w2" CONTENT="Combinatorial" HPOS="72.0240" VPOS="74.8640" WIDTH="75.2760" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.0640" VPOS="74.8640" HPOS="147.300"/>
            <String ID="p1_w3" CONTENT="effect" HPOS="158.364" VPOS="74.8640" WIDTH="27.9720" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="10.9920" VPOS="74.8640" HPOS="186.336"/>
            <String ID="p1_w4" CONTENT="of" HPOS="197.328" VPOS="74.8640" WIDTH="9.9960" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.1000" VPOS="74.8640" HPOS="207.324"/>
            <String ID="p1_w5" CONTENT="promoter" HPOS="218.424" VPOS="74.8640" WIDTH="48.5040" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="10.9800" VPOS="74.8640" HPOS="266.928"/>
            <String ID="p1_w6" CONTENT="activity," HPOS="277.908" VPOS="74.8640" WIDTH="41.0520" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.3500" VPOS="74.8640" HPOS="318.960"/>
            <String ID="p1_w7" CONTENT="mRNA" HPOS="330.310" VPOS="74.8640" WIDTH="35.7840" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.1120" VPOS="74.8640" HPOS="366.094"/>
            <String ID="p1_w8" CONTENT="degradation" HPOS="377.206" VPOS="74.8640" WIDTH="62.0760" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.0640" VPOS="74.8640" HPOS="439.282"/>
            <String ID="p1_w9" CONTENT="and" HPOS="450.346" VPOS="74.8640" WIDTH="19.3800" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.2140" VPOS="74.8640" HPOS="469.726"/>
            <String ID="p1_w10" CONTENT="site-specific" HPOS="480.940" VPOS="74.8640" WIDTH="59.2800" HEIGHT="10.8000" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b3" HPOS="48.3600" VPOS="74.7800" HEIGHT="11.0400" WIDTH="5.5973">
          <TextLine WIDTH="5.5973" HEIGHT="11.0400" ID="p1_t3" HPOS="48.3600" VPOS="74.7800"/>
        </TextBlock>
        <TextBlock ID="p1_b4" HPOS="72.0240" VPOS="102.464" HEIGHT="10.8000" WIDTH="320.806">
          <TextLine WIDTH="320.806" HEIGHT="10.8000" ID="p1_t4" HPOS="72.0240" VPOS="102.464">
            <String ID="p1_w12" CONTENT="transcriptional" HPOS="72.0240" VPOS="102.464" WIDTH="76.6200" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="3.0560" VPOS="102.464" HPOS="148.644"/>

Example 2

https://doi.org/10.1101/440115

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
      <PrintSpace>
        <TextBlock ID="p1_b1" HPOS="303.212" VPOS="733.266" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t1" HPOS="303.212" VPOS="733.266">
            <String ID="p1_w1" CONTENT="1" HPOS="303.212" VPOS="733.266" WIDTH="5.5770" HEIGHT="9.9110" STYLEREFS="font0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b2" HPOS="48.4250" VPOS="78.1190" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t2" HPOS="48.4250" VPOS="78.1190"/>
        </TextBlock>
        <TextBlock ID="p1_b3" HPOS="96.4080" VPOS="75.8030" HEIGHT="12.2920" WIDTH="419.174">
          <TextLine WIDTH="419.174" HEIGHT="12.2920" ID="p1_t3" HPOS="96.4080" VPOS="75.8030">
            <String ID="p1_w3" CONTENT="The" HPOS="96.4080" VPOS="75.8030" WIDTH="23.3520" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.4860" VPOS="75.8030" HPOS="119.760"/>
            <String ID="p1_w4" CONTENT="River" HPOS="123.246" VPOS="75.8030" WIDTH="33.4320" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="156.678"/>
            <String ID="p1_w5" CONTENT="Runs" HPOS="160.178" VPOS="75.8030" WIDTH="31.1360" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="191.314"/>
            <String ID="p1_w6" CONTENT="Through" HPOS="194.814" VPOS="75.8030" WIDTH="52.9060" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="247.720"/>
            <String ID="p1_w7" CONTENT="It:" HPOS="251.220" VPOS="75.8030" WIDTH="14.7700" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="265.990"/>
            <String ID="p1_w8" CONTENT="the" HPOS="269.490" VPOS="75.8030" WIDTH="18.6620" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="288.152"/>
            <String ID="p1_w9" CONTENT="Athabasca" HPOS="291.652" VPOS="75.8030" WIDTH="63.0140" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="354.666"/>
            <String ID="p1_w10" CONTENT="River" HPOS="358.166" VPOS="75.8030" WIDTH="33.4320" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.4860" VPOS="75.8030" HPOS="391.598"/>
            <String ID="p1_w11" CONTENT="Delivers" HPOS="395.084" VPOS="75.8030" WIDTH="48.9860" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="444.070"/>
            <String ID="p1_w12" CONTENT="Mercury" HPOS="447.570" VPOS="75.8030" WIDTH="52.8500" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="500.420"/>
            <String ID="p1_w13" CONTENT="to" HPOS="503.920" VPOS="75.8030" WIDTH="11.6620" HEIGHT="12.2920" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b4" HPOS="48.4250" VPOS="96.6320" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t4" HPOS="48.4250" VPOS="96.6320"/>
        </TextBlock>
        <TextBlock ID="p1_b5" HPOS="182.732" VPOS="94.3160" HEIGHT="12.2920" WIDTH="246.540">
          <TextLine WIDTH="246.540" HEIGHT="12.2920" ID="p1_t5" HPOS="182.732" VPOS="94.3160">
            <String ID="p1_w15" CONTENT="Aquatic" HPOS="182.732" VPOS="94.3160" WIDTH="47.4460" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="94.3160" HPOS="230.178"/>
            <String ID="p1_w16" CONTENT="Birds" HPOS="233.678" VPOS="94.3160" WIDTH="32.6760" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="94.3160" HPOS="266.354"/>
            <String ID="p1_w17" CONTENT="Breeding" HPOS="269.854" VPOS="94.3160" WIDTH="54.4460" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="94.3160" HPOS="324.300"/>
            <String ID="p1_w18" CONTENT="Far" HPOS="327.800" VPOS="94.3160" WIDTH="21.7700" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.4860" VPOS="94.3160" HPOS="349.570"/>
            <String ID="p1_w19" CONTENT="Downstream" HPOS="353.056" VPOS="94.3160" WIDTH="76.2160" HEIGHT="12.2920" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b6" HPOS="48.4250" VPOS="123.277" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t6" HPOS="48.4250" VPOS="123.277"/>
        </TextBlock>
        <TextBlock ID="p1_b7" HPOS="48.4250" VPOS="149.145" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t7" HPOS="48.4250" VPOS="149.145"/>
        </TextBlock>

Example 3

https://doi.org/10.1101/434563

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
      <PrintSpace>
        <TextBlock ID="p1_b1" HPOS="516.000" VPOS="745.510" HEIGHT="10.5360" WIDTH="6.0000">
          <TextLine WIDTH="6.0000" HEIGHT="10.5360" ID="p1_t1" HPOS="516.000" VPOS="745.510">
            <String ID="p1_w1" CONTENT="1" HPOS="516.000" VPOS="745.510" WIDTH="6.0000" HEIGHT="10.5360" STYLEREFS="font0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b2" HPOS="67.0000" VPOS="81.6980" HEIGHT="8.7800" WIDTH="5.0000">
          <TextLine WIDTH="5.0000" HEIGHT="8.7800" ID="p1_t2" HPOS="67.0000" VPOS="81.6980"/>
        </TextBlock>
        <TextBlock ID="p1_b3" HPOS="89.9960" VPOS="80.4420" HEIGHT="10.2080" WIDTH="231.088">
          <TextLine WIDTH="231.088" HEIGHT="10.2080" ID="p1_t3" HPOS="89.9960" VPOS="80.4420">
            <String ID="p1_w3" CONTENT="Schlafen" HPOS="89.9960" VPOS="80.4420" WIDTH="45.8480" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="135.844"/>
            <String ID="p1_w4" CONTENT="11" HPOS="138.902" VPOS="80.4420" WIDTH="12.2320" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="151.134"/>
            <String ID="p1_w5" CONTENT="Restricts" HPOS="154.192" VPOS="80.4420" WIDTH="47.0800" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="201.272"/>
            <String ID="p1_w6" CONTENT="Flavivirus" HPOS="204.330" VPOS="80.4420" WIDTH="51.3590" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="255.689"/>
            <String ID="p1_w7" CONTENT="Replication." HPOS="258.747" VPOS="80.4420" WIDTH="62.3370" HEIGHT="10.2080" STYLEREFS="font2"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b4" HPOS="67.0000" VPOS="112.996" HEIGHT="8.7800" WIDTH="5.0000">
          <TextLine WIDTH="5.0000" HEIGHT="8.7800" ID="p1_t4" HPOS="67.0000" VPOS="112.996"/>
        </TextBlock>
        <TextBlock ID="p1_b5" HPOS="67.0000" VPOS="138.294" HEIGHT="8.7800" WIDTH="5.0000">
          <TextLine WIDTH="5.0000" HEIGHT="8.7800" ID="p1_t5" HPOS="67.0000" VPOS="138.294"/>
        </TextBlock>
        <TextBlock ID="p1_b6" HPOS="89.9960" VPOS="137.038" HEIGHT="10.2080" WIDTH="419.244">
          <TextLine WIDTH="419.244" HEIGHT="10.2080" ID="p1_t6" HPOS="89.9960" VPOS="136.569">
            <String ID="p1_w10" CONTENT="Federico" HPOS="89.9960" VPOS="137.038" WIDTH="42.8010" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="3.0580" VPOS="137.038" HPOS="132.797"/>
            <String ID="p1_w11" CONTENT="Valdez" HPOS="135.855" VPOS="137.038" WIDTH="33.6270" HEIGHT="10.2080" STYLEREFS="font3"/>
            <String ID="p1_w12" CONTENT="a" HPOS="169.487" VPOS="136.569" WIDTH="3.8920" HEIGHT="6.4960" STYLEREFS="font4"/>
            <String ID="p1_w13" CONTENT="," HPOS="173.380" VPOS="137.038" WIDTH="3.0580" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="3.0580" VPOS="137.038" HPOS="176.438"/>
            <String ID="p1_w14" CONTENT="Julienne" HPOS="179.496" VPOS="137.038" WIDTH="40.9750" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="3.0580" VPOS="137.038" HPOS="220.471"/>
            <String ID="p1_w15" CONTENT="Salvador" HPOS="223.529" VPOS="137.038" WIDTH="43.4060" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="1.9500" VPOS="137.038" HPOS="266.935"/>

kermitt2 commented 4 years ago

Thank you @de-code for the error cases, which are very useful!

The remaining line numbers not filtered by pdfalto don't appear to affect Grobid. When running these PDFs, all the titles are correct because the lone remaining line number is neutralized by the header model. One exception is the third example, in particular its abstract, where some lower line numbers still appear in the text (it's strange that the whole line number column is not filtered out by pdfalto given how it works now, but this might be related to other problems that affect the current mechanism).
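To check whether a whole column was emptied, one can group narrow `TextLine` elements by horizontal position and count how many still contain a `String`. This is only an illustrative sketch and not how pdfalto detects line numbers: the helper `narrow_columns`, the width threshold `MAX_WIDTH`, and the rounding of `HPOS` to one decimal are all assumptions, and the sample is a trimmed, closed-off version of the Example 2 excerpt.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

MAX_WIDTH = 15.0  # assumed upper bound for the width of a line-number token


def local_name(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def narrow_columns(alto_xml, max_width=MAX_WIDTH):
    """Map rounded HPOS -> [emptied_lines, lines_with_text] for narrow TextLines."""
    root = ET.fromstring(alto_xml)
    columns = defaultdict(lambda: [0, 0])
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        # Lines without a WIDTH attribute are treated as wide and skipped.
        if float(line.get("WIDTH", "inf")) > max_width:
            continue
        has_text = any(local_name(child.tag) == "String" for child in line)
        hpos = round(float(line.get("HPOS", "nan")), 1)
        columns[hpos][1 if has_text else 0] += 1
    return dict(columns)


# Trimmed from the Example 2 excerpt (most words and SP elements omitted).
sample = """<Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
  <PrintSpace>
    <TextBlock ID="p1_b1" HPOS="303.212" VPOS="733.266" HEIGHT="9.9110" WIDTH="5.5770">
      <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t1" HPOS="303.212" VPOS="733.266">
        <String ID="p1_w1" CONTENT="1" HPOS="303.212" VPOS="733.266" WIDTH="5.5770" HEIGHT="9.9110" STYLEREFS="font0"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="p1_b2" HPOS="48.4250" VPOS="78.1190" HEIGHT="9.9110" WIDTH="5.5770">
      <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t2" HPOS="48.4250" VPOS="78.1190"/>
    </TextBlock>
    <TextBlock ID="p1_b3" HPOS="96.4080" VPOS="75.8030" HEIGHT="12.2920" WIDTH="419.174">
      <TextLine WIDTH="419.174" HEIGHT="12.2920" ID="p1_t3" HPOS="96.4080" VPOS="75.8030">
        <String ID="p1_w3" CONTENT="The" HPOS="96.4080" VPOS="75.8030" WIDTH="23.3520" HEIGHT="12.2920" STYLEREFS="font1"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="p1_b4" HPOS="48.4250" VPOS="96.6320" HEIGHT="9.9110" WIDTH="5.5770">
      <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t4" HPOS="48.4250" VPOS="96.6320"/>
    </TextBlock>
  </PrintSpace>
</Page>"""

print(narrow_columns(sample))
```

On this trimmed sample, the narrow lines at `HPOS` 48.4 are all empty, while the one narrow line that still holds text (the remaining `1`) sits at `HPOS` 303.2. A column that mixes emptied and non-empty lines would be a candidate for the partial filtering described for the third example.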

This issue would be rather for pdfalto, now that line numbers are entirely tackled by pdfalto, not by Grobid.

If you find more, don't hesitate to share them; that will be very helpful to drive the next work iteration on pdfalto!

de-code commented 4 years ago

> This issue would be rather for pdfalto, now that line numbers are entirely tackled by pdfalto, not by Grobid.

I did indeed intend to create the issue against pdfalto but didn't pay enough attention. I have moved it over to https://github.com/kermitt2/pdfalto/issues/101