kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

Line number error cases - first line number not removed #101

Open de-code opened 4 years ago

de-code commented 4 years ago

Hi @kermitt2

I have now merged with upstream master and during evaluation I found some error cases where the line numbers are not filtered out.

I can confirm that the line numbers are removed for the example that @lfoppiano was using: https://doi.org/10.1101/2020.04.21.054221 (i.e. it looks like I am doing at least something right).

Here are some examples where it doesn't seem to work. It appears that the first line number (`1`) is not removed, while subsequent line numbers appear to be removed (I currently don't have a way to visualise the lxml to confirm that more easily). Thus the title is usually the most affected.

Example 1

https://www.biorxiv.org/content/10.1101/210401v1?versioned=true

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
      <PrintSpace>
        <TextBlock ID="p1_b1" HPOS="302.930" VPOS="732.329" HEIGHT="10.2120" WIDTH="6.1382">
          <TextLine WIDTH="6.1382" HEIGHT="10.2120" ID="p1_t1" HPOS="302.930" VPOS="732.329">
            <String ID="p1_w1" CONTENT="1" HPOS="302.930" VPOS="732.329" WIDTH="6.1382" HEIGHT="10.2120" STYLEREFS="font0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b2" HPOS="72.0240" VPOS="74.8640" HEIGHT="10.8000" WIDTH="468.196">
          <TextLine WIDTH="468.196" HEIGHT="10.8000" ID="p1_t2" HPOS="72.0240" VPOS="74.8640">
            <String ID="p1_w2" CONTENT="Combinatorial" HPOS="72.0240" VPOS="74.8640" WIDTH="75.2760" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.0640" VPOS="74.8640" HPOS="147.300"/>
            <String ID="p1_w3" CONTENT="effect" HPOS="158.364" VPOS="74.8640" WIDTH="27.9720" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="10.9920" VPOS="74.8640" HPOS="186.336"/>
            <String ID="p1_w4" CONTENT="of" HPOS="197.328" VPOS="74.8640" WIDTH="9.9960" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.1000" VPOS="74.8640" HPOS="207.324"/>
            <String ID="p1_w5" CONTENT="promoter" HPOS="218.424" VPOS="74.8640" WIDTH="48.5040" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="10.9800" VPOS="74.8640" HPOS="266.928"/>
            <String ID="p1_w6" CONTENT="activity," HPOS="277.908" VPOS="74.8640" WIDTH="41.0520" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.3500" VPOS="74.8640" HPOS="318.960"/>
            <String ID="p1_w7" CONTENT="mRNA" HPOS="330.310" VPOS="74.8640" WIDTH="35.7840" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.1120" VPOS="74.8640" HPOS="366.094"/>
            <String ID="p1_w8" CONTENT="degradation" HPOS="377.206" VPOS="74.8640" WIDTH="62.0760" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.0640" VPOS="74.8640" HPOS="439.282"/>
            <String ID="p1_w9" CONTENT="and" HPOS="450.346" VPOS="74.8640" WIDTH="19.3800" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.2140" VPOS="74.8640" HPOS="469.726"/>
            <String ID="p1_w10" CONTENT="site-specific" HPOS="480.940" VPOS="74.8640" WIDTH="59.2800" HEIGHT="10.8000" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b3" HPOS="48.3600" VPOS="74.7800" HEIGHT="11.0400" WIDTH="5.5973">
          <TextLine WIDTH="5.5973" HEIGHT="11.0400" ID="p1_t3" HPOS="48.3600" VPOS="74.7800"/>
        </TextBlock>
        <TextBlock ID="p1_b4" HPOS="72.0240" VPOS="102.464" HEIGHT="10.8000" WIDTH="320.806">
          <TextLine WIDTH="320.806" HEIGHT="10.8000" ID="p1_t4" HPOS="72.0240" VPOS="102.464">
            <String ID="p1_w12" CONTENT="transcriptional" HPOS="72.0240" VPOS="102.464" WIDTH="76.6200" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="3.0560" VPOS="102.464" HPOS="148.644"/>

Example 2

https://doi.org/10.1101/440115

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
      <PrintSpace>
        <TextBlock ID="p1_b1" HPOS="303.212" VPOS="733.266" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t1" HPOS="303.212" VPOS="733.266">
            <String ID="p1_w1" CONTENT="1" HPOS="303.212" VPOS="733.266" WIDTH="5.5770" HEIGHT="9.9110" STYLEREFS="font0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b2" HPOS="48.4250" VPOS="78.1190" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t2" HPOS="48.4250" VPOS="78.1190"/>
        </TextBlock>
        <TextBlock ID="p1_b3" HPOS="96.4080" VPOS="75.8030" HEIGHT="12.2920" WIDTH="419.174">
          <TextLine WIDTH="419.174" HEIGHT="12.2920" ID="p1_t3" HPOS="96.4080" VPOS="75.8030">
            <String ID="p1_w3" CONTENT="The" HPOS="96.4080" VPOS="75.8030" WIDTH="23.3520" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.4860" VPOS="75.8030" HPOS="119.760"/>
            <String ID="p1_w4" CONTENT="River" HPOS="123.246" VPOS="75.8030" WIDTH="33.4320" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="156.678"/>
            <String ID="p1_w5" CONTENT="Runs" HPOS="160.178" VPOS="75.8030" WIDTH="31.1360" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="191.314"/>
            <String ID="p1_w6" CONTENT="Through" HPOS="194.814" VPOS="75.8030" WIDTH="52.9060" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="247.720"/>
            <String ID="p1_w7" CONTENT="It:" HPOS="251.220" VPOS="75.8030" WIDTH="14.7700" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="265.990"/>
            <String ID="p1_w8" CONTENT="the" HPOS="269.490" VPOS="75.8030" WIDTH="18.6620" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="288.152"/>
            <String ID="p1_w9" CONTENT="Athabasca" HPOS="291.652" VPOS="75.8030" WIDTH="63.0140" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="354.666"/>
            <String ID="p1_w10" CONTENT="River" HPOS="358.166" VPOS="75.8030" WIDTH="33.4320" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.4860" VPOS="75.8030" HPOS="391.598"/>
            <String ID="p1_w11" CONTENT="Delivers" HPOS="395.084" VPOS="75.8030" WIDTH="48.9860" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="444.070"/>
            <String ID="p1_w12" CONTENT="Mercury" HPOS="447.570" VPOS="75.8030" WIDTH="52.8500" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="500.420"/>
            <String ID="p1_w13" CONTENT="to" HPOS="503.920" VPOS="75.8030" WIDTH="11.6620" HEIGHT="12.2920" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b4" HPOS="48.4250" VPOS="96.6320" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t4" HPOS="48.4250" VPOS="96.6320"/>
        </TextBlock>
        <TextBlock ID="p1_b5" HPOS="182.732" VPOS="94.3160" HEIGHT="12.2920" WIDTH="246.540">
          <TextLine WIDTH="246.540" HEIGHT="12.2920" ID="p1_t5" HPOS="182.732" VPOS="94.3160">
            <String ID="p1_w15" CONTENT="Aquatic" HPOS="182.732" VPOS="94.3160" WIDTH="47.4460" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="94.3160" HPOS="230.178"/>
            <String ID="p1_w16" CONTENT="Birds" HPOS="233.678" VPOS="94.3160" WIDTH="32.6760" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="94.3160" HPOS="266.354"/>
            <String ID="p1_w17" CONTENT="Breeding" HPOS="269.854" VPOS="94.3160" WIDTH="54.4460" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="94.3160" HPOS="324.300"/>
            <String ID="p1_w18" CONTENT="Far" HPOS="327.800" VPOS="94.3160" WIDTH="21.7700" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.4860" VPOS="94.3160" HPOS="349.570"/>
            <String ID="p1_w19" CONTENT="Downstream" HPOS="353.056" VPOS="94.3160" WIDTH="76.2160" HEIGHT="12.2920" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b6" HPOS="48.4250" VPOS="123.277" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t6" HPOS="48.4250" VPOS="123.277"/>
        </TextBlock>
        <TextBlock ID="p1_b7" HPOS="48.4250" VPOS="149.145" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t7" HPOS="48.4250" VPOS="149.145"/>
        </TextBlock>

Example 3

https://doi.org/10.1101/434563

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
      <PrintSpace>
        <TextBlock ID="p1_b1" HPOS="516.000" VPOS="745.510" HEIGHT="10.5360" WIDTH="6.0000">
          <TextLine WIDTH="6.0000" HEIGHT="10.5360" ID="p1_t1" HPOS="516.000" VPOS="745.510">
            <String ID="p1_w1" CONTENT="1" HPOS="516.000" VPOS="745.510" WIDTH="6.0000" HEIGHT="10.5360" STYLEREFS="font0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b2" HPOS="67.0000" VPOS="81.6980" HEIGHT="8.7800" WIDTH="5.0000">
          <TextLine WIDTH="5.0000" HEIGHT="8.7800" ID="p1_t2" HPOS="67.0000" VPOS="81.6980"/>
        </TextBlock>
        <TextBlock ID="p1_b3" HPOS="89.9960" VPOS="80.4420" HEIGHT="10.2080" WIDTH="231.088">
          <TextLine WIDTH="231.088" HEIGHT="10.2080" ID="p1_t3" HPOS="89.9960" VPOS="80.4420">
            <String ID="p1_w3" CONTENT="Schlafen" HPOS="89.9960" VPOS="80.4420" WIDTH="45.8480" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="135.844"/>
            <String ID="p1_w4" CONTENT="11" HPOS="138.902" VPOS="80.4420" WIDTH="12.2320" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="151.134"/>
            <String ID="p1_w5" CONTENT="Restricts" HPOS="154.192" VPOS="80.4420" WIDTH="47.0800" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="201.272"/>
            <String ID="p1_w6" CONTENT="Flavivirus" HPOS="204.330" VPOS="80.4420" WIDTH="51.3590" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="255.689"/>
            <String ID="p1_w7" CONTENT="Replication." HPOS="258.747" VPOS="80.4420" WIDTH="62.3370" HEIGHT="10.2080" STYLEREFS="font2"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b4" HPOS="67.0000" VPOS="112.996" HEIGHT="8.7800" WIDTH="5.0000">
          <TextLine WIDTH="5.0000" HEIGHT="8.7800" ID="p1_t4" HPOS="67.0000" VPOS="112.996"/>
        </TextBlock>
        <TextBlock ID="p1_b5" HPOS="67.0000" VPOS="138.294" HEIGHT="8.7800" WIDTH="5.0000">
          <TextLine WIDTH="5.0000" HEIGHT="8.7800" ID="p1_t5" HPOS="67.0000" VPOS="138.294"/>
        </TextBlock>
        <TextBlock ID="p1_b6" HPOS="89.9960" VPOS="137.038" HEIGHT="10.2080" WIDTH="419.244">
          <TextLine WIDTH="419.244" HEIGHT="10.2080" ID="p1_t6" HPOS="89.9960" VPOS="136.569">
            <String ID="p1_w10" CONTENT="Federico" HPOS="89.9960" VPOS="137.038" WIDTH="42.8010" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="3.0580" VPOS="137.038" HPOS="132.797"/>
            <String ID="p1_w11" CONTENT="Valdez" HPOS="135.855" VPOS="137.038" WIDTH="33.6270" HEIGHT="10.2080" STYLEREFS="font3"/>
            <String ID="p1_w12" CONTENT="a" HPOS="169.487" VPOS="136.569" WIDTH="3.8920" HEIGHT="6.4960" STYLEREFS="font4"/>
            <String ID="p1_w13" CONTENT="," HPOS="173.380" VPOS="137.038" WIDTH="3.0580" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="3.0580" VPOS="137.038" HPOS="176.438"/>
            <String ID="p1_w14" CONTENT="Julienne" HPOS="179.496" VPOS="137.038" WIDTH="40.9750" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="3.0580" VPOS="137.038" HPOS="220.471"/>
            <String ID="p1_w15" CONTENT="Salvador" HPOS="223.529" VPOS="137.038" WIDTH="43.4060" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="1.9500" VPOS="137.038" HPOS="266.935"/>

(see also https://github.com/kermitt2/grobid/issues/638#issuecomment-690310642)
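One way to inspect the ALTO output without a viewer is to list the text lines that are empty and the lines that consist of a single number. The following is a minimal sketch using only Python's standard library. The helper names (`local`, `summarize`) and the trimmed `SAMPLE` (cut down from Example 1) are illustrative and not part of pdfalto, and reading an empty `TextLine` as the trace of a removed line number is an assumption based on the examples above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Example 1, keeping only the attributes used below.
SAMPLE = """
<Page ID="Page1" WIDTH="612.000" HEIGHT="792.000">
  <PrintSpace>
    <TextBlock ID="p1_b1" HPOS="302.930" VPOS="732.329">
      <TextLine ID="p1_t1" HPOS="302.930" VPOS="732.329">
        <String ID="p1_w1" CONTENT="1" HPOS="302.930" VPOS="732.329"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="p1_b2" HPOS="72.0240" VPOS="74.8640">
      <TextLine ID="p1_t2" HPOS="72.0240" VPOS="74.8640">
        <String ID="p1_w2" CONTENT="Combinatorial" HPOS="72.0240" VPOS="74.8640"/>
        <String ID="p1_w3" CONTENT="effect" HPOS="158.364" VPOS="74.8640"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="p1_b3" HPOS="48.3600" VPOS="74.7800">
      <TextLine ID="p1_t3" HPOS="48.3600" VPOS="74.7800"/>
    </TextBlock>
  </PrintSpace>
</Page>
"""


def local(tag):
    """Drop an XML namespace prefix such as '{...}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def summarize(root):
    """Return (empty_lines, numeric_lines) as lists of (ID, HPOS, VPOS)."""
    empty, numeric = [], []
    for el in root.iter():
        if local(el.tag) != "TextLine":
            continue
        words = [c.get("CONTENT", "") for c in el if local(c.tag) == "String"]
        pos = (el.get("ID"), float(el.get("HPOS")), float(el.get("VPOS")))
        if not words:
            # No String children: possibly a filtered-out line number.
            empty.append(pos)
        elif len(words) == 1 and words[0].isdigit():
            # A lone number that survived filtering.
            numeric.append(pos)
    return empty, numeric


empty, numeric = summarize(ET.fromstring(SAMPLE.strip()))
print("empty lines:", empty)
print("numeric-only lines:", numeric)
```

For a full pdfalto output file, `ET.parse(path).getroot()` can be passed to `summarize` instead of the parsed sample.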

de-code commented 4 years ago

BTW, I found these by looking at regressions when using my models with the new version of GROBID versus the previous version. This affected title extraction with my models. It could be that the models are no longer tolerant of line numbers. I will re-generate the training data etc., and the problem might then go away.

kermitt2 commented 3 years ago

Hi Daniel,

It's actually working in these examples as far as the starting line numbers are concerned. The remaining number `1` is a page number. As the block order follows the PDF stream order by default, the page number appears at the beginning of the page in the ALTO output, although visually it is located at the end of the page.
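The difference between stream order and visual order can be checked by sorting the blocks of a page by `VPOS`, which in ALTO is measured from the top of the page. This is a minimal sketch, not pdfalto code: the helper names (`local`, `blocks`) are hypothetical and `SAMPLE` is cut down from Example 1 to three blocks.

```python
import xml.etree.ElementTree as ET

# Three blocks from Example 1, in the order they appear in the ALTO output.
SAMPLE = """
<Page ID="Page1" WIDTH="612.000" HEIGHT="792.000">
  <PrintSpace>
    <TextBlock ID="p1_b1" HPOS="302.930" VPOS="732.329">
      <TextLine><String CONTENT="1"/></TextLine>
    </TextBlock>
    <TextBlock ID="p1_b2" HPOS="72.0240" VPOS="74.8640">
      <TextLine><String CONTENT="Combinatorial"/><String CONTENT="effect"/></TextLine>
    </TextBlock>
    <TextBlock ID="p1_b4" HPOS="72.0240" VPOS="102.464">
      <TextLine><String CONTENT="transcriptional"/></TextLine>
    </TextBlock>
  </PrintSpace>
</Page>
"""


def local(tag):
    """Drop an XML namespace prefix such as '{...}TextBlock'."""
    return tag.rsplit("}", 1)[-1]


def blocks(root):
    """Return (ID, VPOS, text) for each TextBlock, in document (stream) order."""
    out = []
    for el in root.iter():
        if local(el.tag) == "TextBlock":
            text = " ".join(
                s.get("CONTENT", "") for s in el.iter() if local(s.tag) == "String"
            )
            out.append((el.get("ID"), float(el.get("VPOS")), text))
    return out


stream_order = blocks(ET.fromstring(SAMPLE.strip()))
# Smaller VPOS means higher on the page, so this approximates top-to-bottom order.
visual_order = sorted(stream_order, key=lambda b: b[1])

print("stream order:", [b[2] for b in stream_order])
print("visual order:", [b[2] for b in visual_order])
```

In this sample the lone `1` comes first in stream order and last in visual order, which is consistent with a page number printed at the bottom of the page.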

For the first PDF, for instance, we have the following text content stream:

Combinatorial effect of promoter activity, mRNA degradation and site-specific

transcriptional pausing in modulating protein expression noise

Sangjin Kim 1,2,3 , Christine Jacobs-Wagner 1,2,3,4*


- second page: 

2

ABSTRACT

Genetically identical cells exhibit diverse phenotypes, even when experiencing the same

environment. This phenomenon, in part, originates from cell-to-cell variability (noise) in protein



and so on, where the first token is the page number.

The same applies to the two other examples: the page number is at the beginning of the page token stream.

However, in the second PDF there are a few problems, for instance on the second and fifth pages, where line numbers from `34` to `44` and `66` to `69` still appear in the ALTO output. In these cases there is a slight change of width and alignment on both sides starting from `34` (not easy to see), and my clustering method insists on an exact alignment on at least the left or the right side. To cover that, I slightly relaxed the alignment check to allow a 1.0 unit margin.

![Screenshot from 2021-04-05 18-08-59](https://user-images.githubusercontent.com/2340795/113598620-b2b04500-963d-11eb-9f1a-912d553ca1d2.png)
![Screenshot from 2021-04-05 18-24-56](https://user-images.githubusercontent.com/2340795/113598619-b17f1800-963d-11eb-91a1-9c9044ba0a57.png)
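The effect of relaxing the alignment check can be illustrated with a small sketch. This is not the pdfalto implementation (which is C++) and the actual clustering may measure alignment differently. The function names and the shifted third box are hypothetical; only the 1.0 unit margin and the left-or-right alignment requirement come from the description above.

```python
from typing import List, Tuple

Box = Tuple[float, float]  # (HPOS, WIDTH) of a candidate line-number token


def within(values: List[float], tol: float) -> bool:
    """True if all values lie within `tol` units of each other."""
    return max(values) - min(values) <= tol


def is_aligned_column(boxes: List[Box], tol: float = 1.0) -> bool:
    """Accept a column if the left OR the right edges agree within `tol`."""
    lefts = [hpos for hpos, _ in boxes]
    rights = [hpos + width for hpos, width in boxes]
    return within(lefts, tol) or within(rights, tol)


# Two tokens with the geometry seen in Example 2, plus one hypothetical token
# whose position and width are slightly different.
candidates = [(48.425, 5.577), (48.425, 5.577), (48.900, 5.700)]

print(is_aligned_column(candidates, tol=0.0))  # exact alignment required
print(is_aligned_column(candidates, tol=1.0))  # relaxed by a 1.0 unit margin
```

With an exact match required, the shifted token breaks the column on both sides. With the 1.0 unit margin, the left edges still count as aligned.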

This case now works too, as of commit aaac4cd9379395f7e108b20a7a4c13e384545475.