kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Some incomplete coordinates for sentence elements #811

Closed kermitt2 closed 3 years ago

kermitt2 commented 3 years ago

For this example (preprint):

[Uploading document_sentence_segmentation_issues.pdf…]()

We get incomplete bounding boxes in the sentence-level coordinates; see the last five sentences of this paragraph:

          <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head coords="23,54.00,212.69,163.38,11.14">ODD-luciferase activity assaay</head>
                <p>
                    <s coords="23,54.00,240.29,504.00,11.14;23,54.00,267.89,158.47,11.14">The ODD-luciferase construct with pcDNA3.1 plasmid vector was constructed as previously described 
                        <ref type="bibr" coords="23,108.63,267.89,92.50,11.14" target="#b42">(Safran et al. 2006</ref>).
                    </s>
                    <s coords="23,215.06,267.89,342.94,11.14;23,54.00,295.49,504.01,11.14;23,54.00,323.09,222.14,11.14">The proline p402 and p564 present within the oxygen degradation domain (ODD) of HIF1α, when hydroxylated by HIF-PHDs, allow its binding to the VHL protein that target it for proteasomal degradation.</s>
                    <s coords="23,279.87,323.09,278.13,11.14;23,54.00,350.69,354.71,11.14">In this way, the stabilization of ODD can be used as a marker of HIF1α stability 
                        <ref type="bibr" coords="23,197.48,350.69,95.96,11.14" target="#b42">(Safran et al. 2006</ref>
                        <ref type="bibr" coords="23,293.44,350.69,115.26,11.14" target="#b48">, Smirnova et al. 2010</ref>.
                    </s>
                    <s coords="23,408.71,350.69,11.34,11.14">Because of the luciferase tagged with ODD, the increase in ODD stability leads to a proportional increase in the luciferase activity and this provides a very good way of measuring the HIF1α stability in a quantitative manner with a wide dynamic range.</s>
                    <s coords="23,423.35,350.69,46.70,11.14">To this end, we used SH-SY5Y cells stably expressing ODDluciferase.</s>
                    <s coords="23,473.35,350.69,10.01,11.14">These cells were made by co-transfecting ODD-luciferase plasmid along with a puromycin resistance plasmid in SH-SY5Y cells and stably transfected cells were positively selected in presence of 4μg/ml of puromycin.</s>
                    <s coords="23,486.66,350.69,71.34,11.14">Luciferase activity was measured by luciferase assay kit (promega) using an LMaxII TM microplate luminometer (molecular Devices).</s>
                    <s coords="23,54.00,378.29,36.70,11.14">ODDluciferase activity was normalized to the protein content of each well measured by Bio-Rad DC TM protein assay kit.</s>
                </p>
            </div>
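A quick way to spot this kind of truncated box is to compare the total box width of each `<s>` with the length of its text. The sketch below is only a diagnostic aid and is not part of GROBID: it assumes the `coords` format `page,x,y,width,height` with `;` between boxes, and the function names and the threshold of 2.0 points per character are assumptions chosen to fit this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> List[Box]:
    """Parse a coords attribute: 'page,x,y,width,height' boxes joined by ';'."""
    boxes = []
    for part in coords.split(";"):
        page, x, y, width, height = part.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(width), float(height)))
    return boxes


def looks_incomplete(coords: str, text: str, min_points_per_char: float = 2.0) -> bool:
    """Flag a sentence whose boxes are too narrow for the amount of text.

    The threshold is a hypothetical value, not something taken from GROBID.
    """
    total_width = sum(box.width for box in parse_coords(coords))
    return total_width / max(len(text), 1) < min_points_per_char


# Coordinates and text copied from the output above.
suspicious = looks_incomplete(
    "23,423.35,350.69,46.70,11.14",
    "To this end, we used SH-SY5Y cells stably expressing ODDluciferase.",
)
```

In the output above, the first three sentences come out at roughly 5 points per character, while the last five are all below 1, so a loose threshold is enough to separate them.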
kermitt2 commented 3 years ago

Error case reported via https://github.com/DataSeer/dataseer-web/issues/441

lfoppiano commented 3 years ago

It seems that the PDF is not reachable 😺

kermitt2 commented 3 years ago

Sorry, poor internet connection :(

document_sentence_segmentation_issues.pdf

kermitt2 commented 3 years ago

Normally the text to be segmented includes the references (all text including descendant elements):

https://github.com/kermitt2/grobid/blob/cdb52adaebddd6215f87a7d144d7f7ff118233bb/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java#L1402

and we only keep track of the positions of the references to pass the "forbidden positions" to the segmenter:

https://github.com/kermitt2/grobid/blob/cdb52adaebddd6215f87a7d144d7f7ff118233bb/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java#L1430
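To illustrate the idea of "forbidden positions", here is a minimal sketch of a sentence splitter that refuses to break inside given character spans. It is not the segmenter GROBID uses; the function name, the naive punctuation rule, and the span representation (start inclusive, end exclusive) are all assumptions made for the example.

```python
import re
from typing import List, Tuple


def split_sentences(text: str, forbidden: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of sentences in text.

    A candidate boundary is sentence-final punctuation followed by whitespace.
    Candidates whose punctuation falls inside a forbidden span are skipped.
    """
    spans = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        punct = match.start()
        if any(lo <= punct < hi for lo, hi in forbidden):
            continue  # e.g. the "." of "et al." inside a reference
        spans.append((start, punct + 1))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


text = "It was described (Safran et al. 2006). Next sentence."
ref = "(Safran et al. 2006"
lo = text.index(ref)
sentences = [text[a:b] for a, b in split_sentences(text, [(lo, lo + len(ref))])]
```

Without the forbidden span, the same text would be cut after "et al.", which is the kind of split the reference positions are meant to prevent.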

Up to that step everything seems to work fine: the texts of the sentences look good.

The problem probably occurs when we try to group the LayoutToken objects corresponding to each sentence in segmentedParagraphTokens. The segmented text comes from the XML and differs slightly from the LayoutToken text (de-hyphenization, some spaces removed), so the alignment can be challenging.
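To make the alignment difficulty concrete, here is a small sketch that groups tokens by sentence by counting non-whitespace characters, then reports the sentences whose grouped tokens do not match the sentence text. This is not the code in TEIFormatter: the function names are hypothetical, tokens are plain strings rather than LayoutToken objects, and de-hyphenization is deliberately left unhandled to show how a single extra character shifts the following sentences.

```python
from typing import List


def strip_ws(text: str) -> str:
    """Remove all whitespace so spacing differences do not matter."""
    return "".join(text.split())


def group_tokens_by_sentence(tokens: List[str], sentences: List[str]) -> List[List[str]]:
    """Consume tokens until each sentence's non-whitespace length is covered."""
    groups = []
    pos = 0
    for sentence in sentences:
        target = len(strip_ws(sentence))
        group, consumed = [], 0
        while pos < len(tokens) and consumed < target:
            group.append(tokens[pos])
            consumed += len(strip_ws(tokens[pos]))
            pos += 1
        groups.append(group)
    if groups and pos < len(tokens):
        groups[-1].extend(tokens[pos:])  # leftover tokens go to the last sentence
    return groups


def misaligned_sentences(groups: List[List[str]], sentences: List[str]) -> List[int]:
    """Indices of sentences whose grouped tokens differ from the sentence text."""
    return [
        i
        for i, (group, sentence) in enumerate(zip(groups, sentences))
        if strip_ws("".join(group)) != strip_ws(sentence)
    ]


# The token stream keeps the hyphen and line break, the sentence text does not.
tokens = ["ODD", "-", "\n", "luciferase", ".", " ", "Next", "."]
sentences = ["ODDluciferase.", "Next."]
groups = group_tokens_by_sentence(tokens, sentences)
drifted = misaligned_sentences(groups, sentences)
```

Here the hyphen that was removed from the XML text is still present in the tokens, so the first group ends one token too early and the second sentence is affected as well.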

kermitt2 commented 3 years ago

PR #821 fixes the problem, which was due to a leftover from the reference pattern (the year pattern) that was missing in the XML.

All the coordinates for sentence elements now look good:

           <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head coords="23,54.00,212.69,163.38,11.14">ODD-luciferase activity assaay</head>
                <p>
                    <s coords="23,54.00,240.29,504.00,11.14;23,54.00,267.89,158.47,11.14">The ODD-luciferase construct with pcDNA3.1 plasmid vector was constructed as previously described 
                        <ref type="bibr" coords="23,108.63,267.89,92.50,11.14" target="#b42">(Safran et al. 2006</ref>).
                    </s>
                    <s coords="23,215.06,267.89,342.94,11.14;23,54.00,295.49,504.01,11.14;23,54.00,323.09,222.14,11.14">The proline p402 and p564 present within the oxygen degradation domain (ODD) of HIF1α, when hydroxylated by HIF-PHDs, allow its binding to the VHL protein that target it for proteasomal degradation.</s>
                    <s coords="23,279.87,323.09,278.13,11.14;23,54.00,350.69,366.05,11.14">In this way, the stabilization of ODD can be used as a marker of HIF1α stability 
                        <ref type="bibr" coords="23,197.48,350.69,95.96,11.14" target="#b42">(Safran et al. 2006</ref>
                        <ref type="bibr" coords="23,293.44,350.69,120.93,11.14" target="#b48">, Smirnova et al. 2010)</ref>.
                    </s>
                    <s coords="23,423.35,350.69,134.65,11.14;23,54.00,378.29,504.01,11.14;23,54.00,405.89,504.01,11.14;23,54.00,433.49,185.14,11.14">Because of the luciferase tagged with ODD, the increase in ODD stability leads to a proportional increase in the luciferase activity and this provides a very good way of measuring the HIF1α stability in a quantitative manner with a wide dynamic range.</s>
                    <s coords="23,241.74,433.49,316.26,11.14;23,54.00,461.09,54.70,11.14">To this end, we used SH-SY5Y cells stably expressing ODDluciferase.</s>
                    <s coords="23,114.49,461.09,443.52,11.14;23,54.00,488.69,504.01,11.14;23,54.00,516.29,245.80,11.14">These cells were made by co-transfecting ODD-luciferase plasmid along with a puromycin resistance plasmid in SH-SY5Y cells and stably transfected cells were positively selected in presence of 4μg/ml of puromycin.</s>
                    <s coords="23,304.37,516.29,253.64,11.14;23,54.00,543.89,205.57,11.14;23,259.57,542.76,11.58,7.46;23,276.88,543.89,244.71,11.14">Luciferase activity was measured by luciferase assay kit (promega) using an LMaxII TM microplate luminometer (molecular Devices).</s>
                    <s coords="23,527.35,543.89,30.65,11.14;23,54.00,571.49,492.45,11.14;23,546.45,570.36,11.58,7.46;23,54.00,599.09,90.05,11.14">ODDluciferase activity was normalized to the protein content of each well measured by Bio-Rad DC TM protein assay kit.</s>
                </p>
            </div>
kermitt2 commented 3 years ago

Also good for https://github.com/DataSeer/dataseer-web/issues/461