kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Data availability extraction failure use cases #1187

Open lfoppiano opened 1 month ago

lfoppiano commented 1 month ago

In this case, noisy page-footer text is wrongly captured in the data availability statement (DAS).

<div type="availability">
    <div xmlns="http://www.tei-c.org/ns/1.0">
        <head>DATA AVAILABILITY</head>
        <p>The genome assembly was uploaded to NCBI Genbank under the project number PRJNA1075679, with the genome accession number JBAGRT000000000. Research Article Microbiology Spectrum October 2024 Volume 12 Issue 10 10.1128/spectrum.00751-24 9 Downloaded from 
            <ref type="url" target="https://journals.asm.org/journal/spectrum">https://journals.asm.org/journal/spectrum</ref> on 10 October 2024 by 2a01:e0a:d12:4600:6f65:dae8:67:55e1.
        </p>
    </div>
</div>

PDF (CC-BY): 3_10.1128_spectrum.00751-24.pdf
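To make this failure easy to spot across a batch of outputs, here is a minimal sketch that pulls the paragraphs out of the `availability` div and flags ones that look contaminated by footer text. The regular expression and the function names (`availability_paragraphs`, `has_footer_noise`) are my own assumptions for illustration, not part of GROBID, and the sample is an abridged copy of the output above.

```python
import re
import xml.etree.ElementTree as ET

# Assumed heuristic (not from GROBID): phrases typical of publisher footers.
FOOTER_PATTERN = re.compile(r"Downloaded from\b|\bVolume \d+ Issue \d+\b")


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def availability_paragraphs(tei_xml):
    """Return the whitespace-normalized text of each <p> inside an availability div."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for div in root.iter():
        if local_name(div.tag) == "div" and div.get("type") == "availability":
            for p in div.iter():
                if local_name(p.tag) == "p":
                    # itertext() also collects the text of nested <ref> elements.
                    paragraphs.append(" ".join("".join(p.itertext()).split()))
    return paragraphs


def has_footer_noise(text):
    """True if the text matches the assumed footer pattern."""
    return FOOTER_PATTERN.search(text) is not None


# Abridged version of the TEI fragment reported above.
SAMPLE = (
    '<div type="availability"><div xmlns="http://www.tei-c.org/ns/1.0">'
    "<head>DATA AVAILABILITY</head>"
    "<p>The genome assembly was uploaded to NCBI Genbank. Research Article "
    "Microbiology Spectrum October 2024 Volume 12 Issue 10 Downloaded from "
    '<ref type="url" target="https://journals.asm.org/journal/spectrum">'
    "https://journals.asm.org/journal/spectrum</ref> on 10 October 2024.</p>"
    "</div></div>"
)

for paragraph in availability_paragraphs(SAMPLE):
    print(has_footer_noise(paragraph), paragraph[:60])
```

This only detects the problem after the fact; it says nothing about where in the segmentation the footer should have been filtered out.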

lfoppiano commented 4 weeks ago

Another example: the DAS is truncated at the page break.

<div type="availability">
    <div xmlns="http://www.tei-c.org/ns/1.0">
        <head>DATA AVAILABILITY</head>
        <p>Data was provided and stored by MOH and CIHI as above. The dataset from this study is held securely in coded form at ICES. While legal data sharing agreements between ICES and data providers (e.g., healthcare organizations and government) prohibit ICES from making the dataset publicly available, access may be granted to those who meet</p>
    </div>
</div>

PDF (CC-BY): 4_10.1128_spectrum.02630-23.pdf
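A truncation like this one could be flagged with a very rough check: does the extracted paragraph end with sentence-final punctuation? The sketch below is only a heuristic under that assumption, and `looks_truncated` is a hypothetical helper name. It will miss truncations that happen to fall on a sentence boundary and may misfire on statements that legitimately end with a URL or an accession number.

```python
def looks_truncated(paragraph):
    """Heuristic: a non-empty paragraph that does not end with '.', '!' or '?'."""
    text = paragraph.rstrip()
    # Ignore closing quotes and brackets that may follow the final punctuation.
    text = text.rstrip("\"')]\u201d\u2019")
    return bool(text) and text[-1] not in ".!?"


# Tail of the paragraph reported above, which stops mid-sentence.
truncated_tail = "access may be granted to those who meet"
# Tail of the first example's statement, before the footer noise.
complete_tail = "with the genome accession number JBAGRT000000000."

print(looks_truncated(truncated_tail))
print(looks_truncated(complete_tail))
```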

lfoppiano commented 1 week ago

Here are two more cases from Nature: