Data availability extraction failure use cases

lfoppiano commented 1 month ago

In this case, the noisy page footers are wrongly captured in the data availability statement (DAS).

<div type="availability">
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>DATA AVAILABILITY</head>
                    <p>The genome assembly was uploaded to NCBI Genbank under the project number PRJNA1075679, with the genome accession number JBAGRT000000000. Research Article Microbiology Spectrum October 2024 Volume 12 Issue 10 10.1128/spectrum.00751-24 9 Downloaded from 
                        <ref type="url" target="https://journals.asm.org/journal/spectrum">https://journals.asm.org/journal/spectrum</ref> on 10 October 2024 by 2a01:e0a:d12:4600:6f65:dae8:67:55e1.
                    </p>
                </div>
            </div>

PDF (CC-BY): 3_10.1128_spectrum.00751-24.pdf
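
A rough sketch of how this kind of footer leakage could be flagged automatically in the TEI output. It is not part of GROBID: the function names and the two regular expressions are hypothetical, picked only from the footer text visible in the example above, and would need tuning for other publishers.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns, derived only from the footer text in the example above.
FOOTER_PATTERNS = [
    re.compile(r"Downloaded from", re.IGNORECASE),
    re.compile(r"\bVolume \d+ Issue \d+\b"),
]


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def availability_paragraphs(tei_xml: str) -> list[str]:
    """Return the whitespace-normalised text of every <p> found inside a
    <div type="availability">, whether or not the elements are namespaced."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for div in root.iter():
        if local_name(div.tag) == "div" and div.get("type") == "availability":
            for p in div.iter():
                if local_name(p.tag) == "p":
                    paragraphs.append(" ".join("".join(p.itertext()).split()))
    return paragraphs


def has_footer_noise(text: str) -> bool:
    """True if any of the assumed footer patterns occurs in the text."""
    return any(pattern.search(text) for pattern in FOOTER_PATTERNS)
```

Running `has_footer_noise` over the output of `availability_paragraphs` would mark the paragraph above as suspicious because it contains "Downloaded from". A check like this only helps collect failing documents. It does not fix the extraction.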

lfoppiano commented 4 weeks ago

Another example: the DAS is truncated at the page break.

        <div type="availability">
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>DATA AVAILABILITY</head>
                    <p>Data was provided and stored by MOH and CIHI as above. The dataset from this study is held securely in coded form at ICES. While legal data sharing agreements between ICES and data providers (e.g., healthcare organizations and government) prohibit ICES from making the dataset publicly available, access may be granted to those who meet</p>
                </div>
            </div>

PDF (CC-BY): 4_10.1128_spectrum.02630-23.pdf
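
A minimal heuristic for spotting truncated statements like this one, assuming that a complete DAS paragraph ends with sentence-final punctuation. The function name is hypothetical, and the rule will produce false positives for statements that legitimately end with a URL or an accession number.

```python
def looks_truncated(paragraph: str) -> bool:
    """Heuristic: a non-empty paragraph that does not end with sentence-final
    punctuation (ignoring trailing quotes or brackets) was probably cut off."""
    text = paragraph.strip().rstrip("\"')]")
    return bool(text) and not text.endswith((".", "!", "?"))
```

Applied to the paragraph above, which ends with "those who meet", the check returns `True`.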

lfoppiano commented 1 week ago

Here are three more cases from Nature:

s11084-024-09647-4.pdf: the DAS is missed
s41588-024-01785-9.pdf: both code and data availability statements are present, but only one is extracted, and there are many header/footer URLs in the middle
s41467-024-52091-1.pdf: the DAS is truncated

kermitt2 / grobid

Data availability extraction failure use cases #1187