kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.59k stars, 459 forks

URLs where the regex captures less than the annotations are not consolidated with the clickable links from the PDF document #1191

Open lfoppiano opened 4 weeks ago

lfoppiano commented 4 weeks ago

In this document, the abundance of spaces in the middle of the extracted URL causes our regex to fall short. The PDF annotations are correct, but we somehow do not extend the match beyond the URL initially extracted by the regex.

The result is quite messy:

            <div type="acknowledgement">
                <div>
                    <head>Acknowledgements</head>
                    [...] We thank 
                        <rs type="person">Mr. Tetsuo Kishi</rs> from the 
                        <rs type="affiliation">Department of Medicine, Kyushu University School of Medicine</rs> for the immunohistochemical analysis. We thank 
                        <rs type="person">J. Ludovic Croxford, PhD</rs>, from Edanz (
                        <ref type="url" target="https://jp">https:// jp</ref>. edanz. com/ ac) for editing a draft of this manuscript.
                    </p>
                </div>

PDF: 10_10.1038_s41598-021-96064-6.pdf

I've already drafted a fix with PR: #1190
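To illustrate the idea of extending the regex match using the PDF annotation, here is a minimal sketch in Python. It is not the code from the PR: the `Token` class, the `extend_url_span` function, and the token segmentation are hypothetical, and the annotation destination `https://jp.edanz.com/ac` is assumed from the text shown above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    """Hypothetical layout token; not GROBID's actual class."""
    text: str
    # Destination URI of the PDF link annotation covering this token, if any.
    annotation: Optional[str] = None


def extend_url_span(
    tokens: List[Token], start: int, end: int
) -> Tuple[int, int, Optional[str]]:
    """Extend the regex match tokens[start:end] to the right while the
    following tokens are covered by the same annotation."""
    destinations = {t.annotation for t in tokens[start:end] if t.annotation}
    if len(destinations) != 1:
        # No annotation, or an ambiguous one: keep the regex result as is.
        return start, end, None
    dest = next(iter(destinations))
    while end < len(tokens) and tokens[end].annotation == dest:
        end += 1
    return start, end, dest


# Assumed destination of the clickable link in the PDF.
DEST = "https://jp.edanz.com/ac"
texts = ["(", "https://", " ", "jp", ".", " ", "edanz", ".", " ",
         "com", "/", " ", "ac", ")"]
# Only the tokens between the parentheses are covered by the annotation.
tokens = [
    Token(t, DEST if 1 <= i <= 12 else None) for i, t in enumerate(texts)
]

# Suppose the regex only captured "https:// jp", i.e. tokens[1:4].
start, end, dest = extend_url_span(tokens, 1, 4)
extended_text = "".join(t.text for t in tokens[start:end])
```

With the extended span, the whole `https:// jp. edanz. com/ ac` text could be wrapped in a single `<ref>` whose `target` is the annotation destination rather than the truncated regex capture.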

lfoppiano commented 4 weeks ago

I have another case that seems to happen only on Linux, not on Mac. The master version on my Mac works fine on this document, but none of the deployed Docker instances do. 🤔

PDF (CC-BY): 2_10.1128_spectrum.00536-24.pdf

Mac (master):

<div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head>Genome binning</head>
                <p>The sequencing depth of each contig was calculated using the functional script "jgi_summarize_bam_contig_depths", a tool of the MetaBAT2 (v.2.12.1) package 
                    <ref type="bibr" target="#b46">(46)</ref>, based on the sorted BAM files generated by using BWA-MEM (v.0.7.17; 
                    <ref type="url" target="http://biobwa.sourceforge.net/">http:// biobwa.sourceforge.net/</ref>) and SAMtools (v1.546; 
                    <ref type="url" target="http://www.htslib.org/">http://www.htslib.org/</ref>). MetaBAT2 was applied to bin the assemblies with contig depth results under the default parameters (minimum contig length ≥ 1500 bp). CheckM v.1.0.3 (
                    <ref type="url" target="https://ecogenomics.github.io/CheckM/">https://ecogenom ics.github.io/CheckM/</ref>) with the lineage_wf workflow was used to estimate the complete ness and contamination of MAGs 
                    <ref type="bibr" target="#b47">(47)</ref>. Dereplication based on the average nucleotide identity &gt;95% was performed using dRep (v2.3.2; parameter: -pa 0.95 -sa 0.99) 
                    <ref type="bibr" target="#b48">(48)</ref>. The "classify_wf" function of the GTDB Toolkit (GTDB-Tk, version r214; 
                    <ref type="url" target="https://gtdb.ecogenomic.org/">https://gtdb.eco genomic.org/</ref>) was introduced to obtain taxonomic information for each MAG. The amino acid sequences encoded by each MAG were also functionally annotated through comparison against the KEGG database.
                </p>
            </div>

Linux (0.8.1):

<div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head>Genome binning</head>
                <p>The sequencing depth of each contig was calculated using the functional script "jgi_summarize_bam_contig_depths", a tool of the MetaBAT2 (v.2.12.1) package 
                    <ref type="bibr" target="#b46">(46)</ref>, based on the sorted BAM files generated by using BWA-MEM (v.0.7.17; 
                    <ref type="url" target="http://biobwa.sourceforge.net/">http:// biobwa.sourceforge.net/</ref>) and SAMtools (v1.546; 
                    <ref type="url" target="http://www.htslib.org/">http://www.htslib.org/</ref>). MetaBAT2 was applied to bin the assemblies with contig depth results under the default parameters (minimum contig length ≥ 1500 bp). CheckM v.1.0.3 (
                    <ref type="url" target="https://ecogenom">https://ecogenom</ref> ics.github.io/CheckM/) with the lineage_wf workflow was used to estimate the complete ness and contamination of MAGs 
                    <ref type="bibr" target="#b47">(47)</ref>. Dereplication based on the average nucleotide identity &gt;95% was performed using dRep (v2.3.2; parameter: -pa 0.95 -sa 0.99) 
                    <ref type="bibr" target="#b48">(48)</ref>. The "classify_wf" function of the GTDB Toolkit (GTDB-Tk, version r214; 
                    <ref type="url" target="https://gtdb.eco">https://gtdb.eco</ref> genomic.org/) was introduced to obtain taxonomic information for each MAG. The amino acid sequences encoded by each MAG were also functionally annotated through comparison against the KEGG database.
                </p>
            </div>

Linux (branch: )

<div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head>Genome binning</head>
                <p>The sequencing depth of each contig was calculated using the functional script "jgi_summarize_bam_contig_depths", a tool of the MetaBAT2 (v.2.12.1) package 
                    <ref type="bibr" target="#b46">(46)</ref>, based on the sorted BAM files generated by using BWA-MEM (v.0.7.17; 
                    <ref type="url" target="http://biobwa.sourceforge.net/">http:// biobwa.sourceforge.net/</ref>) and SAMtools (v1.546; 
                    <ref type="url" target="http://www.htslib.org/">http://www.htslib.org/</ref>). MetaBAT2 was applied to bin the assemblies with contig depth results under the default parameters (minimum contig length ≥ 1500 bp). CheckM v.1.0.3 (
                    <ref type="url" target="https://ecogenom">https://ecogenom</ref> ics.github.io/CheckM/) with the lineage_wf workflow was used to estimate the complete ness and contamination of MAGs 
                    <ref type="bibr" target="#b47">(47)</ref>. Dereplication based on the average nucleotide identity &gt;95% was performed using dRep (v2.3.2; parameter: -pa 0.95 -sa 0.99) 
                    <ref type="bibr" target="#b48">(48)</ref>. The "classify_wf" function of the GTDB Toolkit (GTDB-Tk, version r214; 
                    <ref type="url" target="https://gtdb.eco">https://gtdb.eco</ref> genomic.org/) was introduced to obtain taxonomic information for each MAG. The amino acid sequences encoded by each MAG were also functionally annotated through comparison against the KEGG database.
                </p>
            </div>

lfoppiano commented 3 weeks ago

I've discovered that more than one PDF annotation may overlap a given token, so I had to tighten the leash when looking up the PDF annotation. When more than one annotation matches, I add a match on the token text, which is extended recursively with the n-1, n-2, ... tokens for as long as more than one annotation still matches.
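The disambiguation described above could look roughly like the following sketch. It is only an illustration under stated assumptions: `pick_annotation` is a hypothetical helper, the loop is iterative rather than recursive, and comparing the token context against the annotation destination by substring is an assumption, not necessarily how the actual fix matches.

```python
from typing import List, Optional


def pick_annotation(
    tokens: List[str], n: int, candidates: List[str]
) -> Optional[str]:
    """Choose among the annotation destinations overlapping token n.

    While more than one candidate remains, the context is widened to
    include tokens n-1, n-2, ... and candidates whose destination does
    not contain that context are dropped.
    """
    remaining = list(candidates)
    lookback = 0
    while len(remaining) > 1 and lookback <= n:
        # Join the tokens from n-lookback to n, ignoring layout spaces.
        context = "".join(tokens[n - lookback : n + 1]).replace(" ", "")
        narrowed = [dest for dest in remaining if context in dest]
        if not narrowed:
            break  # widening removed everything: stay ambiguous
        remaining = narrowed
        lookback += 1
    return remaining[0] if len(remaining) == 1 else None


tokens = ["https://", "ecogenom", "ics", ".", "github", ".", "io"]
candidates = [
    "https://ecogenomics.github.io/CheckM/",
    "https://gtdb.ecogenomic.org/",
]
# Token 1 ("ecogenom") appears in both destinations; adding token 0
# ("https://") leaves only the first one.
chosen = pick_annotation(tokens, 1, candidates)
```

If the candidates cannot be separated even with all preceding tokens, the sketch returns `None` so that the caller can fall back to the plain regex result.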

PDF documents are a mess :-)