Open lfoppiano opened 4 weeks ago
I have another case that seems to happen only on Linux, not on Mac. The master version on my Mac handles this document correctly, but none of the deployed Docker instances do. 🤔
PDF (CC-BY): 2_10.1128_spectrum.00536-24.pdf
Mac (master):
<div
xmlns="http://www.tei-c.org/ns/1.0">
<head>Genome binning</head>
<p>The sequencing depth of each contig was calculated using the functional script "jgi_summarize_bam_contig_depths", a tool of the MetaBAT2 (v.2.12.1) package
<ref type="bibr" target="#b46">(46)</ref>, based on the sorted BAM files generated by using BWA-MEM (v.0.7.17;
<ref type="url" target="http://biobwa.sourceforge.net/">http:// biobwa.sourceforge.net/</ref>) and SAMtools (v1.546;
<ref type="url" target="http://www.htslib.org/">http://www.htslib.org/</ref>). MetaBAT2 was applied to bin the assemblies with contig depth results under the default parameters (minimum contig length ≥ 1500 bp). CheckM v.1.0.3 (
<ref type="url" target="https://ecogenomics.github.io/CheckM/">https://ecogenom ics.github.io/CheckM/</ref>) with the lineage_wf workflow was used to estimate the complete ness and contamination of MAGs
<ref type="bibr" target="#b47">(47)</ref>. Dereplication based on the average nucleotide identity >95% was performed using dRep (v2.3.2; parameter: -pa 0.95 -sa 0.99)
<ref type="bibr" target="#b48">(48)</ref>. The "classify_wf" function of the GTDB Toolkit (GTDB-Tk, version r214;
<ref type="url" target="https://gtdb.ecogenomic.org/">https://gtdb.eco genomic.org/</ref>) was introduced to obtain taxonomic information for each MAG. The amino acid sequences encoded by each MAG were also functionally annotated through comparison against the KEGG database.
</p>
</div>
Linux (0.8.1):
<div
xmlns="http://www.tei-c.org/ns/1.0">
<head>Genome binning</head>
<p>The sequencing depth of each contig was calculated using the functional script "jgi_summarize_bam_contig_depths", a tool of the MetaBAT2 (v.2.12.1) package
<ref type="bibr" target="#b46">(46)</ref>, based on the sorted BAM files generated by using BWA-MEM (v.0.7.17;
<ref type="url" target="http://biobwa.sourceforge.net/">http:// biobwa.sourceforge.net/</ref>) and SAMtools (v1.546;
<ref type="url" target="http://www.htslib.org/">http://www.htslib.org/</ref>). MetaBAT2 was applied to bin the assemblies with contig depth results under the default parameters (minimum contig length ≥ 1500 bp). CheckM v.1.0.3 (
<ref type="url" target="https://ecogenom">https://ecogenom</ref> ics.github.io/CheckM/) with the lineage_wf workflow was used to estimate the complete ness and contamination of MAGs
<ref type="bibr" target="#b47">(47)</ref>. Dereplication based on the average nucleotide identity >95% was performed using dRep (v2.3.2; parameter: -pa 0.95 -sa 0.99)
<ref type="bibr" target="#b48">(48)</ref>. The "classify_wf" function of the GTDB Toolkit (GTDB-Tk, version r214;
<ref type="url" target="https://gtdb.eco">https://gtdb.eco</ref> genomic.org/) was introduced to obtain taxonomic information for each MAG. The amino acid sequences encoded by each MAG were also functionally annotated through comparison against the KEGG database.
</p>
</div>
Linux (branch: )
<div
xmlns="http://www.tei-c.org/ns/1.0">
<head>Genome binning</head>
<p>The sequencing depth of each contig was calculated using the functional script "jgi_summarize_bam_contig_depths", a tool of the MetaBAT2 (v.2.12.1) package
<ref type="bibr" target="#b46">(46)</ref>, based on the sorted BAM files generated by using BWA-MEM (v.0.7.17;
<ref type="url" target="http://biobwa.sourceforge.net/">http:// biobwa.sourceforge.net/</ref>) and SAMtools (v1.546;
<ref type="url" target="http://www.htslib.org/">http://www.htslib.org/</ref>). MetaBAT2 was applied to bin the assemblies with contig depth results under the default parameters (minimum contig length ≥ 1500 bp). CheckM v.1.0.3 (
<ref type="url" target="https://ecogenom">https://ecogenom</ref> ics.github.io/CheckM/) with the lineage_wf workflow was used to estimate the complete ness and contamination of MAGs
<ref type="bibr" target="#b47">(47)</ref>. Dereplication based on the average nucleotide identity >95% was performed using dRep (v2.3.2; parameter: -pa 0.95 -sa 0.99)
<ref type="bibr" target="#b48">(48)</ref>. The "classify_wf" function of the GTDB Toolkit (GTDB-Tk, version r214;
<ref type="url" target="https://gtdb.eco">https://gtdb.eco</ref> genomic.org/) was introduced to obtain taxonomic information for each MAG. The amino acid sequences encoded by each MAG were also functionally annotated through comparison against the KEGG database.
</p>
</div>
I've discovered that more than one PDF annotation may overlap a given token, so I had to tighten the lookup. When more than one annotation matches, I add a match on the token text, and widen it recursively with the n-1, n-2, ... tokens for as long as more than one annotation is still matching.
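To make the tightening concrete, here is a minimal sketch in Python (not the actual GROBID code, which is Java). The `Annotation` class and `pick_annotation` function are hypothetical names, and the sketch assumes that "matching" means the whitespace-free token text occurs in the annotation's destination URL. It uses a loop rather than recursion, but widens the context in the same n-1, n-2 order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-in for a PDF link annotation."""
    destination: str  # target URL stored in the PDF annotation


def pick_annotation(tokens: List[str], index: int,
                    candidates: List[Annotation]) -> Optional[Annotation]:
    """Return the single candidate whose destination contains the token at
    `index`, widening the context with tokens n-1, n-2, ... while ambiguous."""
    start = index
    while True:
        # Join the current window without whitespace, since PDF extraction
        # may insert spurious spaces inside a URL.
        context = "".join(t.strip() for t in tokens[start:index + 1])
        matching = [a for a in candidates if context in a.destination]
        if len(matching) == 1:
            return matching[0]
        if not matching or start == 0:
            # Nothing matches, or there is no more context left to add.
            return None
        start -= 1  # still ambiguous: add the preceding token


checkm = Annotation("https://ecogenomics.github.io/CheckM/")
gtdb = Annotation("https://gtdb.ecogenomic.org/")
tokens = ["gtdb", ".", "eco", "genomic", ".", "org"]

# "eco" alone occurs in both destinations; adding the preceding "." gives
# ".eco", which only occurs in the GTDB destination.
chosen = pick_annotation(tokens, 2, [checkm, gtdb])
```

If the widened context stops matching every candidate, the sketch gives up and returns `None`; a real implementation would probably need a fallback at that point.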
PDF documents are a mess :-)
In this document, the abundance of spaces in the middle of the extracted URLs means that our regex falls short. The annotations are correct, but we somehow do not extend the match beyond the URL initially extracted by the regex.
The result is quite messy:
PDF: 10_10.1038_s41598-021-96064-6.pdf
I've already drafted a fix with PR: #1190
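For illustration only, and not the code from the PR: a rough Python sketch of how a regex match that stops at a spurious space could be extended using the annotation's destination. The function name `extend_url` and the token layout are assumptions. The idea is to keep consuming tokens while the whitespace-free text remains a prefix of the destination.

```python
from typing import List, Tuple


def extend_url(tokens: List[str], start: int, end: int,
               destination: str) -> Tuple[int, str]:
    """Extend a regex match covering tokens[start:end] for as long as the
    whitespace-free text remains a prefix of the annotation destination.

    Returns the new exclusive end index and the reconstructed URL text."""
    def squash(parts: List[str]) -> str:
        # Drop every whitespace character, including spaces inside a token.
        return "".join("".join(p.split()) for p in parts)

    matched = squash(tokens[start:end])
    if not destination.startswith(matched):
        # The annotation disagrees with the regex result: leave it untouched.
        return end, matched

    best_end = end
    for i in range(end, len(tokens)):
        candidate = squash(tokens[start:i + 1])
        if not destination.startswith(candidate):
            break
        if len(candidate) > len(matched):
            # Only move the end when the token added real characters, so
            # trailing whitespace tokens are not swallowed into the URL.
            matched, best_end = candidate, i + 1
    return best_end, matched


tokens = ["https://gtdb.eco", " ", "genomic.org/", ")", " ", "was"]
# The regex stopped after the first token, i.e. tokens[0:1].
new_end, url = extend_url(tokens, 0, 1, "https://gtdb.ecogenomic.org/")
```

Here the match grows from `https://gtdb.eco` to the full destination and stops before the closing parenthesis. The sketch assumes the visible text and the destination agree character for character once whitespace is removed, which will not hold for every PDF.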