petermr / norma

Convert XML/SVG/PDF into normalised, sectioned, scholarly HTML
Apache License 2.0

pdf2text transform quality of output is okay but not as good as pdftotext #15

Open rossmounce opened 9 years ago

rossmounce commented 9 years ago

Output from the pdf2text transform (from norma) is reasonable but not brilliant. It's also quite slow tbh. Given that pdftotext (Poppler) is already pre-installed in the workshop VM, would it perhaps be better just to call that instead?

e.g. for this PeerJ article PDF: https://peerj.com/articles/900/

Compare output from pdftotext (Poppler):

ABSTRACT

Submitted 10 December 2014
Accepted 30 March 2015
Published 16 April 2015
Corresponding authors
Matthew B. Hufford,
mhufford@iastate.edu
Jeffrey Ross-Ibarra,
rossibarra@ucdavis.edu
Academic editor
Todd Vision
Additional Information and
Declarations can be found on
page 16

The teosinte branched1(tb1) gene is a major QTL controlling branching differences
between maize and its wild progenitor, teosinte. The insertion of a transposable
element (Hopscotch) upstream of tb1 is known to enhance the gene’s expression,
causing reduced tillering in maize. Observations of the maize tb1 allele in teosinte and
estimates of an insertion age of the Hopscotch that predates domestication led us to
investigate its prevalence and potential role in teosinte. We assessed the prevalence
of the Hopscotch element across an Americas-wide sample of 837 maize and teosinte
individuals using a co-dominant PCR assay. Additionally, we calculated population
genetic summaries using sequence data from a subset of individuals from four
teosinte populations and collected phenotypic data using seed from a single teosinte
population where Hopscotch was found segregating at high frequency. Genotyping
results indicate the Hopscotch element is found in a number of teosinte populations
and linkage disequilibrium near tb1 does not support recent introgression from
maize. Population genetic signatures are consistent with selection on the tb1 locus,
revealing a potential ecological role, but a greenhouse experiment does not detect
a strong association between the Hopscotch and tillering in teosinte. Our findings

to output from pdf2text (norma --transform), which unfortunately muddles the order of lines:

ABSTRACT
The teosinte branched1(tb1) gene is a major QTL controlling branching differences
between maize and its wild progenitor, teosinte. The insertion of a transposable
element (Hopscotch) upstream of tb1 is known to enhance the gene’s expression,
causing reduced tillering in maize. Observations of the maize tb1 allele in teosinte and
estimates of an insertion age of the Hopscotch that predates domestication led us to
investigate its prevalence and potential role in teosinte. We assessed the prevalence
of the Hopscotch element across an Americas-wide sample of 837 maize and teosinte
individuals using a co-dominant PCR assay. Additionally, we calculated population
genetic summaries using sequence data from a subset of individuals from four
teosinte populations and collected phenotypic data using seed from a single teosinte
population where Hopscotch was found segregating at high frequency. Genotyping
results indicate the Hopscotch element is found in a number of teosinte populations
and linkage disequilibrium near tb1 does not support recent introgression from
Submitted 10 December 2014 maize. Population genetic signatures are consistent with selection on the tb1 locus,
Accepted 30 March 2015 revealing a potential ecological role, but a greenhouse experiment does not detect
Published 16 April 2015
a strong association between the Hopscotch and tillering in teosinte. Our findings
Corresponding authors
Matthew B. Hu ord, suggest the role of Hopscotch differs between maize and teosinte. Future work shouldff
mhufford@iastate.edu assess tb1 expression levels in teosinte with and without the Hopscotch and more
Jeffrey Ross-Ibarra, comprehensively phenotype teosinte to assess the ecological significance of the
rossibarra@ucdavis.edu
Hopscotch insertion and, more broadly, the tb1 locus in teosinte.
Academic editor
Todd Vision
Subjects Agricultural Science, Ecology, Evolutionary Studies, Genetics
Additional Information and Keywords Transposable element, Domestication, Teosinte, Teosinte branched1, Maize
Declarations can be found on
page 16
DOI 10.7717/peerj.900 INTRODUCTION
Copyright Domesticated crops and their wild progenitors provide an excellent system in which
2015 Vann et al. to study adaptation and genomic changes associated with human-mediated selection
Distributed under (Ross-Ibarra, Morrell & Gaut, 2007). Plant domestication usually involves a suite of
Creative Commons CC-BY 4.0 phenotypic changes such as loss of seed shattering and increased fruit or grain size, which
OPEN ACCESS are commonly referred to as the ‘domestication syndrome’ (Olsen & Wendel, 2013), and

petermr commented 9 years ago

PDF2Text from PDFBox is pure Java, so it is reliable to run. External executables all have the problem that they must be run as forked processes, which can give problems such as buffer overruns. They are generally more work to run.

Also, when we come to distribute the software, either the installer has to install all these external programs as well, or we have to resort to JNI, which has given us problems in the past.
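
As a rough illustration of the trade-off discussed here, the sketch below shows one way to call pdftotext as an external process and time it. It is written in Python rather than norma's Java, purely for brevity; `timed_run` and `pdf_to_text` are hypothetical helper names, not part of norma or Poppler. The only pdftotext behaviour assumed is that passing `-` as the output file writes the text to stdout. `subprocess.run` reads the child's output pipes to completion itself, which is the part that tends to need care when forking processes by hand.

```python
import subprocess
import sys
import time


def timed_run(cmd, timeout=120):
    """Run an external command; return (stdout text, elapsed seconds)."""
    start = time.perf_counter()
    # capture_output=True makes subprocess.run read stdout and stderr to
    # completion, so the child cannot block on a full pipe.
    # check=True raises CalledProcessError on a non-zero exit status.
    result = subprocess.run(
        cmd,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        timeout=timeout,
        check=True,
    )
    return result.stdout, time.perf_counter() - start


def pdf_to_text(pdf_path):
    """Extract text with Poppler's pdftotext (assumed to be on PATH)."""
    # "-" as the output file name asks pdftotext to write to stdout.
    return timed_run(["pdftotext", pdf_path, "-"])


if __name__ == "__main__" and len(sys.argv) > 1:
    text, seconds = pdf_to_text(sys.argv[1])
    print(f"{len(text.split())} words in {seconds:.2f} s")
```

The same helper could time any other converter by passing its command line as a list, which would put a number on how much slower one tool is than another for a given PDF.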

PDF-to-text conversion of any variety should only be used for extracting words, not sentences. (What happened to the OPEN ACCESS box under Poppler?) And how much slower is "slower"?
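
One way to make the words-versus-sentences distinction concrete is to compare two extractions as bags of words, so that line order stops mattering. The following is a minimal Python sketch; `word_counts` and `word_diff` are hypothetical names, and the `\w+` tokenisation is a simplifying assumption rather than anything norma does.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased word tokens, ignoring line order and layout."""
    return Counter(re.findall(r"\w+", text.lower()))


def word_diff(text_a, text_b):
    """Return (words only in text_a, words only in text_b) as Counters."""
    a, b = word_counts(text_a), word_counts(text_b)
    # Counter subtraction keeps only positive counts, i.e. the surplus words.
    return a - b, b - a


if __name__ == "__main__":
    # Fragments of the two outputs quoted earlier in this issue.
    poppler = "Matthew B. Hufford,\nmhufford@iastate.edu"
    norma = (
        "Matthew B. Hu ord, suggest the role of Hopscotch differs between "
        "maize and teosinte. Future work shouldff\nmhufford@iastate.edu"
    )
    only_poppler, only_norma = word_diff(poppler, norma)
    print(sorted(only_poppler), sorted(only_norma))
```

A comparison like this reports the split "ff" ligature in "Hufford" as a word-level difference, while the interleaved sidebar lines on their own do not count against either tool.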

Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069