What steps will reproduce the problem?
1. Take a PDF file produced by scanning a document. Many copy machines can now
scan directly to PDF.
2. Run tesseract on it.
What is the expected output? What do you see instead?
Tesseract fails because the input format is unsupported. It would be nice if
tesseract could read PDF files directly.
What version of the product are you using? On what operating system?
3.03 RC on Linux
Please provide any additional information below.
Clearly, one can work around the lack of PDF input support by converting the
input to TIFF before running tesseract, e.g. with ImageMagick. However, this
often degrades the image, because the actual resolution of the image embedded
in the PDF file is generally unknown and the conversion has to guess a
rendering density. It would be better to have this handled internally in
tesseract. Furthermore, some copy machines produce PDFs in which each page
contains multiple images: the copier segments the page and stores different
parts of it at different resolutions. It would be nice if tesseract used the
full available resolution of each part in these cases.
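
To illustrate the resolution problem and a possible stopgap, here is a minimal sketch, not a proposal for how tesseract should do it. It assumes poppler's `pdfimages` tool and the `tesseract` command are on the PATH, and that `pdfimages -tiff` names its output `<prefix>-NNN.tif`. The helper names (`effective_dpi`, `pdfimages_command`, `tesseract_command`, `ocr_pdf`) are made up for this example. Extracting the embedded images as stored avoids resampling, but on segmented pages it yields one fragment per embedded image rather than a whole page.

```python
import subprocess
from pathlib import Path
from typing import List

POINTS_PER_INCH = 72.0  # PDF user-space units per inch


def effective_dpi(pixel_width: int, placed_width_pt: float) -> float:
    """Resolution of an embedded image, given its pixel width and the
    width it occupies on the page in PDF points."""
    if placed_width_pt <= 0:
        raise ValueError("placed width must be positive")
    return pixel_width * POINTS_PER_INCH / placed_width_pt


def pdfimages_command(pdf: str, prefix: str) -> List[str]:
    # Writes the embedded images as TIFF files without re-rendering the page.
    return ["pdfimages", "-tiff", pdf, prefix]


def tesseract_command(image: str, outbase: str) -> List[str]:
    # tesseract writes its text result to <outbase>.txt
    return ["tesseract", image, outbase]


def ocr_pdf(pdf: str, workdir: str) -> List[Path]:
    """Extract every embedded image and OCR each one separately."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf, str(out / "img")), check=True)
    results = []
    # Assumption: extracted files are named img-000.tif, img-001.tif, ...
    for image in sorted(out.glob("img-*.tif")):
        outbase = str(image.with_suffix(""))
        subprocess.run(tesseract_command(str(image), outbase), check=True)
        results.append(Path(outbase + ".txt"))
    return results
```

For example, a scan that is 2550 pixels wide and fills an 8.5 inch (612 pt) page has `effective_dpi(2550, 612) == 300.0`. A converter that renders the page at some other density resamples the scan, which is the degradation described above.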
Original issue reported on code.google.com by sergio.c...@gmail.com on 10 Jun 2015 at 9:46