Enhancement: please provide pdf to pdf ocr

What steps will reproduce the problem?

1. Take a PDF file obtained by scanning a document. Many copy machines can now 
produce PDF directly.
2. Try to run tesseract on it.

What is the expected output? What do you see instead?

Tesseract fails because the input format is unsupported. It would be nice if 
tesseract could read PDF directly.

What version of the product are you using? On what operating system?

3.03 RC on Linux

Please provide any additional information below.

Clearly, one can work around the lack of PDF input support by converting the 
input to TIFF before starting tesseract, e.g. with ImageMagick. However, this 
often degrades the image, because the actual resolution of the image embedded 
in the PDF file is generally unknown, so the rasterization density has to be 
guessed. It would be better to have this handled internally in tesseract. 
Furthermore, some copy machines produce PDFs in which each page contains 
multiple images: the copier segments the page and stores different parts of it 
at different resolutions. It would be nice if tesseract used the full 
resolution of each embedded image in these cases.
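
For reference, the convert-then-OCR workaround described above can be sketched roughly as follows. This is only an illustration, not something tesseract provides: the helper names `build_commands` and `ocr_pdf` are made up for this example, the 300 dpi default is an arbitrary guess (which is exactly the problem), and it assumes the ImageMagick `convert` and `tesseract` binaries are on the PATH and that the tesseract build supports the `pdf` output config.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_base, dpi=300):
    """Return the two command lines of the workaround (hypothetical helper)."""
    tiff_path = f"{out_base}.tif"
    # ImageMagick rasterizes the PDF at a caller-chosen density. The true
    # resolution of the embedded scan is unknown, so `dpi` is only a guess.
    convert_cmd = ["convert", "-density", str(dpi), str(pdf_path), tiff_path]
    # The trailing "pdf" config asks tesseract to write <out_base>.pdf.
    ocr_cmd = ["tesseract", tiff_path, str(out_base), "pdf"]
    return convert_cmd, ocr_cmd


def ocr_pdf(pdf_path, out_base, dpi=300):
    """Run both steps in order and return the expected output path."""
    for cmd in build_commands(pdf_path, out_base, dpi):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"required tool not found: {cmd[0]}")
        subprocess.run(cmd, check=True)
    return Path(f"{out_base}.pdf")
```

Calling `ocr_pdf("scan.pdf", "scan_ocr")` would rasterize every page at the guessed density, regardless of the resolution at which each embedded image was actually stored, which is the degradation this request is about.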

Original issue reported on code.google.com by sergio.c...@gmail.com on 10 Jun 2015 at 9:46

gxrxrdx / tesseract-ocr

Enhancement: please provide pdf to pdf ocr #1486