Open jbothma opened 8 years ago
Definitely. I think it would be relatively straightforward to integrate. I would suggest building the text insertion into the Page class and then putting an export_to_pdf() method on the Document class.
Would you be interested in contributing @jbothma ?
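A rough sketch of how that Page/Document split might look. The class bodies, the `ocr_to_pdf` callable and the output layout here are assumptions for illustration only, not doc2text's actual code:

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical hook: takes (image_path, outbase), writes a searchable PDF,
# and returns the path of the file it wrote.
OcrToPdf = Callable[[Path, Path], Path]


class Page:
    """Illustrative stand-in for a single processed page image."""

    def __init__(self, image_path) -> None:
        self.image_path = Path(image_path)

    def export_to_pdf(self, out_dir, ocr_to_pdf: OcrToPdf) -> Path:
        # Name the output after the image, without its extension.
        outbase = Path(out_dir) / self.image_path.stem
        return ocr_to_pdf(self.image_path, outbase)


class Document:
    """Illustrative stand-in for a document made of pages."""

    def __init__(self, pages: Iterable[Page]) -> None:
        self.pages: List[Page] = list(pages)

    def export_to_pdf(self, out_dir, ocr_to_pdf: OcrToPdf) -> List[Path]:
        # One PDF per page; merging them into one file is left out here.
        return [page.export_to_pdf(out_dir, ocr_to_pdf) for page in self.pages]
```

Passing the OCR step in as a callable keeps the classes independent of how the text layer is actually produced, which seems useful while that part is still undecided.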
Yup - would love to. Won't get to it before next week but will start a PR when I can :)
It's part of the ocr command as an optional output format so not sure what the right place would be to integrate it with doc2text.
Awesome, thank you!
The method's location in the code would depend on the way tesseract embeds that data. Does tesseract insert the data into a PDF, or is it kept in a separate state that contains the text and placement information?
In the first case, we would need the method you mentioned that produces a nicely optimized pdf from the images first, then the embedding second. We need this method regardless, I think. In the second case, we could run the tesseract embed method at any time after we produce the fixed image crop.
Thoughts?
So this is basically what I was talking about.
wget http://mfma.treasury.gov.za/MFMA/Urban%20Development%20Zones/Gazette%20No.%2026866.pdf
gs -dNOPAUSE -q -r500 -sDEVICE=tiffg4 -dBATCH -sOutputFile=test.tif Gazette\ No.\ 26866.pdf
tesseract test.tif outbase pdf
Tesseract produces the PDF already, so you'd select that as the output format of the OCR step. There's no intermediate hOCR or anything.
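For reference, the same two steps could be driven from Python roughly like this. It is only a sketch: the function names and the default `test.tif` intermediate file are made up for illustration, and it assumes `gs` and `tesseract` are on the PATH. The arguments are the ones from the commands above.

```python
import subprocess
from typing import List


def ghostscript_cmd(pdf_path: str, tiff_path: str, dpi: int = 500) -> List[str]:
    """Build the gs call that rasterises a PDF to a G4-compressed TIFF."""
    return [
        "gs", "-dNOPAUSE", "-q", f"-r{dpi}", "-sDEVICE=tiffg4",
        "-dBATCH", f"-sOutputFile={tiff_path}", pdf_path,
    ]


def tesseract_pdf_cmd(image_path: str, outbase: str) -> List[str]:
    """Build the tesseract call that selects PDF as the output format."""
    return ["tesseract", image_path, outbase, "pdf"]


def make_searchable_pdf(pdf_path: str, outbase: str,
                        tiff_path: str = "test.tif") -> str:
    """Run both steps; tesseract writes its result to <outbase>.pdf."""
    subprocess.run(ghostscript_cmd(pdf_path, tiff_path), check=True)
    subprocess.run(tesseract_pdf_cmd(tiff_path, outbase), check=True)
    return f"{outbase}.pdf"
```

Building the argument lists separately from running them avoids shell quoting problems with file names that contain spaces, such as the gazette above.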
tesseract seems to be able to produce PDFs these days with text overlaid on the image. This is useful for searching in the PDF when viewing it that way.
It'd be nice if this could produce de-skewed PDFs.