ropensci / qpdf

Split, Combine and Compress PDF files
https://docs.ropensci.org/qpdf

Question: overlaying OCR'd text in package scope? #2

Closed: treysp closed this issue 3 years ago

treysp commented 5 years ago

Hello,

Thanks so much for pdftools, qpdf, and all the other ropensci packages!

I recently received scanned PDFs and needed to make them searchable. The OCRmyPDF library accomplishes that by running OCR with Tesseract and then adding an invisible text layer over the base raster layer.

It appears that OCRmyPDF uses pikepdf as its primary PDF manipulation tool, and pikepdf is built on QPDF.

I'm not sure whether making PDFs searchable is a common enough need to warrant building. If it is, would that be in scope for this package, or would it belong somewhere else?

Best, Trey

EDIT: Tesseract has a text-only PDF output option that may allow using qpdf's overlay function to create the searchable text layer. See the discussion at Tesseract issue 660.

Apparently OCRmyPDF uses that Tesseract output to create the overlaid PDF page. I can't quite figure out if the "sandwich renderer" is a name they came up with or an actual external tool.
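
For anyone wanting to try the idea from the EDIT above, here is a rough sketch of the two steps as command lines, built from Python. It is not how OCRmyPDF or this package does it. It assumes the `tesseract` and `qpdf` command-line tools are on the PATH, that the installed versions support the `textonly_pdf` config variable and the `--overlay` option, and that the text-only page has the same size as the scanned page so the text lines up. The function names and file names are made up for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_text_only_cmd(image, out_base):
    """Build the Tesseract call that writes a PDF holding only the text layer."""
    # "-c textonly_pdf=1" leaves the image out of the PDF; the trailing "pdf"
    # selects PDF output. Tesseract appends ".pdf" to out_base by itself.
    return ["tesseract", str(image), str(out_base), "-c", "textonly_pdf=1", "pdf"]


def qpdf_overlay_cmd(scanned_pdf, text_pdf, output_pdf):
    """Build the qpdf call that overlays text_pdf's pages on scanned_pdf's pages."""
    # The bare "--" ends the option list that belongs to --overlay.
    return ["qpdf", str(scanned_pdf), "--overlay", str(text_pdf), "--", str(output_pdf)]


def make_searchable(image, scanned_pdf, output_pdf):
    """Hypothetical helper: run both steps for a single-page scan."""
    output_pdf = Path(output_pdf)
    out_base = output_pdf.parent / (output_pdf.stem + "_text")
    text_pdf = out_base.parent / (out_base.name + ".pdf")
    subprocess.run(tesseract_text_only_cmd(image, out_base), check=True)
    subprocess.run(qpdf_overlay_cmd(scanned_pdf, text_pdf, output_pdf), check=True)
    return output_pdf
```

Calling `make_searchable("page.png", "scan.pdf", "searchable.pdf")` would run Tesseract on the page image and then overlay the result on the scanned PDF. Keeping the command construction separate from the `subprocess` calls makes it easy to print and check the commands before running anything.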