ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

OCR PDF Attachments? #259

Open jmrichardson opened 6 years ago

jmrichardson commented 6 years ago

Does/will OCRmyPDF support embedded documents/attachments in a portfolio? Thanks

jbarlow83 commented 6 years ago

Not currently, and it's not planned any time soon, but I think you're the second or third person to ask, so there's some demand anyway. (See also #197)

I made some notes about how to go about doing this, whether they're useful to you or to me as a reference when I implement it:

Ghostscript recently added PDF/A-3 support, so this is possible within Ghostscript. The current solution would be to modify the pdfmark file, named pdfa.ps, generated by ocrmypdf/pdfa.py, to include a step that embeds the file according to the pdfmark specification: see page 30 for the /EMBED command, and this Ghostscript bug for a functioning example. Use absolute paths.
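As a rough illustration of the pdfmark approach, here is a minimal Python sketch that produces the PostScript text one might append to pdfa.ps. It assumes the stream-object-plus-/EMBED pattern commonly used with Ghostscript. The helper names `ps_string` and `embed_pdfmark`, the `{attachment}` object name, and the default MIME type are hypothetical and not part of OCRmyPDF.

```python
import os


def ps_string(text: str) -> str:
    """Return text as a PostScript literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def embed_pdfmark(
    path: str,
    name: str,
    mime_type: str = "application/octet-stream",  # assumed default, adjust per file
    obj_name: str = "attachment",  # hypothetical pdfmark object name; must be unique per file
) -> str:
    """Build pdfmark commands that attach the file at `path` under the display name `name`."""
    source = ps_string(os.path.abspath(path))  # absolute path, as the notes above advise
    label = ps_string(name)
    obj = "{" + obj_name + "}"
    lines = [
        # Create a named stream object to hold the attachment's bytes.
        f"[ /_objdef {obj} /type /stream /OBJ pdfmark",
        # Fill the stream from the file on disk.
        f"[ {obj} {source} (r) file /PUT pdfmark",
        # Mark it as an embedded file; cvn turns the MIME string into a name.
        f"[ {obj} << /Type /EmbeddedFile /Subtype {ps_string(mime_type)} cvn >> /PUT pdfmark",
        # Register a file specification that points at the stream.
        f"[ /Name {label} /FS << /Type /Filespec /F {label} /EF << /F {obj} >> >> /EMBED pdfmark",
        # Close the stream object.
        f"[ {obj} /CLOSE pdfmark",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(embed_pdfmark("invoice.xml", "invoice.xml", mime_type="text/xml"))
```

Each attachment would need its own `obj_name`, and the generated text still has to be checked against the pdfmark specification and the Ghostscript version in use.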

A better option would be to teach pikepdf how to embed files according to reference manual section 7.11.4, since this would work without Ghostscript. OCRmyPDF will add pikepdf as a dependency soon (I maintain both).

If you're able to do a PR for either, I'd be happy to accept it.