ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Feature]: Add a flag to enable ocrmypdf to write "last-modified attribute" to the OCR'ed file. #1389

Closed ashrockd closed 6 days ago

ashrockd commented 2 weeks ago

Describe the proposed feature

I am indexing scanned files. I indexed them once before OCR. Now, after OCR, I want their content indexed as well. However, the utility I am using (DocFetcher) simply ignores the already-indexed files, because it does not know that they have been modified by the added text layer.

Is it possible for ocrmypdf to have a flag that updates the last-modified/last-written attribute of the output file?

Here is the discussion on the DocFetcher forum.

ashrockd commented 2 weeks ago

I'm looking for a flag/option along the lines of: ocrmypdf --write-attributes input.pdf output.pdf

jbarlow83 commented 2 weeks ago

The default behavior is that output.pdf will have a last modified date set to the time ocrmypdf finishes. The file system, the DocumentInfo dictionary, and the XMP metadata will have the same date set.

You may see different behavior if you have a pikepdf version older than 9.0.0 installed (see https://github.com/pikepdf/pikepdf/issues/595), or if for some reason the operating system does not permit you to change timestamps.
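
To check whether the file system timestamp was actually updated on a given system, something like the following might help. It is only a minimal sketch using the Python standard library: the helper names `modified_after` and `touch_now` are hypothetical and not part of ocrmypdf, and it inspects only the file system timestamp, not the DocumentInfo dictionary or the XMP metadata.

```python
import os
from pathlib import Path


def modified_after(output_pdf: str, input_pdf: str) -> bool:
    """Return True if output_pdf has a strictly newer mtime than input_pdf.

    Hypothetical helper, not part of ocrmypdf.
    """
    out_ns = Path(output_pdf).stat().st_mtime_ns
    in_ns = Path(input_pdf).stat().st_mtime_ns
    return out_ns > in_ns


def touch_now(path: str) -> None:
    """Set the access and modification times of path to the current time.

    Hypothetical fallback for cases where the timestamp was not updated.
    Raises OSError if the operating system does not permit the change.
    """
    os.utime(path, None)
```

If `modified_after("output.pdf", "input.pdf")` returns False after an OCR run, calling `touch_now("output.pdf")` should be enough for an indexer that relies on the file system's last-modified time, assuming the operating system allows the change.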