ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0
13.56k stars 992 forks

Inverted black and white from optimization #1015

Open Jmuccigr opened 1 year ago

Jmuccigr commented 1 year ago

Working with a PDF that has only tiff images in it, created with ImageMagick and then assembled into a PDF with img2pdf. Forcing no optimization leaves the images ok. Seems like same result as #419.

jbarlow83 commented 1 year ago

Check that you have the latest pikepdf. 5.6.1 introduced a possible fix to some black/white inversion issues.

Jmuccigr commented 1 year ago

I've got 6.0.2.

Jmuccigr commented 1 year ago

Any thoughts?

jbarlow83 commented 1 year ago

Thoughts

alirf81 commented 1 year ago

Any updates on this issue? I have similar problems and the version of pikepdf is 6.2.1

jbarlow83 commented 1 year ago

@alirf81 If you'd like to move things along faster, please submit a reproducible example PDF and command line.

poldy8 commented 1 year ago

Hi there. Thank you so much for working on and maintaining this project.

I have been experiencing a similar issue: when I optimize a particular PDF (without performing OCR) and have it converted into a regular PDF (rather than PDF/A), black and white are inverted in the resulting PDF. I have tried it on two PDFs (of scanned books) so far, and it keeps happening to one of them, which has a bit of a black margin on every other page (I don't know if that's relevant). I use the following command:

ocrmypdf --output-type pdf --tesseract-timeout=0 --optimize 3 --skip-text input.pdf output.pdf

If I do it without --output-type pdf everything seems fine.

I am running macOS 12.6.1, and OCRmyPDF 14.0.1; and just homebrew updated/upgraded everything. As you probably can tell I'm not a superuser, so I don't know how to get the structure of pdfs, etc.

If I'm not using the best command to optimize an already ocred pdf and have it saved as a regular pdf, I'd appreciate your help on that as well.

Is there a way to quickly verify whether a pdf is regular or pdf/a on macos, without using, say, Adobe Acrobat?

Many thanks!
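
On the question of checking whether a file is a regular PDF or PDF/A without Acrobat: a rough sketch, assuming the file's XMP metadata is stored uncompressed (common, but not guaranteed), is to look for the `pdfaid:part` marker that PDF/A files declare. The function name `pdfa_claim` is hypothetical. This only reports what the file claims about itself; it does not validate PDF/A conformance, and a `None` result can also mean the metadata is compressed or otherwise not visible to a plain byte search.

```python
import re
from pathlib import Path

# The PDF/A identification schema appears in XMP either as an element
# (<pdfaid:part>2</pdfaid:part>) or as an attribute (pdfaid:part="2").
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONFORMANCE = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def pdfa_claim(path):
    """Return a string such as 'PDF/A-2B' if the file declares PDF/A, else None.

    Heuristic only: scans the raw bytes for the XMP marker and performs
    no validation of the claim.
    """
    data = Path(path).read_bytes()
    part = _PART.search(data)
    if part is None:
        return None
    conformance = _CONFORMANCE.search(data)
    level = conformance.group(1).decode().upper() if conformance else ""
    return f"PDF/A-{part.group(1).decode()}{level}"
```

For example, `pdfa_claim("output.pdf")` could be run on the file written by the command above to see whether it still declares PDF/A.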

Jmuccigr commented 1 year ago

Hmm, if I use pdfimages to extract the image from my PDF, it produces a ccitt/params pair which, when I run fax2tiff on it, produces the same kind of inverted image. If I tell pdfimages to output a png, the image has the expected colors.

vejkse commented 1 year ago

[I had to delete and repost this comment because I made a mistake and uploaded the wrong files. Sorry…]

Here is an example, with everything that led to its creation. It’s a blank page, but all the pages with text from the same original file created using the same process got inverted in the end.

  1. The original PDF file was A.pdf, but when OCRing it (i.e. the other pages with text in them), the result had spaces between almost every letter, so I decided to extract the images, rebuild a PDF file, and re-OCR the result.
  2. pdfimages -tiff A.pdf B
  3. img2pdf --output C.pdf B-000.tif
  4. ocrmypdf --language eng --output-type pdf C.pdf D.pdf — The resulting file D.pdf is now correctly OCRed, without spaces between the letters, but white-on-black rather than black-on-white.
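
For anyone trying to reproduce this, the three commands above can be collected into a small script. This is only a sketch: the file names are the ones from the steps, the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes pdfimages, img2pdf and ocrmypdf are all on the PATH.

```python
import shutil
import subprocess


def build_commands(source="A.pdf", prefix="B", rebuilt="C.pdf", output="D.pdf"):
    """Return the three command lines from the steps above, in order."""
    # pdfimages names its output <prefix>-000.tif, <prefix>-001.tif, ...
    first_image = f"{prefix}-000.tif"
    return [
        ["pdfimages", "-tiff", source, prefix],
        ["img2pdf", "--output", rebuilt, first_image],
        ["ocrmypdf", "--language", "eng", "--output-type", "pdf", rebuilt, output],
    ]


def run_pipeline(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} is not on PATH")
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_commands())` in a directory containing A.pdf should leave C.pdf and D.pdf next to it, so the two can be compared for the inversion.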

Here are all the files, except B-000.tif since GitHub doesn’t allow me to upload it. A.pdf C.pdf D.pdf

Versions: