Open Jmuccigr opened 1 year ago
Check that you have the latest pikepdf. 5.6.1 introduced a possible fix to some black/white inversion issues.
I've got 6.0.2.
Any thoughts?
Thoughts
--json features as a way of showing me the structure of the PDF without the content
Any updates on this issue? I have similar problems and the version of pikepdf is 6.2.1
@alirf81 If you'd like to move things along faster, please submit a reproducible example PDF and command line.
Hi there. Thank you so much for working on and maintaining this project.
I have been experiencing a similar issue: when I try to optimize a particular PDF (without performing OCR) and convert it to a regular PDF (rather than PDF/A), the resulting PDF also has black and white inverted. I have tried it on two PDFs (of scanned books) so far, and it keeps happening to one of them, which has a bit of a black margin on every other page (I don't know if that's relevant). I use the following command:
ocrmypdf --output-type pdf --tesseract-timeout=0 --optimize 3 --skip-text input.pdf output.pdf
If I do it without --output-type pdf, everything seems fine.
I am running macOS 12.6.1, and OCRmyPDF 14.0.1; and just homebrew updated/upgraded everything. As you probably can tell I'm not a superuser, so I don't know how to get the structure of pdfs, etc.
If I'm not using the best command to optimize an already OCRed PDF and have it saved as a regular PDF, I'd appreciate your help on that as well.
Is there a way to quickly verify whether a PDF is regular or PDF/A on macOS, without using, say, Adobe Acrobat?
Many thanks!
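On the PDF/A question: a PDF/A file normally declares itself through the pdfaid:part and pdfaid:conformance properties in its XMP metadata, so a rough check is possible without Acrobat. The following is only a minimal sketch using the Python standard library. The helper name pdfa_claim is made up for illustration, and it assumes the XMP packet is stored uncompressed and in UTF-8. It reports what the file claims to be, not whether it actually conforms.

```python
import re
from pathlib import Path
from typing import Optional

# XMP can serialize the PDF/A identification in two ways:
#   attribute form: pdfaid:part="1"
#   element form:   <pdfaid:part>1</pdfaid:part>
_PDFA_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
_PDFA_CONF = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def pdfa_claim(data: bytes) -> Optional[str]:
    """Return e.g. 'PDF/A-2B' if the raw bytes declare PDF/A, else None.

    Heuristic only: assumes an uncompressed UTF-8 XMP packet and does
    not validate the file against the PDF/A standard.
    """
    part = _PDFA_PART.search(data)
    if part is None:
        return None
    conf = _PDFA_CONF.search(data)
    level = conf.group(1).decode().upper() if conf else ""
    return f"PDF/A-{part.group(1).decode()}{level}"


# Usage (file name is a placeholder):
#   print(pdfa_claim(Path("output.pdf").read_bytes()))
```

A result of None means no PDF/A declaration was found, which is what one would expect from a file written with --output-type pdf.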
Hmm, if I use pdfimages to extract the image from my PDF, it produces a ccitt/params pair which, when I run fax2tiff on it, produces the same kind of inverted image. If I tell pdfimages to output a png, the image has the expected colors.
[I had to delete and repost this comment because I made a mistake and uploaded the wrong files. Sorry…]
Here is an example, with everything that led to its creation. It’s a blank page, but all the pages with text from the same original file created using the same process got inverted in the end.
I started with A.pdf, but when OCRing it (i.e. the other pages with text in them), the result had spaces between almost every letter, so I decided to extract the images, rebuild a PDF file, and re-OCR the result.
pdfimages -tiff A.pdf B
img2pdf --output C.pdf B-000.tif
ocrmypdf --language eng --output-type pdf C.pdf D.pdf
The resulting file D.pdf is now correctly OCRed, without spaces between the letters, but white-on-black rather than black-on-white.
Here are all the files, except B-000.tif, since GitHub doesn’t allow me to upload it.
A.pdf
C.pdf
D.pdf
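For anyone who wants to repeat the three steps above in one go, here is a rough Python sketch that wraps the same commands. The function names are made up for illustration. It assumes pdfimages, img2pdf and ocrmypdf are on the PATH and that the extracted images are named with the given prefix followed by a number and .tif, as in B-000.tif.

```python
import subprocess
from pathlib import Path


def extract_cmd(src, prefix):
    # Step 1: dump the embedded images as TIFF files (B-000.tif, ...).
    return ["pdfimages", "-tiff", str(src), str(prefix)]


def rebuild_cmd(images, rebuilt):
    # Step 2: wrap the extracted images into a new PDF.
    return ["img2pdf", "--output", str(rebuilt), *map(str, images)]


def ocr_cmd(rebuilt, final):
    # Step 3: OCR the rebuilt PDF, asking for a regular PDF, not PDF/A.
    return ["ocrmypdf", "--language", "eng", "--output-type", "pdf",
            str(rebuilt), str(final)]


def reproduce(src="A.pdf", prefix="B", rebuilt="C.pdf", final="D.pdf"):
    """Run the three steps in order, stopping at the first failure."""
    subprocess.run(extract_cmd(src, prefix), check=True)
    prefix_path = Path(prefix)
    # Collect whatever step 1 produced, in page order.
    images = sorted(prefix_path.parent.glob(prefix_path.name + "-*.tif"))
    subprocess.run(rebuild_cmd(images, rebuilt), check=True)
    subprocess.run(ocr_cmd(rebuilt, final), check=True)
```

Calling reproduce() in the directory that contains A.pdf should leave C.pdf and D.pdf next to it, which makes it easy to compare the two for the inversion.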
Versions:
pdfimages
Working with a PDF that has only tiff images in it, created with ImageMagick and then assembled into a PDF with img2pdf. Forcing no optimization leaves the images ok. Seems like same result as #419.