internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Some scans become inverted #45

Closed Redsandro closed 2 years ago

Redsandro commented 2 years ago

I've noticed it two times before, and I thought it was a computer issue because I scanned too large at 600 dpi. But now I encounter this for a third time, this time while scanning a small card at 300 dpi. I'm beginning to think this might be a bug.

Original: left. recode_pdf: right. [image]

My normal workflow:

ls -1 *.png > in.txt
tesseract -l nld+eng --dpi 300 in.txt out hocr
recode_pdf -v -m 2 --dpi 300 --from-imagestack "./*.png" --hocr-file out.hocr -o "out-recode.pdf"

Is this a known issue? Is there a known workaround? I did a quick search, but it didn't turn up anything. I'm not sure I can share the full-resolution card openly because it is copyrighted, but if this issue has never been seen before, I am willing to email the full-resolution file for testing purposes.

$ recode_pdf --version
internetarchivepdf 1.4.14

MerlijnWajer commented 2 years ago

Please share the images by email - I have not seen this before. You can reach me on my first name (merlijn) on the internet archive website (archive.org)

MerlijnWajer commented 2 years ago

If you could, please also share the output PDF that you get, for good measure.

Redsandro commented 2 years ago

> Please share the images by email - I have not seen this before. You can reach me on (...)

I have sent you an email. (Please download the attachment before the link expires, even if you don't have time to look at it yet.) I removed the irrelevant pages with text because the problem also occurs with this page alone.

MerlijnWajer commented 2 years ago

Got it, thanks. It looks like the jpx files in the PDF are CMYK somehow, that's probably related, will let you know.

MerlijnWajer commented 2 years ago

Removing the transparency layer from the file makes it work, let me see where it goes wrong in archive-pdf-tools then.

MerlijnWajer commented 2 years ago

I think you've actually hit a pretty significant issue, which has basically never been hit in the archive.org path due to the materials that we deal with, but this explains why I had some trouble recoding some existing digital PDFs when I was toying with a tool for OCRmyPDF compression.

In any case, the commit above should fix it for your case, while I try to think of a way to maybe support transparency in MRC? I don't think there is a way.

Redsandro commented 2 years ago

Thank you! This works. :+1:

I had no idea that alpha was introduced somewhere in my pipeline, or that it would cause a problem.

I think some editors, scanners, or pipeline steps add alpha only intermittently, or the alpha is only intermittently a problem, because I do multiple things in the same workflow and it doesn't always go wrong.
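
For anyone trying to find out which step in a pipeline introduces alpha, a rough check is to read the colour-type byte in each PNG header before running recode_pdf. The sketch below assumes a POSIX shell with `od` available; `png_has_alpha` is a hypothetical helper name, not part of archive-pdf-tools. It only looks at the IHDR colour type, so palette images that carry transparency through a tRNS chunk would not be flagged.

```shell
#!/bin/sh
# Sketch: report PNG files whose header declares an alpha channel.
# png_has_alpha is a hypothetical helper, not part of archive-pdf-tools.

png_has_alpha() {
    # A PNG starts with an 8-byte signature followed by the IHDR chunk;
    # the colour-type field sits at byte offset 25 of the file.
    ctype=$(od -An -tu1 -j25 -N1 "$1" | tr -d '[:space:]')
    case "$ctype" in
        4|6) return 0 ;;   # 4 = greyscale + alpha, 6 = RGB + alpha
        *)   return 1 ;;   # anything else: no alpha declared in IHDR
    esac
}

# Usage: list every PNG in the current directory that carries alpha.
# for f in ./*.png; do
#     png_has_alpha "$f" && echo "$f has an alpha channel"
# done
```

Running the commented loop after each stage of a scan workflow should narrow down where the alpha channel first appears.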

> I try to think of a way to maybe support transparency in MRC? I don't think there is a way.

I don't think you should waste your time on that. I see no reason why alpha would need to be supported in a document archival tool. Alpha channels in scans, if any, will be 100% opaque near 100% of the time and can safely be discarded.
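
On a version without the fix, discarding the alpha channel before OCR and recoding might serve as a workaround. The sketch below assumes ImageMagick is installed (`magick` in version 7, `convert` in version 6); `flatten_png` is a hypothetical helper name, and compositing over white is an assumption that suits typical paper scans. It has not been verified against the images from this thread.

```shell
#!/bin/sh
# Sketch: composite a PNG over white and drop its alpha channel.
# flatten_png is a hypothetical helper; it assumes ImageMagick is installed.

flatten_png() {
    # Prefer the ImageMagick 7 entry point, fall back to the v6 "convert".
    if command -v magick >/dev/null 2>&1; then im=magick; else im=convert; fi
    # "-alpha remove" blends the image onto the background colour,
    # "-alpha off" then disables the (now fully opaque) alpha channel.
    "$im" "$1" -background white -alpha remove -alpha off "$2"
}

# Usage: write flattened copies into ./flat and feed those to the workflow.
# mkdir -p flat
# for f in ./*.png; do flatten_png "$f" "flat/$(basename "$f")"; done
```

The flattened copies in `flat/` can then go through the same tesseract and recode_pdf commands as the originals.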