pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Image in DeviceGray has reversed black and white #98

Closed bimusiek closed 7 years ago

bimusiek commented 7 years ago

Hey, I have a PDF that contains a QR code. However, when I read it, black and white are reversed (so I cannot decode the QR code).

Any idea how I can tell when to invert the colors?

DEBUG 2017-09-25 17:25:49,167 image_extractor 12122 139906630657792 Pix: fitz.Pixmap(DeviceGray, fitz.IRect(0, 0, 198, 68), 0)
DEBUG 2017-09-25 17:25:49,167 image_extractor 12122 139906630657792 Pix: fitz.Colorspace(fitz.CS_GRAY) - DeviceGray

JorjMcKie commented 7 years ago

Hi, please help me understand: what is QR?

You can always invert colors with the Pixmap method invertIRect(irect).
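To illustrate what inversion means at the level of raw samples, here is a minimal pure-Python sketch. It assumes an 8-bit DeviceGray pixmap without an alpha channel, so each byte is one pixel. The helper name invert_gray is hypothetical and not part of PyMuPDF; it only mirrors the effect that inverting a whole pixmap would have.

```python
def invert_gray(samples: bytes) -> bytes:
    """Invert 8-bit grayscale samples: 0 (black) <-> 255 (white).

    Assumption: one byte per pixel, no alpha channel.
    This is an illustrative helper, not a PyMuPDF function.
    """
    return bytes(255 - value for value in samples)


# Three pixels: black, white, and a mid-dark gray.
original = bytes([0, 255, 100])
inverted = invert_gray(original)
```

Applying the helper twice returns the original samples, which is a quick sanity check when experimenting.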

Other than that, users have had similar issues when their MuPDF was not built exclusively and completely from the third-party software included in the MuPDF package, i.e. when libraries on their system were mixed with those shipped with MuPDF.

If you want, you can send me your PDF so I can help investigate.

Please also be explicit with your OS / Python / PyMuPDF version.

JorjMcKie commented 7 years ago

Any news on this?

bimusiek commented 7 years ago

Hey @JorjMcKie, can you share your email? The PDF is an old passbook, but it contains our client details, so I cannot share it publicly.

JorjMcKie commented 7 years ago

I have taken a look at the PDF in the meantime: MuPDF (mutool extract), and hence PyMuPDF, indeed do not reproduce this image (again a barcode) correctly. The outcome is instead a pure black image, because the mask that is also stored in the PDF is not applied. In contrast, Nitro PDF, for example, does this correctly.

I will submit an issue to MuPDF and continue looking in the library for a way out.
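A small sketch may help show why an all-black image can still carry a barcode once its mask is applied. It assumes 8-bit gray samples, a mask of equal size whose bytes act as alpha values (255 = opaque), and a white page background; the function name composite_over_white is hypothetical and is not a MuPDF or PyMuPDF call.

```python
def composite_over_white(gray: bytes, alpha: bytes) -> bytes:
    """Blend gray samples over a white background using per-pixel alpha.

    Assumptions: 8-bit samples, same pixel count in both inputs,
    alpha 255 = fully opaque, 0 = fully transparent.
    """
    if len(gray) != len(alpha):
        raise ValueError("image and mask must have the same number of pixels")
    return bytes(
        (g * a + 255 * (255 - a)) // 255  # standard "over" blend on white
        for g, a in zip(gray, alpha)
    )


# A pure black image: without the mask, every pixel looks the same.
black_image = bytes([0, 0, 0])
# The mask carries the actual pattern: opaque, transparent, half-transparent.
mask = bytes([255, 0, 128])
visible = composite_over_white(black_image, mask)
```

Where the mask is opaque the black pixel shows; where it is transparent the white background shows, so the pattern lives entirely in the mask.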

bimusiek commented 7 years ago

Thanks a lot for investigating 👍

JorjMcKie commented 7 years ago

Hi again, I found out that mutool extract does indeed find the barcode image and its masking image, the latter of which it treats as an independent pixmap. I experimented a little and found that the following script re-creates the original barcode image with the correct coloring. In what follows, test.pdf is your anonymized PDF, which contains the barcode image as PDF object number 6; object number 7 is the corresponding SMask image. The script builds an RGBA samples area in which the RGB values are taken from the pixmap of object 6 and the alpha values from the samples of the pixmap of object 7. When the pixmap created from these new samples (a bytearray called ba) is saved as test.png, it shows the original image.

gluepix.txt
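The attached script is not reproduced in this thread. As a rough sketch of the interleaving step described above, the following pure-Python function builds an RGBA byte buffer from RGB samples and mask samples. It assumes 8-bit samples, an RGB image without alpha, and a mask with one byte per pixel; the name glue_alpha is hypothetical and this is not necessarily how gluepix.txt does it.

```python
def glue_alpha(rgb: bytes, alpha: bytes) -> bytearray:
    """Interleave RGB samples with per-pixel alpha into RGBA samples.

    Assumptions: rgb holds 3 bytes per pixel, alpha holds 1 byte per pixel,
    and both describe the same number of pixels.
    """
    if len(rgb) != 3 * len(alpha):
        raise ValueError("expected 3 RGB bytes for every alpha byte")
    ba = bytearray(4 * len(alpha))   # target buffer, 4 bytes per pixel
    ba[0::4] = rgb[0::3]             # red components
    ba[1::4] = rgb[1::3]             # green components
    ba[2::4] = rgb[2::3]             # blue components
    ba[3::4] = alpha                 # alpha taken from the mask samples
    return ba


# Two pixels: (1, 2, 3) fully opaque and (4, 5, 6) fully transparent.
ba = glue_alpha(bytes([1, 2, 3, 4, 5, 6]), bytes([255, 0]))
```

Extended slice assignment keeps the copy compact; a pixel-by-pixel loop would give the same result, only more slowly.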

Thanks to your inquiry, I now have a few things to extend PyMuPDF with ... :-)

JorjMcKie commented 7 years ago

@bimusiek I have added functionality to solve your problem - hopefully in a fairly elegant way:

import fitz                          # PyMuPDF
doc = fitz.open("test.pdf")          # the anonymized PDF mentioned earlier in this thread
xref, smask = 6, 7                   # image object and its SMask object in that PDF
pix1 = fitz.Pixmap(doc, xref)        # pixmap without alpha channel
pix2 = fitz.Pixmap(doc, smask)       # this contains the alpha values in its samples
pix3 = fitz.Pixmap(pix1)             # copy of pix1 with alpha channel added (new constructor)
pix3.setAlpha(pix2.samples)          # fill its alpha values with pix2 samples (new method)
pix3.writePNG("Im2.png")             # this should look right now!

Pixmap pix3 should now reflect the original image, which is what you were missing ... let me know how it works for you.