pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Question / Comment: Setting xres/yres for new image pages #479

Closed wohali closed 4 years ago

wohali commented 4 years ago

Hi there,

PyMuPDF is great! Thanks for the best PDF support in Python on the planet.

I'm currently working on some code to straighten scanned images in PDFs. I use PyMuPDF, imageio and numpy/scikit to do most of the heavy lifting.

Following the guide, I can't figure out how to restore the original xres/yres of the extracted image.

Code excerpt, very similar to your example:

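import fitz  # PyMuPDF
from imageio import imread, imwrite

doc = fitz.open(filename)
imgdict = doc.extractImage(xref)  # dict with the raw image bytes under "image"
img = imread(imgdict["image"])
# image is manipulated, then:
imgbytes = imwrite("<bytes>", img, format=imgext)  # encode the array back to bytes
imgdoc = fitz.open(stream=imgbytes, filetype=imgext)  # open the image as a one-page document
rect = imgdoc[0].rect
pdfbytes = imgdoc.convertToPDF()
imgdoc.close()
imgPDF = fitz.open("pdf", pdfbytes)
page = newdoc.newPage(width=rect.width, height=rect.height)  # newdoc: the output PDF, created elsewhere
page.showPDFpage(rect, imgPDF, 0)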

I think the issue is in the fitz.open() call. I don't see any way to pass in xres/yres (ppi) options to the constructor. No matter what I do, I end up with 96ppi. (Certain "fruit-branded" PDF readers care about this value a lot.)

How do I retain the original xres/yres values when creating a new document from an image byte stream?

wohali commented 4 years ago

Actually, I'm seeing a bug here as well.

Reference this PDF: https://github.com/wohali/hough/blob/master/samples/Newman_Computer_Exchange_VAX_PC_PDP11_Values.pdf

Comparing with pdfimages (part of xpdf / poppler):

$ pdfimages -list samples/Newman_Computer_Exchange_VAX_PC_PDP11_Values.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1248  1631  rgb     3   8  jpeg   no         8  0   150   150  216K 3.6%
   2     1 image    1259  1629  rgb     3   8  jpeg   no        14  0   150   150  251K 4.2%
   3     2 image    1239  1638  rgb     3   8  jpeg   no        20  0   150   150  198K 3.3%
   4     3 image    1256  1634  rgb     3   8  jpeg   no        26  0   150   150  166K 2.8%
$
$ python3
>>> import fitz
fitz.Document('samples/Newman_Computer_Exchange_VAX_PC_PDP11_Values.pdf')
>>> doc = fitz.open("samples/Newman_Computer_Exchange_VAX_PC_PDP11_Values.pdf")
>>> imgdict = doc.extractImage(8)
>>> imgdict['xres']
96
>>> imgdict['yres']
96

Is this a MuPDF bug? Are xres/yres hardcoded to 96?

JorjMcKie commented 4 years ago

Hi, Thank you very much for finding PyMuPDF useful!

Is this a MuPDF bug? Are xres/yres hardcoded to 96?

I need to check this. I am just passing through what the MuPDF API is giving me. Similar for opening images via the document interface ... I'll be back.

JorjMcKie commented 4 years ago

First finding: when creating pixmaps from files, from memory (bytes), or from PDF document xrefs, the respective MuPDF functions simply seem to ignore information that is actually there. In each of these cases, the pixmap is created from an intermediate internal MuPDF object type called "image" (I have not reflected this object type as a Python class in PyMuPDF). I am currently experimenting with simply copying the information over from the image to the pixmap, which seems to work ... so far. Continuing with more image types here.

A remark about the document interface for images: Documents have no such attributes as xres, yres. They do not know their origin ... at least as far as I have reflected this in the attributes of documents.

Still checking why method extractImage would not deliver the correct values.

JorjMcKie commented 4 years ago

In your example PDF, none of the 4 images contain resolution information. I verified this by extracting the raw streams and letting another image reader (IrfanView) analyze them: those two fields are empty in all 4 cases. So what MuPDF does in this situation is provide what they believe to be a "sane" standard: 96. I can only assume that pdfimages does the same thing here. I recall from using Xpdf that its standard resolution is 150 dpi, so maybe this is their "sane" assumption.
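For anyone who wants to repeat this check without an image viewer, here is a minimal sketch that reads the density fields from a JPEG's JFIF APP0 header using only the standard library. The function name jfif_density is made up for this illustration and is not part of PyMuPDF. It only looks at the JFIF header, so resolution stored elsewhere (for example in EXIF data) is not detected, and it gives up on unusual marker sequences.

```python
import struct


def jfif_density(data: bytes):
    """Return (units, xdensity, ydensity) from a JPEG's JFIF APP0 segment, or None.

    units: 0 = no absolute unit (aspect ratio only), 1 = dots per inch,
    2 = dots per cm. Hypothetical helper, not a PyMuPDF function.
    """
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:  # not positioned on a marker: stop rather than guess
            return None
        marker = data[pos + 1]
        if marker == 0xDA:  # SOS: compressed data follows, no more header segments
            return None
        # The 2-byte length counts itself but not the 2-byte marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE0 and payload[:5] == b"JFIF\x00" and len(payload) >= 12:
            # Layout: "JFIF\0", version (2 bytes), units (1), Xdensity (2), Ydensity (2)
            return struct.unpack(">BHH", payload[7:12])
        pos += 2 + length
    return None
```

With the code from this thread it could be called as jfif_density(imgdict["image"]) on an extracted JPEG. A result of None, or a units value of 0, means the JFIF header carries no absolute resolution, in which case a reader has to fall back on some default.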

Independent of this, there is an issue with the values returned by extractImage and Pixmap(doc, xref): they are not correct. Both of them ultimately call the same MuPDF function, which must be the culprit.

So this is where I am now with my modifications:

wohali commented 4 years ago

Thanks for the info! I'll be sure to grab a new sample PDF with proper resolution info in it once this fix lands for my own testing, though it's interesting that this PDF provided a different, yet still entirely important test :laughing:

wohali commented 4 years ago

@JorjMcKie After re-reading, I have two questions:

  1. If I'm creating a Pixmap from an image that doesn't have xres/yres stored, is there a way I can manually override the default "96" value to something else, if I know better (or want to force a specific value)?
  2. Your preferred code approach (that is faster) requires creating a doc directly from a bitmap, as per my code above. Must I go through the slower, space-hogging version if I want to manually specify the xres/yres for an image, or is there a way I can manipulate the xres/yres fields in the PDF metadata tree thru the library?

JorjMcKie commented 4 years ago

If I'm creating a Pixmap from an image that doesn't have xres/yres stored, is there a way I can manually override the default "96" value to something else, if I know better (or want to force a specific value)?

I will provide a new Pixmap method for this: Pixmap.setResolution(xres, yres), which will set those values in the respective C structure shadowing the Python class. You can already set Pixmap.xres = 150 now, but that will be known to Python only ... so if you save the image, it won't be reflected there. Hopefully it will with the new method -> subject to testing.

Your preferred code approach (that is faster) requires creating a doc directly from a bitmap, as per my code above. Must I go through the slower, space-hogging version if I want to manually specify the xres/yres for an image, or is there a way I can manipulate the xres/yres fields in the PDF metadata tree thru the library?

Of course there is nothing that prevents you from putting information in the metadata of a PDF once you have created one from an image document. I would wonder though if that would serve any purpose apart from sheer documentation. You could also do this:

import fitz  # PyMuPDF

# imgbytes is an image in memory  ... bytes or bytearray
img_dict = fitz.TOOLS.image_profile(imgbytes)  # returns basic image properties
# returns None if unsuccessful
width = img_dict["width"]
height = img_dict["height"]
xres = img_dict["xres"]  # (*)
yres = img_dict["yres"]  # (*)
page = doc.newPage(width=width, height=height)  # doc: the output PDF, opened elsewhere
page.insertImage(page.rect, stream=imgbytes)
# insertImage internally also uses TOOLS.image_profile

Please note that (the currently undocumented) TOOLS.image_profile does not yet (v1.16.16) have the above info marked with (*), but will in the next version. Independent from this: the image stored from imgbytes in the PDF will contain all image information (img type, resolution, etc).
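A side note on how the page size relates to the ppi a viewer reports, as a sketch under stated assumptions rather than a description of what PyMuPDF does internally. PDF page dimensions are given in points (72 per inch), so when an image fills the whole page, its effective resolution is simply pixels divided by the page size in inches. As far as I know, pdfimages reports its x-ppi / y-ppi columns as this rendered resolution and does not read them from the image file. Both helper names below are made up for this illustration.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def page_size_for_ppi(width_px, height_px, xres, yres):
    """Page size in points at which a full-page image renders at xres / yres ppi."""
    return (width_px * POINTS_PER_INCH / xres,
            height_px * POINTS_PER_INCH / yres)


def effective_ppi(pixels, displayed_points):
    """Resolution at which `pixels` are rendered across `displayed_points` points."""
    return pixels * POINTS_PER_INCH / displayed_points


# First image of the sample PDF in this thread: 1248 x 1631 pixels, wanted at 150 ppi.
w_pt, h_pt = page_size_for_ppi(1248, 1631, 150, 150)  # roughly 599.04 x 782.88 points
```

In the snippet above, passing these values as doc.newPage(width=w_pt, height=h_pt) instead of the raw pixel dimensions should make the inserted image render at the chosen resolution, because insertImage scales the image to fill page.rect. Using the pixel dimensions directly as points corresponds to 72 ppi.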

Keys in the dictionary returned by TOOLS.image_profile: "width", "height", "xres" (new), "yres" (new), "colorspace", "bpc", "format", "ext", "size". I actually created this function to let me correctly rotate images in method page.insertImage( ..., rotate=deg). But I see now that there is potential to use it beyond that scope ...

I already have modified Pixmap(doc, xref) to make use of it. When image_profile returns None, then this only means that some special-case image has been encountered (uncompressed images, Fax encodings and what not). It should always work for the documented input image types:

The following file types are supported as input to construct pixmaps: BMP, JPEG, GIF, TIFF, JXR, JPX, PNG, PAM and all of the Portable Anymap family (PBM, PGM, PNM, PPM).

JorjMcKie commented 4 years ago

More or less finished with testing the announced / discussed changes.

Also implemented this Pixmap.setResolution method. Unfortunately, saving a pixmap to any of the supported output formats will not store this information. If you want to save your dpi changes to an output image, you must use a package like PIL. It supports e.g. a save parameter dpi=(xres, yres) and many others.

This restriction only pertains to pixmap saves. As mentioned before: saved images which have been extracted before via Document.extractImage(xref) carry all information as stored in the PDF, including any dpi info.

Any urgency for publishing the new version?

wohali commented 4 years ago

@JorjMcKie I can wait a week or so if you have another version pending. Thanks!

JorjMcKie commented 4 years ago

@wohali - if you are on Linux or Mac, you can download a v1.16.17 wheel from here. This is where the Travis generator stores them for the two platforms - look in branches linux or osx. If you are using Windows, then I can upload your wheel using this channel, because I generate those locally on my machine.

You may want to test the new version a little and provide feedback if you see areas for improvement, before I actually publish it.

JorjMcKie commented 4 years ago

should be addressed by version 1.16.17 uploaded today

wohali commented 4 years ago

Hey @JorjMcKie , sorry I haven't gotten back to this - work has exploded in the last week and I barely have any free time. I'll try and get you feedback soon, but it looks like the info above + https://github.com/pymupdf/PyMuPDF/commit/715a017301b195fd8707df20e9edb9238f941de9 will give me what I need to get moving with this.

Thanks again!

JorjMcKie commented 4 years ago

No problem at all. Just open another issue if / when necessary.

JorjMcKie commented 4 years ago

Hi @wohali - you may be interested to know that the latest v1.18.0 automatically sets the dpi for PNG images created from pixmaps, using the values pixmap.xres / pixmap.yres.

wohali commented 4 years ago

@JorjMcKie Wow, very nice! Thanks.