Closed wohali closed 4 years ago
Actually, I'm seeing a bug here as well.
Reference this PDF: https://github.com/wohali/hough/blob/master/samples/Newman_Computer_Exchange_VAX_PC_PDP11_Values.pdf
Comparing with pdfimages (part of xpdf / poppler):
$ pdfimages -list samples/Newman_Computer_Exchange_VAX_PC_PDP11_Values.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1248 1631 rgb 3 8 jpeg no 8 0 150 150 216K 3.6%
2 1 image 1259 1629 rgb 3 8 jpeg no 14 0 150 150 251K 4.2%
3 2 image 1239 1638 rgb 3 8 jpeg no 20 0 150 150 198K 3.3%
4 3 image 1256 1634 rgb 3 8 jpeg no 26 0 150 150 166K 2.8%
$
$ python3
>>> import fitz
fitz.Document('samples/Newman_Computer_Exchange_VAX_PC_PDP11_Values.pdf')
>>> doc = fitz.open("samples/Newman_Computer_Exchange_VAX_PC_PDP11_Values.pdf")
>>> imgdict = doc.extractImage(8)
>>> imgdict['xres']
96
>>> imgdict['yres']
96
Is this a MuPDF bug? Are xres/yres hardcoded to 96?
Hi, Thank you very much for finding PyMuPDF useful!
Is this a MuPDF bug? Are xres/yres hardcoded to 96?
I need to check this. I am just passing through what the MuPDF API is giving me. Similar for opening images via the document interface ... I'll be back.
First finding:
When creating pixmaps from files, memory (bytes), or PDF document xrefs, the respective MuPDF functions simply seem to ignore information that is actually there. In each of these cases, the pixmap is created from an intermediate internal MuPDF object type called "image" (I have not reflected this object type as a Python class in PyMuPDF).
I am currently experimenting with simply copying the information over from the image to the pixmap, which seems to work ... so far.
Continuing with more image types here.
A remark about the document interface for images: Documents have no such attributes as xres, yres. They do not know their origin ... at least as far as I have reflected this in the attributes of documents.
Still checking why method extractImage would not deliver the correct values.
In your example PDF, none of the 4 images contain resolution information. I verified this by extracting the raw streams and letting another image reader (IrfanView) analyze them: those two fields are empty in all 4 cases. So what MuPDF does in this situation is provide what they believe to be a "sane" default: 96. I can only assume that pdfimages does the same thing here. I recall from using Xpdf that its standard resolution is 150 dpi, so maybe this is their "sane" assumption.
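A check like the one described above (does the raw JPEG stream carry resolution fields at all?) can also be done without an image viewer. The following is a minimal sketch, assuming the stream is a JPEG with a JFIF APP0 segment. The helper name `jfif_density` is hypothetical, not part of PyMuPDF, and resolution stored only in Exif metadata is not covered.

```python
import struct


def jfif_density(data: bytes):
    """Return (units, xdensity, ydensity) from a JFIF APP0 segment, or None.

    units: 0 = aspect ratio only (no absolute resolution),
           1 = dots per inch, 2 = dots per cm.
    """
    if data[:2] != b"\xff\xd8":           # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:             # not positioned on a marker: give up
            return None
        marker = data[pos + 1]
        if marker == 0xDA:                # start of scan: header segments are over
            return None
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]   # length includes its own 2 bytes
        if marker == 0xE0 and payload[:5] == b"JFIF\x00" and len(payload) >= 12:
            # payload layout: "JFIF\0", 2 version bytes, units, Xdensity, Ydensity
            return struct.unpack(">BHH", payload[7:12])
        pos += 2 + length
    return None
```

A return value of None, or a units value of 0, means the stream itself gives no absolute resolution, so any reader has to fall back to a default of its own.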
Independent of this, there is an issue with the values provided by extractImage and Pixmap(doc, xref). Both of them ultimately call the same MuPDF function, which must be the culprit.
So this is where I am now with my modifications:
fitz.Pixmap(file) and fitz.Pixmap(memory) correctly provide the xres, yres stored with the image, except for JPEG 2000 images (JPX / JP2).
extractImage(xref) and fitz.Pixmap(doc, xref) always return the "sane" value 96.
Thanks for the info! I'll be sure to grab a new sample PDF with proper resolution info in it once this fix lands for my own testing, though it's interesting that this PDF provided a different, yet still entirely important test :laughing:
@JorjMcKie After re-reading, I have two questions:
If I'm creating a Pixmap from an image that doesn't have xres/yres stored, is there a way I can manually override the default "96" value to something else, if I know better (or want to force a specific value)?
I will provide a new Pixmap method for this: Pixmap.setResolution(xres, yres), which will set those values in the respective C structure shadowing the Python class. You can already set Pixmap.xres = 150 now, but that will be known to Python only, so if you save the image, it won't be reflected there. Hopefully it will be with the new method, subject to testing.
Your preferred code approach (that is faster) requires creating a doc directly from a bitmap, as per my code above. Must I go through the slower, space-hogging version if I want to manually specify the xres/yres for an image, or is there a way I can manipulate the xres/yres fields in the PDF metadata tree thru the library?
Of course there is nothing that prevents you from putting information in the metadata of a PDF once you have created one from an image document. I would wonder though if that would serve any purpose apart from sheer documentation. You could also do this:
import fitz  # PyMuPDF
doc = fitz.open()  # new, empty PDF
# imgbytes is an image in memory ... bytes or bytearray
img_dict = fitz.TOOLS.image_profile(imgbytes)  # returns basic image properties,
# or None if unsuccessful
width = img_dict["width"]
height = img_dict["height"]
xres = img_dict["xres"]  # (*)
yres = img_dict["yres"]  # (*)
page = doc.newPage(width=width, height=height)
page.insertImage(page.rect, stream=imgbytes)
# insertImage internally also uses TOOLS.image_profile
Please note that the (currently undocumented) TOOLS.image_profile does not yet (v1.16.16) return the entries marked with (*) above, but will in the next version.
Independent of this: the image stored from imgbytes in the PDF will contain all image information (image type, resolution, etc.).
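One thing worth spelling out about the snippet above: it creates a page of width x height points, where width and height are pixel counts. With the default PDF unit of 1/72 inch, the image is then drawn at 72 pixels per inch on the page. The pdfimages man page describes its x-ppi / y-ppi columns as the resolution of the image when rendered on the page, so page size is one way to influence that figure. Below is a small sketch of the arithmetic. The helper names are hypothetical, and whether a particular viewer uses the page size rather than the stored image resolution is an assumption that has not been tested here.

```python
POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch


def page_size_points(width_px, height_px, xres, yres):
    """Page size (in points) at which an image is drawn at xres x yres ppi."""
    return (width_px * POINTS_PER_INCH / xres,
            height_px * POINTS_PER_INCH / yres)


def effective_ppi(pixels, points):
    """Resolution that results from drawing `pixels` across `points` on the page."""
    return pixels * POINTS_PER_INCH / points


# First image of the sample PDF: 1248 x 1631 pixels, intended for 150 ppi
w_pt, h_pt = page_size_points(1248, 1631, 150, 150)
print(round(w_pt, 2), round(h_pt, 2))   # 599.04 782.88
print(round(effective_ppi(1248, w_pt)))  # 150
```

The resulting values would be passed as width and height when creating the page, in place of the raw pixel counts.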
Keys in the dictionary returned by TOOLS.image_profile: "width", "height", "xres" (new), "yres" (new), "colorspace", "bpc", "format", "ext", "size".
I actually created this function to enable correct rotation of images in method page.insertImage(..., rotate=deg). But I see now that there is potential to use it beyond that scope ...
I have already modified Pixmap(doc, xref) to make use of it.
When image_profile returns None, this only means that some special-case image has been encountered (uncompressed images, Fax encodings and what not). It should always work for the documented input image types:
The following file types are supported as input to construct pixmaps: BMP, JPEG, GIF, TIFF, JXR, JPX, PNG, PAM and all of the Portable Anymap family (PBM, PGM, PNM, PPM).
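Since the profile can be None, and the "xres" / "yres" keys are missing before the next version, calling code may want one place that decides which resolution to use. A minimal sketch follows, assuming a dictionary shaped like the key list above. `pick_resolution` is a hypothetical helper, not a PyMuPDF function.

```python
def pick_resolution(profile, override=None, default=(96, 96)):
    """Choose (xres, yres) for an image.

    profile  -- dict shaped like the image_profile result, or None if it failed
    override -- (xres, yres) supplied by the caller; wins whenever it is given
    default  -- used when the profile is missing or has no positive values
    """
    if override is not None:
        return tuple(override)
    if profile:
        xres = profile.get("xres") or 0
        yres = profile.get("yres") or 0
        if xres > 0 and yres > 0:
            return xres, yres
    return tuple(default)
```

Note that a stored resolution of 96 cannot be told apart from the fallback of 96, which is why the explicit override takes precedence.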
More or less finished with testing the announced / discussed changes.
I have also implemented the Pixmap.setResolution method. Unfortunately, saving a pixmap to any of the supported output formats will not store this information. If you want to save your dpi changes to an output image, you must use a package like PIL. It supports, for example, a save parameter dpi=(xres, yres) and many others.
This restriction only pertains to pixmap saves. As mentioned before, images that were extracted via Document.extractImage(xref) and then saved carry all information as stored in the PDF, including any dpi info.
Any urgency for publishing the new version?
@JorjMcKie I can wait a week or so if you have another version pending. Thanks!
@wohali - if you have Linux or Mac, you can download a v1.16.17 wheel from here. This is where the Travis generator stores them for the two platforms - look in branches linux or osx.
If you are using Windows, then I can upload your wheel using this channel, because I generate those locally on my machine.
You may want to test the new version a little and provide feedback if you see areas for improvement, before I actually publish it.
should be addressed by version 1.16.17 uploaded today
Hey @JorjMcKie , sorry I haven't gotten back to this - work has exploded in the last week and I barely have any free time. I'll try and get you feedback soon, but it looks like the info above + https://github.com/pymupdf/PyMuPDF/commit/715a017301b195fd8707df20e9edb9238f941de9 will give me what I need to get moving with this.
Thanks again!
No problem at all. Just open another issue if / when necessary.
Hi @wohali - you may be interested to know that the latest v1.18.0 automatically sets the dpi for PNG images created from pixmaps, using the values pixmap.xres / pixmap.yres.
@JorjMcKie Wow, very nice! Thanks.
Hi there,
PyMuPDF is great! Thanks for the best PDF support in Python on the planet.
I'm currently working on some code to straighten scanned images in PDFs. I use PyMuPDF, imageio and numpy/scikit to do most of the heavy lifting.
Following the guide, I can't figure out how to restore the original xres/yres of the extracted image.
Code excerpt, very similar to your example:
I think the issue is in the fitz.open() call. I don't see any way to pass in xres/yres (ppi) options to the constructor. No matter what I do, I end up with 96ppi. (Certain "fruit-branded" PDF readers care about this value a lot.)
How do I retain the original xres/yres values when creating a new document from an image byte stream?