Closed (ferdnyc closed this issue 1 year ago)
Well, I looked into the source, and indeed there are multiple occasions when None is returned. To me, it looks like a documentation mismatch.
How would you react if we simply say None is returned if a problem is encountered?
BTW why do you say the PDF object definition looks weird? For me, there is no obvious problem ...
You could try extracting the stream in addition to the raw stream. Some images need a decompression before they can be recognized.
doc.extract_image(xref) has some logic to work through the alternatives and decide whether to take the decompressed stream or the raw stream. This is especially relevant when multiple filters have been applied to produce the binary that is ultimately stored in the PDF.
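To illustrate the suggestion above, here is a minimal sketch that tries the raw stream first, then the decompressed stream, and finally falls back to extract_image(). The helper name describe_image and the profile_fn parameter are hypothetical; in real use profile_fn would presumably be fitz.image_profile and doc an open fitz document. It assumes image_profile() may return None or an empty dict on failure, as discussed in this thread.

```python
def describe_image(doc, xref, profile_fn):
    """Return image metadata for xref, or {} if nothing could be determined.

    doc        : object offering xref_stream_raw(), xref_stream(), extract_image()
    profile_fn : callable such as fitz.image_profile (assumed to return a dict,
                 an empty dict, or None)
    """
    # Try the stream as stored first, then the decompressed variant.
    for getter in (doc.xref_stream_raw, doc.xref_stream):
        data = getter(xref)
        info = profile_fn(data) if data else None
        if info:  # skips both None and {}
            return info

    # Last resort: let extract_image() do its own filter handling,
    # and drop the (possibly large) binary payload from the result.
    extracted = doc.extract_image(xref)
    if extracted:
        return {k: v for k, v in extracted.items() if k != "image"}
    return {}
```

With PyMuPDF installed, a call might look like describe_image(doc, 370, fitz.image_profile). Always returning a dict means callers can use .get() without a separate None check.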
@JorjMcKie
How would you react if we simply say None is returned if a problem is encountered?
I mean, that's the reality of things so even if this were fixed, supporting older versions would require dealing with that possibility.
I've already worked around the issue in my code, although it did involve adding code to check for a None result.
Originally, based on the documentation, the method my class calls to decide which extension to use when saving an embedded image to a file looked like this:
def _get_image_extension(self, xref):
    img = fitz.image_profile(self.doc.xref_stream_raw(xref))
    return img.get('ext', 'png')
The more complicated version that replaced it, which has to perform an if img: check before calling .get(), and then handle the defaulting of the return value to png outside of the lookup, isn't especially heinous. Just less elegant.
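For illustration, the None-tolerant variant described above could look roughly like the sketch below. It is not the actual replacement code from my class; the function name extension_from_profile is hypothetical, and it takes the already-computed profile so it can be shown without a document at hand.

```python
from typing import Optional


def extension_from_profile(profile: Optional[dict], default: str = "png") -> str:
    """Pick a file extension from an image_profile()-style result.

    profile is assumed to be a dict, an empty dict, or None.
    """
    if not profile:  # covers both None and {}
        return default
    return profile.get("ext", default)
```

Inside the method it would be called as extension_from_profile(fitz.image_profile(self.doc.xref_stream_raw(xref))).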
Looking at it pragmatically, the two options I see are:
- change the code to return {} wherever it currently returns None instead.
- change the documentation to say None is returned (and ideally make that consistent, if it isn't)
(I suppose that's not really a hard requirement, because an if img: on an empty dict will still fail the condition.)
Scratch that, the likelihood of some code using an if img is not None: test instead still makes a case for consistency.
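To make the consistency point concrete, this small sketch shows how the two kinds of test treat the possible return values. The helper name checks is hypothetical.

```python
def checks(result):
    """Return (passes 'if result:', passes 'if result is not None:')."""
    return bool(result), result is not None


# An empty dict fails the truthiness test but passes the identity test,
# so the two styles of caller only agree if failures are always None.
for value in ({}, None, {"ext": "png"}):
    print(repr(value), checks(value))
```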
BTW why do you say the PDF object definition looks weird? For me, there is no obvious problem ...
Oh, I didn't say problem. (Or at least, I don't recall saying it.) I've never personally encountered a /DecodeParms block like that attached to an image before. So it struck me as unusual, and I presumed it had something to do with image_profile()'s failure to extract any info.
Sounds like I was probably right about the latter, though perhaps not about it being at all unusual. My experience with PDF files' internal structure (and variations thereof) is probably still pretty narrow.
This extra info is probably redundant anyway. Never mind.
I recommend you try image_profile(doc.object_stream(xref)) instead of doc.object_stream_raw(xref).
I bet it works.
@JorjMcKie I'm afraid now I'm (even more) confused.
>>> import fitz
>>> d = fitz.open(...)
>>> d.page_count
122
>>> d[0].get_images()
[(370, 0, 954, 1467, 8, 'DeviceRGB', '', 'Im0', 'FlateDecode')]
>>> fitz.image_profile(d.object_stream(370))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Document' object has no attribute 'object_stream'
@JorjMcKie
Oh, sorry, you meant document.xref_stream(). Nope, image_profile returns the same None with that as for xref_stream_raw().
Hm - unexpected. I would need the PDF to clarify.
Fixed.
Describe the bug (mandatory)
When attempting to examine a PDF page's embedded (what I'm willing to concede/presume to be) "exotic" image by xref, using the recommended fitz.image_profile(document.xref_stream_raw(xref_num)) call, the return value is None despite the documentation claiming...
Unfortunately I don't have a sample file to provide, as I can't share the one for which this occurs. I apologize for that, and I'm filing this report with the explicit understanding that it is incomplete and may not be actionable. I will work on finding or creating an appropriate sample.
Pending that, I decided to at least get the report filed. I'll provide as much detail as possible (and more if requested), in the hope that it might be sufficient to track down the issue.
To Reproduce (mandatory)
The slight lag in extract_image() returning makes me think that the image is being converted to PNG from some "exotic" format, as documented. My only concern is image_profile() returning None, instead of {} as it's supposed to.
This is what the xref looks like; it seems quite odd to me. Still, extract_image() does produce a valid PNG if its image key is saved to disk.
And here's the page object:
Expected behavior (optional)
fitz.image_profile(...) returns an empty dict, {}.
Your configuration (mandatory)
PyMuPDF was installed from the distro python3-PyMuPDF-1.22.3-2.fc38.x86_64.rpm package.