pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

fitz.image_profile(document.xref_stream_raw($NUM)) returning None instead of {} #2501

Closed ferdnyc closed 1 year ago

ferdnyc commented 1 year ago

Describe the bug (mandatory)

When I examine a PDF page's embedded image (one I'm willing to presume is "exotic") by xref, using the recommended fitz.image_profile(document.xref_stream_raw(xref_num)) call, the return value is None, despite the documentation claiming:

No exception is ever raised: in case of error, the empty dictionary {} is returned.

Unfortunately I don't have a sample file to provide, as I can't share the one for which this occurs. I apologize for that, and I'm filing this report with the explicit understanding that it is incomplete and may not be actionable. I will work on finding or creating an appropriate sample.

Pending that, I decided to at least get the report filed. I'll provide as much detail as possible (and more if requested), in the hope that it might be sufficient to track down the issue.

To Reproduce (mandatory)

>>> d = fitz.open("/path/to/the.pdf")
>>> p = d[0]
>>> p.get_images()
[(370, 0, 954, 1467, 8, 'DeviceRGB', '', 'Im0', 'FlateDecode')]
>>> fitz.image_profile(d.xref_stream_raw(370))
>>> fitz.image_profile(d.xref_stream_raw(370)) is None
True
>>> img = d.extract_image(370)
>>> img.keys()
dict_keys(['ext', 'smask', 'width', 'height', 'colorspace', 'bpc', 'xres', 'yres', 'cs-name', 'image'])
>>> from pprint import pp
>>> pp({k: img[k] for k in ['ext', 'width', 'height', 'colorspace', 'xres', 'yres']})
{'ext': 'png',
 'width': 954,
 'height': 1467,
 'colorspace': 3,
 'xres': 96,
 'yres': 96}

The slight lag in extract_image() returning makes me think that the image is being converted to PNG from some "exotic" format, as documented. My only concern is image_profile() returning None, instead of {} as it's supposed to.

This is what the xref object looks like; it seems quite odd to me. Still, extract_image() does produce a valid PNG if its image key is saved to disk.

>>> print(d.xref_object(370))
<<
  /BitsPerComponent 8
  /ColorSpace /DeviceRGB
  /DecodeParms <<
    /BitsPerComponent 8
    /Colors 3
    /Columns 954
    /Predictor 15
  >>
  /Filter /FlateDecode
  /Height 1467
  /Subtype /Image
  /Type /XObject
  /Width 954
  /Length 1792019
>>

And here's the page object:

>>> d.page_xref(0)
369
>>> print(d.xref_object(369))
<<
  /Contents 371 0 R
  /MediaBox [ 0 0 715.5 1100.25 ]
  /Parent 364 0 R
  /Resources <<
    /XObject <<
      /Im0 370 0 R
    >>
  >>
  /Type /Page
>>
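Saving the extracted bytes to disk, as mentioned above, could be sketched roughly as follows. This is not code from the report: `save_extracted_image` and its parameters are hypothetical names, and the only assumption about the extract_image() result is that it is a dict carrying the 'ext' and 'image' keys listed in the session above.

```python
from pathlib import Path


def save_extracted_image(img: dict, stem: str, directory: str = ".") -> Path:
    """Write the bytes under img['image'] to '<stem>.<ext>' and return the path.

    `img` is assumed to have the shape shown in the session above:
    a dict with at least the keys 'ext' (str) and 'image' (bytes).
    """
    path = Path(directory) / f"{stem}.{img['ext']}"
    path.write_bytes(img["image"])
    return path


# Hypothetical usage with the document `d` from the session above:
# save_extracted_image(d.extract_image(370), "xref-370")
```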

Expected behavior (optional)

fitz.image_profile(...) returns an empty dict, {}.

Your configuration (mandatory)

>>> print(f"{sys.version}\n{distro.name(True)}\n{fitz.__doc__}")
3.11.3 (main, May 24 2023, 00:00:00) [GCC 13.1.1 20230511 (Red Hat 13.1.1-2)]
Fedora Linux 38 (Thirty Eight)

PyMuPDF 1.22.3: Python bindings for the MuPDF 1.22.0 library.
Version date: 2023-05-10 00:00:01.
Built for Python 3.11 on linux (64-bit).

PyMuPDF was installed from the distro python3-PyMuPDF-1.22.3-2.fc38.x86_64.rpm package.

JorjMcKie commented 1 year ago

Well, I looked into the source - and indeed there are multiple occasions when None is returned. For me, it looks like a documentation mismatch. How would you react if we simply say None is returned if a problem is encountered?

JorjMcKie commented 1 year ago

BTW why do you say the PDF object definition looks weird? For me, there is no obvious problem ... You could try extracting the stream in addition to the raw stream. Some images need decompression before they can be recognized. doc.extract_image(xref) has some logic to dig its way through the various alternatives, such as taking the decompressed stream as opposed to the raw stream. This is especially relevant when multiple filters have been used to produce the binary that is ultimately stored in a PDF.
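The suggestion to try more than one form of the stream could be sketched as below. This is only an illustration under stated assumptions, not the logic inside extract_image(): `first_profile` is a hypothetical helper, and the profiler is passed in as a callable that may return a dict, {} or None.

```python
def first_profile(candidates, profiler):
    """Return the first non-empty profile found among candidate byte strings.

    `profiler` is any callable mapping bytes to a dict, {} or None,
    for example fitz.image_profile.
    """
    for data in candidates:
        if not data:
            continue  # skip missing or empty streams
        profile = profiler(data)
        if profile:  # rejects both None and {}
            return profile
    return {}


# Hypothetical usage with an open document `doc` and an image xref:
# profile = first_profile(
#     (doc.xref_stream_raw(xref), doc.xref_stream(xref)),
#     fitz.image_profile,
# )
```

Returning {} when nothing is recognized keeps the caller's `.get()` lookups safe regardless of what the profiler itself returns.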

ferdnyc commented 1 year ago

@JorjMcKie

How would you react if we simply say None is returned if a problem is encountered?

I mean, that's the reality of things so even if this were fixed, supporting older versions would require dealing with that possibility.

I've already worked around the issue in my code, although it did involve adding code to check for a None result.

Originally, based on the documentation, my class's method that is called to decide what extension to use when it needs to save an embedded image to a file looked like this:

def _get_image_extension(self, xref):
    img = fitz.image_profile(self.doc.xref_stream_raw(xref))
    return img.get('ext', 'png')

The more complicated version that replaced it, which has to perform an if img: check before calling .get(), and then handle the defaulting of the return value to png outside of the lookup, isn't especially heinous. Just less elegant.
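The replacement itself is not shown in the thread; a minimal sketch of a lookup that tolerates both None and {} might look like this. `image_extension` is a hypothetical name, and the function takes the profile as an argument so that it does not depend on any particular PyMuPDF version.

```python
def image_extension(profile, default="png"):
    """Return the 'ext' entry of an image profile, or `default`.

    `profile` may be a populated dict, an empty dict, or None.
    """
    if not profile:  # covers both None and {}
        return default
    return profile.get("ext", default)


# Hypothetical usage inside the method quoted above:
# return image_extension(fitz.image_profile(self.doc.xref_stream_raw(xref)))
```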

Looking at it pragmatically, the two options I see are:

  1. Just change the documentation to say None is returned (and ideally make that consistent, if it isn't)
  2. Fix the code to match the docs, and change the documentation to say that since version 1.2x, a dict (possibly empty) is always returned, but earlier versions sometimes returned None instead.

ferdnyc commented 1 year ago

  1. change the documentation to say None is returned (and ideally make that consistent, if it isn't)

(I suppose that's not really a hard requirement, because an if img: on an empty dict will still fail the condition.)

Scratch that, the likelihood of some code using an if img is not None: test instead still makes a case for consistency.
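The difference between the two tests can be shown in a few lines of plain Python. The helper names here are made up for the illustration.

```python
def passes_truthiness_check(profile):
    # Equivalent to: if img:
    return bool(profile)


def passes_none_check(profile):
    # Equivalent to: if img is not None:
    return profile is not None


# Map each possible return value to (truthiness result, None-check result).
results = {
    repr(value): (passes_truthiness_check(value), passes_none_check(value))
    for value in (None, {}, {"ext": "png"})
}
```

Only the empty dict is treated differently by the two checks, which is why callers written against one documented failure value can misbehave when the other is returned.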

ferdnyc commented 1 year ago

BTW why do you say the PDF object definition looks weird? For me, there is no obvious problem ...

Oh, I didn't say problem. (Or at least, I don't recall saying it.) I've never personally encountered a /DecodeParms block like that attached to an image before. So it struck me as unusual, and I presumed it had something to do with image_profile()'s failure to extract any info.

Sounds like I was probably right about the latter, though perhaps not about it being at all unusual. My experience with PDF files' internal structure (and variations thereof) is probably still pretty narrow.
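For background: in the PDF specification, /Predictor values 10 through 15 select PNG-style row filters that are applied to the pixel rows before /FlateDecode compression. A stream stored this way is zlib data, not a PNG file, so it carries no image-file signature. The sketch below illustrates that point with a tiny stand-in stream built with the standard library. Whether this is the reason image_profile() returned None here is an assumption and is not confirmed in the thread.

```python
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Check only for the 8-byte PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


# Stand-in for a 2x2 RGB image: each row is prefixed with filter byte 0
# (the PNG "None" filter), then the rows are zlib-compressed.
rows = b"".join(
    b"\x00" + bytes(pixel * 2) for pixel in [(255, 0, 0), (0, 0, 255)]
)
raw_stream = zlib.compress(rows)

raw_is_png = looks_like_png(raw_stream)
decoded_is_png = looks_like_png(zlib.decompress(raw_stream))
```

Neither the compressed bytes nor the decompressed rows begin with a file signature, so a profiler that identifies formats by their headers has nothing to recognize in either form.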

JorjMcKie commented 1 year ago

This extra info is probably redundant anyway. Never mind. I recommend you try image_profile(doc.object_stream(xref)) instead of doc.object_stream_raw(xref). I bet it works.

ferdnyc commented 1 year ago

@JorjMcKie I'm afraid now I'm (even more) confused.

>>> import fitz
>>> d = fitz.open(...)
>>> d.page_count
122
>>> d[0].get_images()
[(370, 0, 954, 1467, 8, 'DeviceRGB', '', 'Im0', 'FlateDecode')]
>>> fitz.image_profile(d.object_stream(370))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Document' object has no attribute 'object_stream'

ferdnyc commented 1 year ago

@JorjMcKie

Oh, sorry, you meant document.xref_stream(). Nope, image_profile returns the same None with that as for xref_stream_raw().

JorjMcKie commented 1 year ago

Hm - unexpected. I would need the PDF to clarify.

JorjMcKie commented 1 year ago

Fixed.