jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

to_image of grey text results in a fully white image #443

Closed: linuxsoftware closed this 2 years ago

linuxsoftware commented 3 years ago

Thank you for this extremely useful library.

I had a problem with visual debugging of a PDF that was mostly grey. All the text turned white so it could not be seen.

Here is an example PDF.

The problem is that ImageMagick creates the image of the page as a 16-bit greyscale PNG, but Pillow has a documented issue with converting that to RGB. (See https://stackoverflow.com/questions/19892919/pil-converting-an-image-with-mode-i-to-rgb-results-in-a-fully-white-image and https://github.com/python-pillow/Pillow/issues/3011)
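To make the failure mode concrete, here is a rough sketch in plain Python of the arithmetic the linked reports describe: clamping a 16-bit sample into the 8-bit range turns almost every grey into white, whereas rescaling keeps the tones apart. This is an illustration only. It does not call Pillow, and the function names and sample values are made up for the example.

```python
# Illustration only: plain-Python arithmetic, not Pillow's actual code path.

MAX_8BIT = 255
MAX_16BIT = 65535


def clamp_to_8bit(sample: int) -> int:
    """Clip a 16-bit sample into 0..255, discarding everything above 255."""
    return max(0, min(sample, MAX_8BIT))


def scale_to_8bit(sample: int) -> int:
    """Rescale a 16-bit sample so that 0..65535 maps onto 0..255."""
    return sample * MAX_8BIT // MAX_16BIT


# Hypothetical 16-bit grey levels: black, dark grey, mid grey, white.
samples = [0, 16384, 32768, 65535]

clamped = [clamp_to_8bit(s) for s in samples]
scaled = [scale_to_8bit(s) for s in samples]

print(clamped)  # [0, 255, 255, 255]
print(scaled)   # [0, 63, 127, 255]
```

With clamping, the dark grey and mid grey samples both come out as 255, which matches the all-white result seen here. Having ImageMagick write 8-bit data in the first place sidesteps the conversion entirely.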

My hack has been to change display.py so that ImageMagick creates the image as an 8-bit PNG using convert("png8"), which Pillow can then cope with. This "works for me".

--- a/pdfplumber/display.py
+++ b/pdfplumber/display.py
@@ -41,7 +41,7 @@ def get_page_image(stream, page_no, resolution):
         if img.alpha_channel:
             img.background_color = wand.image.Color("white")
             img.alpha_channel = "remove"
-        with img.convert("png") as png:
+        with img.convert("png8") as png:
             im = PIL.Image.open(BytesIO(png.make_blob()))
             return im.convert("RGB")


jsvine commented 3 years ago

Hi @linuxsoftware, and thanks for flagging this! Since the default seems to work well for most PDFs, I'd lean toward an approach that allows the user to specify the conversion mode via an argument passed to get_page_image(...) and Page.to_image(...). I'll put this on my todo list, though you're also welcome to submit a PR.

linuxsoftware commented 3 years ago

I was thinking about this and realized it is already possible to pass a user-created image to to_image via its original argument, so perhaps the code does not need to change at all.

e.g.

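from io import BytesIO

import PIL.Image
import wand.image

def my_page_image(page):
    # Assumes the PDF stream is a real file with a .name (not e.g. a BytesIO)
    stream = page.pdf.stream
    # pdfplumber page numbers are 1-based; ImageMagick's [n] index is 0-based
    page_no = page.page_number - 1
    with wand.image.Image(resolution=150,
                          filename=f"{stream.name}[{page_no}]") as img:
        # "png8" asks ImageMagick for an 8-bit PNG, which Pillow can convert
        with img.convert("png8") as png:
            im = PIL.Image.open(BytesIO(png.make_blob()))
            return im.convert("RGB")

pi = page.to_image(original=my_page_image(page))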

The main thing is for the user to be aware of Pillow's 8-bit limitation when converting images. Perhaps it is enough that this conversation will now show up in searches, or perhaps it's worth a note in the Visual Debugging documentation?

jsvine commented 2 years ago

I believe that the latest version(s) of pdfplumber, which make some more generalized improvements/changes, now convert your PDF to an acceptable image:

[Image: tmp-grey]