Expose the underlying PIL image instance through property getter/setter

RyanCPeters commented 4 years ago

This is an enhancement suggestion along with a justification based on my own experience.

Hello and thank you for developing such a useful tool! I've been using it to extract tables of assembly opcodes from pdf for later use in automatic generation of jump tables in a class assignment, it's been very nice.

However, I had to jump through a bunch of hoops in order to use the pdfplumber.Page.to_image(...) function outside the context of a Jupyter notebook. I wanted to use the images gotten for each table in the pdf file in a Plotly-dashboard display that lets me view the extracted table data alongside images of the pdf's source tables. But without access to the PIL image instance, I was stuck until I finally decided to modify my local environment's pdfplumber/display.py file. This involved modifying the PageImage class to expose the underlying PIL.Image.Image instance via property getter/setter.

EDIT: I forgot to mention that I did eventually notice that the self.annotated and self.original exposed the PIL.Image.Image instance. However, the code doesn't self-document this fact. The recommended changes below should make it explicitly clear to futures users how they can extract the PIL image from PageImage.

These changes are pretty easy to do, so if you want to hit them yourself that's cool; or If you'd like I can fork the repo, make those changes and then issue a pull request?

Here's a minimal version of the changes I made and a few terce explanations for them.

# inside the display.py file
# ...

class PageImage(object):
    """Functions not shown here are simply left unchainged;

        The changes shown below move the function body from self.reset() into the new property
        setter for self.underlying_image

        We then tell self.reset() to call self.underlying_image = <some desired default value>

         Calling the self.underlying_image property allows the user to easily get the PIL.Image.Image 
         instance associated with the current state of the PageImage class instance. This allows users
         to easily analyze and manipulate the image data in libraries like opencv/numpy/plotly. 

         I used this modification to build a dashboard display of the extracted table data along side
         the source image data gathered from the pdf via pdfplumber"""

    @property
    def underlying_image(self):
        """returns the underlying PIL.Image.Image instance data for this PageImage object."""
        return self._annotated

    @underlying_image.setter
    def underlying_image(self, image_or_mode:Union[str,PIL.Image.Image])->None:
        """Accepts image_or_mode as either a string describing what image mode [PNG,TIF,GIF,... etc.]
        the image should be opened in, or as an already opened image which we will simply set as the
        value for self._annotated.

        :param image_or_mode: A string or PIL.Image.Image instance that we can use to properly build
                              or update our now protected self._annotated member.
        :type image_or_mode: Union[str,PIL.Image.Image]
        :return: None
        :rtype: None
        """
        if isinstance(image_or_mode,PIL.Image.Image):
            self._annotated = image_or_mode
        else:
            self._annotated = PIL.Image.new(self.original.mode, self.original.size)
            self._annotated.paste(self.original)
            if (isinstance(image_or_mode,str) and image_or_mode in self._valid_image_formats):
                self.draw = PIL.ImageDraw.Draw(self._annotated, image_or_mode)
            else:
                self.draw = PIL.ImageDraw.Draw(self._annotated, "RGBA")

    def reset(self):
        self.underlying_image = "RGBA" # can also just pass None
        return self

EDIT: Initial code sample showed the alias member I made for testing, self._annotated, instead of the correct member self.annotated, that's been corrected.

EDIT: I missed correcting the template variable name, value to the self-documenting name image_or_mode in a couple of places, that's also corrected now.

RyanCPeters commented 4 years ago

Also, I should mention that I'm not intimately familiar with python 2, and didn't test if these recommended changes are compatible with python 2.7

If they are in conflict with backward compatibility, then please mark this issue as closed. Thanks again for making a super convenient tool!

jsvine commented 4 years ago

Thanks for raising this issue, @RyanCPeters! Between the various edits — which is fine! — I'm trying to understand the current goal/value of the proposal here. Is the core idea that explicitly annotating these properties with @property and @*.setter makes the functionality clearer? Or do the changes add functionality beyond that?

Separately, I'm thinking there may be a clearer, more succinct property name than .underlying_image (because I wondered, "underlying what?). Perhaps:

.im (as Image objects are often referenced in Pillow documentation, though perhaps this is too cryptic)
.image (although PageImage.image may be a bit confusing)
.pil_obj (a bit clearer about what it is, but only if you're familiar with PIL/Pillow)
other ideas?

RyanCPeters commented 4 years ago

Hey @jsvine, thanks for taking a look at this suggestion! The intention for this suggestion was to promote the use of class properties so that it may be easier to create subclasses to facilitate interfacing between pdfplumber and different image manipulation libraries in the future.

I'm only just seeing your response, so I'll take some time this evening to reacquaint myself with what my thoughts were when I wrote the initial suggestion, and I should have a proper response for you by tomorrow ;)

mbhoshen commented 3 years ago

Did this get resolved in a useful way? I like the table extraction, but I am using it for scanned pdf's which I want to extract the tables from and crop to them, and then export to read with tesseract. I could export into an image file (png, jpg etc) and then read, but this is kind of funny. So I would really like to just take the image into PIL or OpenCV. I tried Ryan's code but while it seemed to give the attribute at the coding stage, the code failed with an error AttributeError: 'PageImage' object has no attribute 'underlying_image'

jsvine commented 3 years ago

@mbhoshen and thanks for your interest in the library. A few notes:

I like the table extraction, but I am using it for scanned pdf's which I want to extract the tables from and crop to them

Just a warning: pdfplumber is optimized for digital PDFs, rather than scanned ones. It's likely/possible that you'll see subpar results.

So I would really like to just take the image into PIL or OpenCV.

You can do this, via my_page.to_image().original, which will return a PIL/Pillow Image object.

I tried Ryan's code but while it seemed to give the attribute at the coding stage, the code failed with an error

Ryan's code has not been incorporated into pdfplumber, which is likely why you're encountering an error. But, as noted above, you don't need that code to access the underlying PIL/Pillow Image objects.

mbhoshen commented 3 years ago

Thanks !!!

בתאריך יום ג׳, 20 באוק׳ 2020, 03:23, מאת Jeremy Singer-Vine ‏< notifications@github.com>:

@mbhoshen https://github.com/mbhoshen and thanks for your interest in the library. A few notes:

I like the table extraction, but I am using it for scanned pdf's which I want to extract the tables from and crop to them

Just a warning: pdfplumber is optimized for digital PDFs, rather than scanned ones. It's likely/possible that you'll see subpar results.

So I would really like to just take the image into PIL or OpenCV.

You can do this, via my_page.to_image().original, which will return a PIL/Pillow Image object.

I tried Ryan's code but while it seemed to give the attribute at the coding stage, the code failed with an error

Ryan's code has not been incorporated into pdfplumber, which is likely why you're encountering an error. But, as noted above, you don't need that code to access the underlying PIL/Pillow Image objects.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jsvine/pdfplumber/issues/186#issuecomment-712513339, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABBFRZREJHM2DG4ZD6WOH3TSLTKAXANCNFSM4KU3QS6Q .

RyanCPeters commented 3 years ago

@jsvine sorry for ghosting there, but ive looked back at my old notes when i was messing with the issue initially and simplified the suggestion to the following.

The reasoning for using getter/setter properties is to improve and simplify the separation and encapsulation of image processing from the pdf parsing logic. This should untie the project from specifically depending on PIL so that users can more easily create subclasses that implement application specific image procssing tools.

Im writing this from my phone while i wait for anither appointment, so posting example code is kinda hard for the moment. I'll plan to post a more illustrative example when i get home later.

RyanCPeters commented 3 years ago

@jsvine Ok, after creating an example that was almost purely abstract (as in it tried to only illustrate suggested changes) I decided that it would take more work to communicate than to just implement the changes outright :)

So I've pulled a fork of the repo and implemented the necessary helper classes to encapsulate the image processing functionality of .display.PageImage and created a small demonstration of how to subclass this functionality to create custom tools for image processing that don't interfere or modify any of the pdf processing logic.

These examples can be seen here: https://github.com/RyanCPeters/pdfplumber/tree/stable/examples/subclassing_example

Having said that, I did have to make small changes to how the .pdf.PDF class and .page.Page (and the subclasses of Page) are initialized in order to allow a user to specify if they want to use the default image processing or their custom implementation.

jsvine commented 3 years ago

Thank you, @RyanCPeters for this very detailed response / repo. I really appreciate your attention to it. I am, however, still a bit confused about the bigger picture (pun not intended but retained).

In my original conception (not to say that there aren't better approaches), PageImage exists in pdfplumber for the primary/sole goal of helping people to debug the library's output, such as char/rect/etc. positions and TableFinder results. As such, I didn't think people would care much about what image-generation/annotation library was used under the hood, since pdfplumber abstracts those specifics via the .draw_*(...) methods.

This issue here suggests you actually want to use the results of PageImage in more varied ways. That I understand. And yet: What are the advantages of subclassing PageImage to use a non-PIL library that, for instance, simply saving the underlying PageImage image to file (or BytesIO), and then reading it with your image-library-of-choice, could not achieve?

I appreciate the general advantages of abstracting classes; on the other hand, this would lead to more code to maintain and more promises to keep in future versions (e.g., guaranteeing that pdfplumber will continue to provide a consistent interface for those subclasses). So I'm not opposed to implementing something like what you suggest, but I do want to get a better sense of the advantages/disadvantages and particular roadblocks you see with the current setup.

To that end, could you (or other interested parties) provide a simple functioning code example that demonstrates where this abstraction would be particularly helpful?

(I recognize and appreciate that your linked repo provides code examples of how to use your subclassed approach, but I'm more interested in code examples that demonstrate where pdfplumber's current approach falls short.)

jsvine / pdfplumber

Expose the underlying PIL image instance through property getter/setter #186