Open RyanCPeters opened 4 years ago
Also, I should mention that I'm not intimately familiar with python 2, and didn't test if these recommended changes are compatible with python 2.7
If they are in conflict with backward compatibility, then please mark this issue as closed. Thanks again for making a super convenient tool!
Thanks for raising this issue, @RyanCPeters! Between the various edits — which is fine! — I'm trying to understand the current goal/value of the proposal here. Is the core idea that explicitly annotating these properties with @property
and @*.setter
makes the functionality clearer? Or do the changes add functionality beyond that?
Separately, I'm thinking there may be a clearer, more succinct property name than .underlying_image
(because I wondered, "underlying what?). Perhaps:
.im
(as Image objects are often referenced in Pillow documentation, though perhaps this is too cryptic).image
(although PageImage.image
may be a bit confusing).pil_obj
(a bit clearer about what it is, but only if you're familiar with PIL/Pillow)Hey @jsvine, thanks for taking a look at this suggestion! The intention for this suggestion was to promote the use of class properties so that it may be easier to create subclasses to facilitate interfacing between pdfplumber
and different image manipulation libraries in the future.
I'm only just seeing your response, so I'll take some time this evening to reacquaint myself with what my thoughts were when I wrote the initial suggestion, and I should have a proper response for you by tomorrow ;)
Did this get resolved in a useful way? I like the table extraction, but I am using it for scanned pdf's which I want to extract the tables from and crop to them, and then export to read with tesseract. I could export into an image file (png, jpg etc) and then read, but this is kind of funny. So I would really like to just take the image into PIL or OpenCV. I tried Ryan's code but while it seemed to give the attribute at the coding stage, the code failed with an error AttributeError: 'PageImage' object has no attribute 'underlying_image'
@mbhoshen and thanks for your interest in the library. A few notes:
I like the table extraction, but I am using it for scanned pdf's which I want to extract the tables from and crop to them
Just a warning: pdfplumber
is optimized for digital PDFs, rather than scanned ones. It's likely/possible that you'll see subpar results.
So I would really like to just take the image into PIL or OpenCV.
You can do this, via my_page.to_image().original
, which will return a PIL/Pillow Image object.
I tried Ryan's code but while it seemed to give the attribute at the coding stage, the code failed with an error
Ryan's code has not been incorporated into pdfplumber
, which is likely why you're encountering an error. But, as noted above, you don't need that code to access the underlying PIL/Pillow Image objects.
Thanks !!!
בתאריך יום ג׳, 20 באוק׳ 2020, 03:23, מאת Jeremy Singer-Vine < notifications@github.com>:
@mbhoshen https://github.com/mbhoshen and thanks for your interest in the library. A few notes:
I like the table extraction, but I am using it for scanned pdf's which I want to extract the tables from and crop to them
Just a warning: pdfplumber is optimized for digital PDFs, rather than scanned ones. It's likely/possible that you'll see subpar results.
So I would really like to just take the image into PIL or OpenCV.
You can do this, via my_page.to_image().original, which will return a PIL/Pillow Image object.
I tried Ryan's code but while it seemed to give the attribute at the coding stage, the code failed with an error
Ryan's code has not been incorporated into pdfplumber, which is likely why you're encountering an error. But, as noted above, you don't need that code to access the underlying PIL/Pillow Image objects.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jsvine/pdfplumber/issues/186#issuecomment-712513339, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABBFRZREJHM2DG4ZD6WOH3TSLTKAXANCNFSM4KU3QS6Q .
@jsvine sorry for ghosting there, but ive looked back at my old notes when i was messing with the issue initially and simplified the suggestion to the following.
The reasoning for using getter/setter properties is to improve and simplify the separation and encapsulation of image processing from the pdf parsing logic. This should untie the project from specifically depending on PIL so that users can more easily create subclasses that implement application specific image procssing tools.
Im writing this from my phone while i wait for anither appointment, so posting example code is kinda hard for the moment. I'll plan to post a more illustrative example when i get home later.
@jsvine Ok, after creating an example that was almost purely abstract (as in it tried to only illustrate suggested changes) I decided that it would take more work to communicate than to just implement the changes outright :)
So I've pulled a fork of the repo and implemented the necessary helper classes to encapsulate the image processing functionality of .display.PageImage
and created a small demonstration of how to subclass this functionality to create custom tools for image processing that don't interfere or modify any of the pdf processing logic.
These examples can be seen here: https://github.com/RyanCPeters/pdfplumber/tree/stable/examples/subclassing_example
Having said that, I did have to make small changes to how the .pdf.PDF
class and .page.Page
(and the subclasses of Page) are initialized in order to allow a user to specify if they want to use the default image processing or their custom implementation.
Thank you, @RyanCPeters for this very detailed response / repo. I really appreciate your attention to it. I am, however, still a bit confused about the bigger picture (pun not intended but retained).
In my original conception (not to say that there aren't better approaches), PageImage
exists in pdfplumber
for the primary/sole goal of helping people to debug the library's output, such as char
/rect
/etc. positions and TableFinder
results. As such, I didn't think people would care much about what image-generation/annotation library was used under the hood, since pdfplumber
abstracts those specifics via the .draw_*(...)
methods.
This issue here suggests you actually want to use the results of PageImage
in more varied ways. That I understand. And yet: What are the advantages of subclassing PageImage to use a non-PIL library that, for instance, simply saving the underlying PageImage
image to file (or BytesIO
), and then reading it with your image-library-of-choice, could not achieve?
I appreciate the general advantages of abstracting classes; on the other hand, this would lead to more code to maintain and more promises to keep in future versions (e.g., guaranteeing that pdfplumber
will continue to provide a consistent interface for those subclasses). So I'm not opposed to implementing something like what you suggest, but I do want to get a better sense of the advantages/disadvantages and particular roadblocks you see with the current setup.
To that end, could you (or other interested parties) provide a simple functioning code example that demonstrates where this abstraction would be particularly helpful?
(I recognize and appreciate that your linked repo provides code examples of how to use your subclassed approach, but I'm more interested in code examples that demonstrate where pdfplumber
's current approach falls short.)
This is an enhancement suggestion along with a justification based on my own experience.
Hello and thank you for developing such a useful tool! I've been using it to extract tables of assembly opcodes from pdf for later use in automatic generation of jump tables in a class assignment, it's been very nice.
However, I had to jump through a bunch of hoops in order to use the
pdfplumber.Page.to_image(...)
function outside the context of a Jupyter notebook. I wanted to use the images gotten for each table in the pdf file in a Plotly-dashboard display that lets me view the extracted table data alongside images of the pdf's source tables. But without access to the PIL image instance, I was stuck until I finally decided to modify my local environment's pdfplumber/display.py file. This involved modifying thePageImage
class to expose the underlyingPIL.Image.Image
instance via property getter/setter.EDIT: I forgot to mention that I did eventually notice that the self.annotated and self.original exposed the PIL.Image.Image instance. However, the code doesn't self-document this fact. The recommended changes below should make it explicitly clear to futures users how they can extract the PIL image from PageImage.
These changes are pretty easy to do, so if you want to hit them yourself that's cool; or If you'd like I can fork the repo, make those changes and then issue a pull request?
Here's a minimal version of the changes I made and a few terce explanations for them.
EDIT: Initial code sample showed the alias member I made for testing, self._annotated, instead of the correct member self.annotated, that's been corrected.
EDIT: I missed correcting the template variable name,
value
to the self-documenting nameimage_or_mode
in a couple of places, that's also corrected now.