jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Cannot .to_image() a FilteredPage class instance. #784

Closed jamiejcole closed 1 year ago

jamiejcole commented 1 year ago

Describe the bug

I'm calling .to_image() on a FilteredPage object because I don't want bold text to appear in the image export. However, the bold "1" remains in the image.

image

Code to reproduce the problem

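# Vertical band of the page to export; topY and bottomY are computed elsewhere.
coords = (0, topY, PAGE.width, bottomY)
# Keep only non-bold characters (this also drops every non-char object).
clean_text = PAGE.filter(lambda obj: obj["object_type"] == "char" and "Bold" not in obj["fontname"])

croppedImage = clean_text.crop(coords)
image = croppedImage.to_image()
out_path = f'img/tmp/tmp-{questionNumber}.png' if temp else f'img/OUT-{questionNumber}.png'
# Use a context manager so the file handle is closed after saving.
with open(out_path, 'wb') as file:
    image.save(file, format="PNG")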

PDF file

Relevant page: SDDPDF.pdf

Expected behavior

The clean_text FilteredPage should remove the bold 1 as seen in the screenshot of the PDF.

Actual behavior

The 1 remains within the .to_image() export.

jamiejcole commented 1 year ago

When I try to construct my own Page instance from the filtered page, I just get the error AttributeError: 'FilteredPage' object has no attribute 'attrs'

from pathlib import Path
import pdfplumber.page

clean_text = PAGE.filter(lambda obj: obj["object_type"] == "char" and "Bold" not in obj["fontname"])
filename = Path("./path-to-pdf.pdf")
x = pdfplumber.page.Page(filename, clean_text, 3)  # 3 is the page number that this image is on

croppedImage = x.crop(coords)

jsvine commented 1 year ago

Thanks for filing this, @jamiejcole. You've identified a point I should clarify in the documentation: .to_image(...) does not and (unfortunately, with the current architecture) cannot take .filter(...)-introduced changes into account. That's because the .to_image(...) method just hands off the PDF and page number to Wand for rendering; pdfplumber then crops the rendered image if necessary (e.g., if it's a CroppedPage).

I've now added a note to the README.md file: https://github.com/jsvine/pdfplumber/commit/dbaf0cce5332475f7cd259cdc13777e6b943b20e
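
Since .to_image(...) always renders the original PDF, one possible workaround is to paint over the unwanted glyphs after rendering. The sketch below is only an illustration and not something taken from the pdfplumber documentation: bold_char_boxes and render_without_bold are hypothetical helper names, it assumes bold fonts carry "Bold" in their fontname (as in the report above), and it assumes a white page background. It relies on PageImage.draw_rects, which accepts a list of (x0, top, x1, bottom) boxes.

```python
def bold_char_boxes(chars, marker="Bold"):
    """Return (x0, top, x1, bottom) boxes for chars whose fontname contains marker.

    Hypothetical helper: `chars` is a list of pdfplumber char dicts
    (for example, page.chars).
    """
    return [
        (c["x0"], c["top"], c["x1"], c["bottom"])
        for c in chars
        if marker in c.get("fontname", "")
    ]


def render_without_bold(page, bbox=None, resolution=150):
    """Render a page (optionally cropped) and cover bold glyphs with white boxes.

    Hypothetical helper: `page` is a pdfplumber Page or CroppedPage.
    Assumes a white background; anything else under a glyph's box is hidden too.
    """
    target = page.crop(bbox) if bbox is not None else page
    boxes = bold_char_boxes(target.chars)
    im = target.to_image(resolution=resolution)
    if boxes:
        # Opaque white fill and stroke so the glyph is fully covered.
        im.draw_rects(boxes, fill=(255, 255, 255), stroke=(255, 255, 255))
    return im


# Usage, assuming PAGE and coords are defined as in the report above:
#     im = render_without_bold(PAGE, coords)
#     im.save("out.png", format="PNG")
```

This only masks the rendered pixels; it does not change what the PDF renderer draws, so it works best when the bold characters sit on a plain background and do not overlap content you want to keep.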