jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

to_image doesn't accept parameter "width" #798

Closed: pseudomonas closed this issue 1 year ago

pseudomonas commented 1 year ago

Describe the bug

page.to_image(resolution=300) works fine (for any value of 300)

page.to_image(width=1000) does not work, even though the docs refer to https://docs.wand-py.org/en/latest/wand/image.html#wand.image.Image, which says that width should be an accepted parameter.

Traceback (most recent call last):
  File "[FILE_PATH_REDACTED]", line 151, in <module>
    im = page.to_image(width=1000)
  File "/home/username/miniconda3/envs/envname/lib/python3.10/site-packages/pdfplumber/page.py", line 386, in to_image
    return PageImage(self, **kwargs)
TypeError: PageImage.__init__() got an unexpected keyword argument 'width'

Process finished with exit code 1

Environment

jsvine commented 1 year ago

Hi @pseudomonas, and thanks for raising this issue. Right now, resolution is the only wand.image.Image kwarg passable. I should either enable all kwargs or clarify the documentation on this point. To better understand the use-case: What is your particular intent with passing width?

pseudomonas commented 1 year ago

I had previously converted the PDF to images with Ghostscript and done some image processing on the images. I wanted to generate some equivalently sized images showing the PDF annotations.

In the end, I found another way to do it. But it'd be really useful to have some methods for scaling objects, so that I can convert the PDF measurement units to pixels given one of a page-image height, a page-image width, or a DPI value.

jsvine commented 1 year ago

Thanks, that's helpful context. If I'm understanding correctly, I think your suggestion is this: to be able to call, e.g., page.to_image(width=1000) and have pdfplumber figure out the implied resolution based on the page width. Is that correct? If so, I think that makes sense and can see adding that.
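The "implied resolution" idea can be sketched as plain arithmetic. This is a minimal illustration, not pdfplumber's implementation: the function name implied_resolution is hypothetical, and it assumes the default PDF user unit of 1/72 inch, so that a page's width in points maps to pixels at resolution / 72 pixels per point.

```python
# Hypothetical helper, not part of pdfplumber's API.
POINTS_PER_INCH = 72  # assumption: default PDF user unit is 1/72 inch


def implied_resolution(page_width_pt: float, target_width_px: float) -> float:
    """Return the DPI at which a page page_width_pt points wide
    renders as an image target_width_px pixels wide."""
    if page_width_pt <= 0:
        raise ValueError("page width must be positive")
    return POINTS_PER_INCH * target_width_px / page_width_pt


# A page 720 points (10 inches) wide needs 100 DPI to come out 1000 px wide.
print(implied_resolution(720, 1000))
```

With a pdfplumber page, page.width would presumably be the value to pass as page_width_pt.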

pseudomonas commented 1 year ago

My original use case was exactly as you say there, yes.

My second request was for a general translator of user-units to pixels so I could say:

page.convert_units_to_pixels(mychar["x0"], width=1000) and know how far mychar is from the border of my image (which I have colour-pre-processed in ways that are outside the scope of pdfplumber).

pseudomonas commented 1 year ago

(actual use-case of that: find the pixel region just to the right of the last character in each line, and see if there's an un-OCR'ed hyphen lurking there)

Actually, a pair of methods, convert_units_to_pixels and convert_pixels_to_units, would be ideal. It's not exactly hard to do some division and rounding, but it'd be a nice utility to have.
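For reference, the "division and rounding" might look roughly like the sketch below. These are standalone, hypothetical functions rather than pdfplumber methods; the names mirror the ones proposed above, while the signatures (passing the page dimensions explicitly, plus exactly one of width, height, or resolution) are assumptions made for illustration. The sketch also assumes a 1/72-inch user unit and a page whose bounding box starts at the origin.

```python
# Hypothetical helpers, not part of pdfplumber's API.
from typing import Optional

POINTS_PER_INCH = 72  # assumption: default PDF user unit is 1/72 inch


def _pixels_per_unit(
    page_width: float,
    page_height: float,
    *,
    width: Optional[float] = None,
    height: Optional[float] = None,
    resolution: Optional[float] = None,
) -> float:
    """Return the scale factor (pixels per PDF unit) implied by exactly
    one of: image width in px, image height in px, or resolution in DPI."""
    given = [v for v in (width, height, resolution) if v is not None]
    if len(given) != 1:
        raise ValueError("pass exactly one of width, height, or resolution")
    if width is not None:
        return width / page_width
    if height is not None:
        return height / page_height
    return resolution / POINTS_PER_INCH


def convert_units_to_pixels(value: float, page_width: float,
                            page_height: float, **size: float) -> int:
    """Convert a distance in PDF units to the nearest whole pixel."""
    return round(value * _pixels_per_unit(page_width, page_height, **size))


def convert_pixels_to_units(pixels: float, page_width: float,
                            page_height: float, **size: float) -> float:
    """Convert a distance in pixels back to PDF units."""
    return pixels / _pixels_per_unit(page_width, page_height, **size)


# A char whose x0 is 306 units from the left edge of a 612 x 792 page
# lands halfway across an image rendered 1000 px wide.
print(convert_units_to_pixels(306, 612, 792, width=1000))
```

On a cropped page, whose bounding box does not start at zero, the left or top offset of the box would need to be subtracted from the coordinate before scaling.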

jsvine commented 1 year ago

The width and height keyword arguments for .to_image(...) are now available in v0.8.0. Give them a spin and let me know what you think.

As for your second request: