jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

relative=True in page.extract_text() not working #391

Closed LiutongZhou closed 3 years ago

LiutongZhou commented 3 years ago

The Bug

Passing relative box coordinates to crop and then calling extract_text does not work: page.crop(box_coordinates, relative=True)

Code to reproduce the problem

import pdfplumber

# Download the sample PDF (the leading "!" is Jupyter/IPython shell syntax)
!wget https://s22.q4cdn.com/407748750/files/doc_financials/2020/ar/2020-Proxy-Card.pdf -O Some.pdf --no-check-certificate

with pdfplumber.open("Some.pdf") as pdf:
    page = pdf.pages[0]
    box_coordinates = (0, 0, 1.0, 1.0)
    crop = page.crop(box_coordinates, relative=True)
    text = crop.extract_text()

assert text, "Not Working"

PDF file

https://s22.q4cdn.com/407748750/files/doc_financials/2020/ar/2020-Proxy-Card.pdf

Expected behavior

Return the text of the page

Actual behavior

Return nothing

Environment

samkit-jain commented 3 years ago

Hi @LiutongZhou, the issue you are facing is not necessarily a bug. You are getting None when extracting text from the cropped page because the cropped region contains no text. The cropped region is just a single point, as you can see from the image representation of the saved page below. (image)

The bounding box (0, 0, 1, 1) is just a square of area 1.

LiutongZhou commented 3 years ago

The bounding box (0, 0, 1, 1) is just a square of area 1 @samkit-jain

Hi @samkit-jain , is the above statement still true even if I set relative=True in the crop method? If so, what differentiates relative=True from relative=False?

I was expecting page.crop((a, b, c, d), relative=True) to be equivalent to page.crop((a * width, b * height, c * width, d * height)).

Please help me understand it.

Thanks

jsvine commented 3 years ago

Hi @LiutongZhou, I think you may be misunderstanding the units of a bounding box. (0, 0, 1.,1.) sounds like you may be trying to get the full width and height (since in some domains, 1 is equivalent to 100%). If that's your goal, then you will want something like this: (0, 0, page.width, page.height)
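For fractional boxes in general, the scaling can be done by hand before calling crop. The following is only a minimal sketch: `fraction_to_bbox` is a hypothetical helper, not part of pdfplumber, and it assumes the page's bounding box starts at (0, 0). The 612 x 792 page size is an assumed US Letter example.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def fraction_to_bbox(fractions: BBox, width: float, height: float) -> BBox:
    """Scale (x0, top, x1, bottom) fractions in [0, 1] to absolute page units."""
    fx0, ftop, fx1, fbottom = fractions
    return (fx0 * width, ftop * height, fx1 * width, fbottom * height)


# Hypothetical usage with pdfplumber (assumes the page origin is at (0, 0)):
#     crop = page.crop(fraction_to_bbox((0, 0, 1.0, 1.0), page.width, page.height))

# Assumed example page size: US Letter, 612 x 792.
full_page = fraction_to_bbox((0, 0, 1.0, 1.0), 612, 792)
top_half = fraction_to_bbox((0, 0, 1.0, 0.5), 612, 792)
```

With this, (0, 0, 1.0, 1.0) maps to the whole page and (0, 0, 1.0, 0.5) to its top half, and the resulting tuple is passed to crop without relative=True.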

LiutongZhou commented 3 years ago

Hi @jsvine. Thank you for your explanation. But I was hoping that crop((0, 0, 1., 1.), relative=True) would give me the same result as crop((0, 0, page.width, page.height), relative=False).

I assume this is the intention for having this optional parameter relative.

Is my understanding wrong?

jsvine commented 3 years ago

Ah, now I better understand your question. Thank you for clarifying. Here is an explanation of the relative parameter, from the documentation:

If relative=True, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See Issue #245 for a visual example and explanation.)

LiutongZhou commented 3 years ago

Okay, this is confusing :D

I would not be able to understand it if I hadn’t read the whole issue #245.

So relative=True is equivalent to relative=False when the whole page is being cropped. relative=True makes a difference only when it is called from a cropped page, in which case coordinates are relative to the origin (top-left corner) of the cropped region of the original page.

jsvine commented 3 years ago

So relative=True is equivalent to relative=False when the whole page is being cropped. relative=True makes a difference only when it is called from a cropped page, in which case coordinates are relative to the origin (top-left corner) of the cropped region of the original page.

Yep, exactly! That's a great summary.
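The offset arithmetic behind that summary can be sketched in plain Python. This is an illustration of the behavior described in the documentation quote, not the library's actual code: `relative_to_absolute` is a hypothetical name, and the page and crop sizes are assumed examples.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def relative_to_absolute(bbox: BBox, parent_bbox: BBox) -> BBox:
    """Shift a relative (x0, top, x1, bottom) box by the parent's top-left corner."""
    x0, top, x1, bottom = bbox
    origin_x, origin_top = parent_bbox[0], parent_bbox[1]
    return (x0 + origin_x, top + origin_top, x1 + origin_x, bottom + origin_top)


# Assumed example: a US Letter page (612 x 792) whose bounding box starts at (0, 0).
full_page = (0, 0, 612, 792)
# On the uncropped page the offset is zero, so relative and absolute boxes coincide.
same = relative_to_absolute((0, 0, 1, 1), full_page)

# On a region previously cropped to (100, 200, 400, 600), the same kind of box is
# measured from that region's top-left corner instead.
cropped_region = (100, 200, 400, 600)
shifted = relative_to_absolute((10, 10, 50, 50), cropped_region)
```

Here `same` stays (0, 0, 1, 1), which is why the original crop covered only a 1 x 1 square, while `shifted` becomes (110, 210, 150, 250) in the coordinates of the original page.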

situchen commented 3 years ago

image

Hi @jsvine, when I extract the text in a specified area, images in that area reduce the accuracy of the extraction, and the spaces in the specified area are removed. If I delete the picture with Adobe Acrobat and then extract the same area, the extraction works normally. Please help. Thank you very much.

jsvine commented 3 years ago

Hi @situchen, your inquiry seems unrelated to this issue thread. Please open a discussion here instead and provide as much detail as possible (including the original PDF, the code you're using, etc.), so that we can best help you.