Closed LiutongZhou closed 3 years ago
Hi @LiutongZhou. The issue you are facing is not necessarily a bug. You are getting None when extracting the text from the cropped page because the cropped region contains no text. The cropped region is just a single point, as you can see from the image representation of the saved page below.
The bounding box (0, 0, 1, 1) is just a square of area 1.
Hi @samkit-jain, is the above statement still true even if I set relative=True in the crop method? If so, what differentiates relative=True from relative=False?
I was expecting page.crop((a, b, c, d), relative=True) to be equivalent to page.crop((a * width, b * height, c * width, d * height)).
Please help me understand it.
Thanks
Hi @LiutongZhou, I think you may be misunderstanding the units of a bounding box. (0, 0, 1., 1.) sounds like you may be trying to get the full width and height (since in some domains, 1 is equivalent to 100%). If that's your goal, then you will want something like this: (0, 0, page.width, page.height)
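For readers who do want to specify a crop as fractions of the page, here is a minimal sketch of the conversion. The helper name fraction_to_points is hypothetical and is not part of pdfplumber; it assumes the page's bounding box starts at (0, 0) and that the width and height are given in the same units the page uses.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def fraction_to_points(fractions: BBox, width: float, height: float) -> BBox:
    """Hypothetical helper: scale a fractional bbox (values in 0..1)
    to page units. Assumes the page origin is (0, 0)."""
    fx0, ftop, fx1, fbottom = fractions
    return (fx0 * width, ftop * height, fx1 * width, fbottom * height)


# Example with an assumed US Letter page size of 612 x 792.
full_page = fraction_to_points((0, 0, 1, 1), 612, 792)
top_half = fraction_to_points((0, 0, 1, 0.5), 612, 792)
```

The resulting tuple could then be passed to page.crop(...) with page.width and page.height in place of the assumed 612 and 792.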
Hi @jsvine. Thank you for your explanation. But I was hoping that crop((0, 0, 1., 1.), relative=True) would give me the same result as that returned by crop((0, 0, page.width, page.height), relative=False).
I assume this is the intention for having this optional parameter relative.
Is my understanding wrong?
Ah, now I better understand your question. Thank you for clarifying. Here is an explanation of the relative parameter, from the documentation:
If relative=True, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See Issue #245 for a visual example and explanation.)
Okay, this is confusing :D
I would not have been able to understand it if I hadn't read the whole of issue #245.
So relative=True is equivalent to relative=False when the whole page is being cropped. relative=True makes a difference only when it is called on a cropped page, in which case the coordinates are relative to the origin (top-left corner) of the cropped region of the original page.
Yep, exactly! That's a great summary.
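The summary above can be illustrated with a small sketch of the offset arithmetic. This is plain Python that models the documented behavior, not pdfplumber's actual implementation; resolve_crop_bbox is a hypothetical name, and the page and crop sizes are made-up values.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def resolve_crop_bbox(bbox: BBox, parent_bbox: BBox, relative: bool = False) -> BBox:
    """Hypothetical model: with relative=True, treat bbox as an offset
    from the top-left corner of parent_bbox; otherwise return it unchanged."""
    if not relative:
        return bbox
    x0, top, x1, bottom = bbox
    origin_x, origin_top = parent_bbox[0], parent_bbox[1]
    return (x0 + origin_x, top + origin_top, x1 + origin_x, bottom + origin_top)


# Assumed values: a full page whose bbox starts at (0, 0), and a region
# previously cropped from it.
full_page = (0, 0, 612, 792)
cropped = (100, 200, 400, 600)

# On the full page, relative=True and relative=False resolve identically.
same_a = resolve_crop_bbox((10, 10, 50, 50), full_page, relative=True)
same_b = resolve_crop_bbox((10, 10, 50, 50), full_page, relative=False)

# On the cropped region, relative=True shifts the box by the crop's origin.
shifted = resolve_crop_bbox((10, 10, 50, 50), cropped, relative=True)
```

Note that in this model the two modes coincide on an uncropped page only because its bounding box is assumed to start at (0, 0).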
Hi @jsvine, when I extract the text in a specified area, images in that area affect the accuracy of my extraction, and the spaces in the specified area are removed. If I use Adobe Acrobat to delete the image and then extract the specified area, the extraction is normal. Please help, thank you very much.
The Bug
Setting relative box coordinates in crop and then calling extract_text is not working.
page.crop(box_coordinates, relative=True)
Code to reproduce the problem
PDF file
https://s22.q4cdn.com/407748750/files/doc_financials/2020/ar/2020-Proxy-Card.pdf
Expected behavior
Return the text of the page
Actual behavior
Return nothing
Environment