sreeni5493 closed this 2 years ago
Thanks for flagging this. I'll look into ways of handling this. In the meantime, you should be able to handle your particular use-case this way:
page = page.crop((0, 0, page.bbox[2], page.bbox[3]))
Hello! I am also affected by this issue. Are there any updates on how to deal with shifted origins?
@mlecauchois @samkit-jain @jsvine
I found that removing the descent term in the pdfminer.six layout.py code fixes this for me:
https://github.com/pdfminer/pdfminer.six/blob/develop/pdfminer/layout.py
On lines 306 and 307, remove descent.
Previously:
bbox_lower_left = (0, descent + rise)
bbox_upper_right = (self.adv, descent + rise + fontsize)
Now:
bbox_lower_left = (0, rise)
bbox_upper_right = (self.adv, rise + fontsize)
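To make the effect of that change concrete, here is a minimal sketch of the arithmetic only; it is not pdfminer.six's actual code. The function name char_bbox, its parameters, and the sample numbers are hypothetical, and it assumes (as is typical for font metrics) that the descent is a negative fraction of the font size.

```python
def char_bbox(adv, fontsize, rise=0.0, descent_ratio=0.0, use_descent=True):
    """Return ((x0, y0), (x1, y1)) for a horizontal glyph.

    Hypothetical helper that mirrors the two variants quoted above:
    use_descent=True is the "Previously" form, False is the "Now" form.
    """
    descent = descent_ratio * fontsize if use_descent else 0.0
    lower_left = (0, descent + rise)
    upper_right = (adv, descent + rise + fontsize)
    return lower_left, upper_right


# Sample values (assumed): 10 pt font, descent of -0.25 em, advance of 6.
with_descent = char_bbox(6, 10, descent_ratio=-0.25)
without_descent = char_bbox(6, 10, descent_ratio=-0.25, use_descent=False)
print(with_descent)
print(without_descent)
```

In this sketch both boxes have the same height; dropping the descent only moves the box up by the size of the descent. That would be consistent with the patch helping when a font reports an unusual descent value, though the thread does not confirm the cause.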
I am not sure why this works, though. My best guess is that the descent is somehow not being calculated correctly.
I am also wondering how Ghostscript handles this.
I am testing across all my use cases to see whether the descent is needed at all.
You can now pass strict=False to Page.crop(...) to allow passing a bounding box that extends beyond the page's bbox: https://github.com/jsvine/pdfplumber/releases/tag/v0.7.4
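If you would rather keep the default strict check, one alternative is to clamp the requested box to the page's own bbox before cropping. The sketch below is plain Python: clamp_bbox is a hypothetical helper, not a pdfplumber function, and it assumes boxes are (x0, top, x1, bottom) tuples in the same form as page.bbox. The sample page box is made up to imitate a shifted origin.

```python
def clamp_bbox(requested, page_bbox):
    """Intersect a requested (x0, top, x1, bottom) box with the page's box.

    Hypothetical helper, not part of pdfplumber.
    """
    x0 = max(requested[0], page_bbox[0])
    top = max(requested[1], page_bbox[1])
    x1 = min(requested[2], page_bbox[2])
    bottom = min(requested[3], page_bbox[3])
    if x0 >= x1 or top >= bottom:
        raise ValueError("requested box does not overlap the page")
    return (x0, top, x1, bottom)


# Assumed example: a page whose bbox does not start at (0, 0).
page_bbox = (10, 20, 602, 772)
clamped = clamp_bbox((0, 0, 612, 792), page_bbox)
print(clamped)
```

With pdfplumber, the result could then be passed along as page.crop(clamp_bbox(wanted, page.bbox)), where wanted is whatever box you intended to keep.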
v1.pdf
With this PDF and the code above, I get the error shown. Is there a simple way to fix this? Also, if the origin is shifted inside a MediaBox, CropBox, or similar, won't the library fail to work? I need to crop because there is sometimes text at negative coordinates that I do not want, so I need to keep only the text in the visible area of the PDF.