Closed wodny closed 3 months ago
Thank you for the detailed issue, @wodny. I'm not sure the response below resolves the entirety of what you're seeing, but it seems like a decent place to start.
As I understand it, a core problem you're seeing is this:
import pdfplumber
pdf = pdfplumber.open("pages-cut.pdf")
page = pdf.pages[1]
print(page.crop(page.bbox).extract_text())
... returns a blank string. Indeed, with a normal PDF, that'd be unexpected. But it seems the reason this is happening is that the coordinates of the page's characters are all outside the page's bbox, (420.9449, 0.0, 841.8898, 595.2756). For instance, taking just the first character, page.chars[0] (omitting some keys for concision):
{'matrix': (8.000022, 0.0, 0.0, 8.000022, -318.393281, 342.49985499999997),
...
'x0': -318.393281,
...
'x1': -313.51326758,
...
'width': 4.880013420000012,
'height': 8.000022000000001,
'size': 8.000022000000001,
...
'text': 'F',
...
'top': 246.4637276420001,
'bottom': 254.4637496420001,
...}
Given those coordinates, I would not expect that character to be retained after page.crop(page.bbox), as it is outside the .bbox.
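To make the geometry concrete, here is a minimal sketch that checks the character's box against the page's bbox using only the numbers quoted above. The `overlaps` helper is hypothetical and written for illustration; it is not pdfplumber's own cropping logic, which may differ in detail.

```python
def overlaps(a, b):
    """Return True if boxes a and b, given as (x0, top, x1, bottom), share any area."""
    ax0, atop, ax1, abottom = a
    bx0, btop, bx1, bbottom = b
    return ax0 < bx1 and bx0 < ax1 and atop < bbottom and btop < abottom


# Values copied from the output above.
page_bbox = (420.9449, 0.0, 841.8898, 595.2756)
char_bbox = (-318.393281, 246.4637276420001, -313.51326758, 254.4637496420001)

# The character fits vertically, but its x-range lies entirely left of the bbox.
print(overlaps(char_bbox, page_bbox))  # False
```

The vertical range (246.46 to 254.46) is well within the page; only the negative x-coordinates push the character outside.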
Of course, if we look at the PDF itself in a PDF viewer, the characters appear normally. This suggests to me two possibilities, although perhaps I'm overlooking others:
- The PDF does technically indicate those out-of-bounds positions for the text, but it's a common error that most/all PDF viewers know how to handle
- pdfminer.six (the dependency that handles coordinate calculations for pdfplumber) has a bug and isn't calculating the coordinates of those characters correctly
What do you make of this assessment? Does it change your belief that there's a bug in page.crop(...)? (To my eyes, page.crop(...) is working as intended, but it's possible I haven't quite grokked your broader concern.)
Of course, if we look at the PDF itself in a PDF viewer, the characters appear normally. This suggests to me two possibilities, although perhaps I'm overlooking others:
- The PDF does technically indicate those out-of-bounds positions for the text, but it's a common error that most/all PDF viewers know how to handle
- pdfminer.six (the dependency that handles coordinate calculations for pdfplumber) has a bug and isn't calculating the coordinates of those characters correctly
I have done some more debugging (adding some debug output to pdfminer) and I think it's neither. Not only because all tested viewers, with different libraries underneath, render the PDF correctly without any warnings, and because I hope the MuPDF creators know what they are doing. More importantly, the technical reasoning is as follows...
Some additional notes about pages-cut-x.pdf:
- both streams are identical, only the MediaBox attributes differ:
- for object 6: /MediaBox [0 0 420.9449 595.2756]
- for object 11: /MediaBox [420.9449 0 841.8898 595.2756].
I have created a piece of code that creates pages the way pdfplumber does:
#!/usr/bin/env python3
from pprint import pprint
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFPageInterpreter

def get_layout(rsrcmgr, page, i):
    # Interpret one page and return the aggregated LTPage layout.
    device = PDFPageAggregator(
        rsrcmgr,
        pageno=i
    )
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    interpreter.process_page(page)
    return device.get_result()

with open("pages-cut-x.pdf", "rb") as stream:
    doc = PDFDocument(PDFParser(stream))
    rsrcmgr = PDFResourceManager()
    pages = list(PDFPage.create_pages(doc))
    layouts = [ get_layout(rsrcmgr, page, i) for i, page in enumerate(pages, 1) ]
    # Print each PDFPage next to its LTPage, followed by the page's LTRects.
    for i, layout in enumerate(layouts):
        print(i, pages[i], layout)
        for obj in layout._objs:
            if obj.__class__.__name__ == "LTRect":
                print(obj)
        print()
This gives me the following output:
+++ page mediabox [0, 0, 420, 595] [0, 0, 420, 595]
+++ pts vs raw_pts cur_item <LTPage(1) 0.000,0.000,420.945,595.276 rotate=0>
+++ [[95, 425], [342, 425], [342, 170], [95, 170], [95, 425]]
+++ [[95, 169], [342, 169], [342, 425], [95, 425], [95, 169]]
+++ LTRect bbox (95.441, 425.880591, 342.40999999999997, 170.13459099999994)
+++ pts vs raw_pts cur_item <LTPage(1) 0.000,0.000,420.945,595.276 rotate=0>
+++ [[485, 425], [732, 425], [732, 170], [485, 170], [485, 425]]
+++ [[485, 169], [732, 169], [732, 425], [485, 425], [485, 169]]
+++ LTRect bbox (485.055, 425.880591, 732.024, 170.13459099999994)
+++ page mediabox [420, 0, 841, 595] [0, 0, 420, 595]
+++ pts vs raw_pts cur_item <LTPage(2) 0.000,0.000,420.945,595.276 rotate=0>
+++ [[-325, 425], [-78, 425], [-78, 170], [-325, 170], [-325, 425]]
+++ [[95, 169], [342, 169], [342, 425], [95, 425], [95, 169]]
+++ LTRect bbox (-325.50390000000004, 425.880591, -78.53490000000005, 170.13459099999994)
+++ pts vs raw_pts cur_item <LTPage(2) 0.000,0.000,420.945,595.276 rotate=0>
+++ [[64, 425], [311, 425], [311, 170], [64, 170], [64, 425]]
+++ [[485, 169], [732, 169], [732, 425], [485, 425], [485, 169]]
+++ LTRect bbox (64.11009999999999, 425.880591, 311.0791, 170.13459099999994)
0 <PDFPage: Resources={'ExtGState': {'a0': {'CA': 1, 'ca': 1}}, 'Font': {'f-0-0': <PDFObjRef:5>}}, MediaBox=[0, 0, 420.9449, 595.2756]> <LTPage(1) 0.000,0.000,420.945,595.276 rotate=0>
<LTRect 95.441,170.135,342.410,425.881>
<LTRect 485.055,170.135,732.024,425.881>
1 <PDFPage: Resources={'ExtGState': {'a0': {'CA': 1, 'ca': 1}}, 'Font': {'f-0-0': <PDFObjRef:5>}}, MediaBox=[420.9449, 0, 841.8898, 595.2756]> <LTPage(2) 0.000,0.000,420.945,595.276 rotate=0>
<LTRect -325.504,170.135,-78.535,425.881>
<LTRect 64.110,170.135,311.079,425.881>
LTRects are rendered in the context of a LTPage collected by the PDFPage. Note that LTPage's bbox is normalized (converter.py):
def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
    (x0, y0, x1, y1) = page.mediabox
    (x0, y0) = apply_matrix_pt(ctm, (x0, y0))
    (x1, y1) = apply_matrix_pt(ctm, (x1, y1))
    mediabox = (0, 0, abs(x0 - x1), abs(y0 - y1))
    self.cur_item = LTPage(self.pageno, mediabox)
while PDFPage just uses the numbers from the object's /MediaBox attribute. So when cropping is executed, LTRects have coordinates according to the normalized LTPage container while page.mediabox/page.bbox is equal to the /MediaBox attribute.
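The mismatch can be reproduced with plain arithmetic. The sketch below assumes an unrotated page for which the transformation matrix is a pure translation by the MediaBox origin; that assumption is consistent with the debug output above but is not taken from pdfminer.six's source here. Both helper names are hypothetical.

```python
def normalize_mediabox(mediabox):
    """Mirror the begin_page() normalization: keep the size, move the origin to (0, 0)."""
    x0, y0, x1, y1 = mediabox
    return (0, 0, abs(x0 - x1), abs(y0 - y1))


def translate_x(x, mediabox):
    """Assumed CTM for an unrotated page: shift x by the MediaBox's x0."""
    return x - mediabox[0]


page1_mediabox = (0, 0, 420.9449, 595.2756)
page2_mediabox = (420.9449, 0, 841.8898, 595.2756)

# x-extent of the first rectangle as written in the (identical) content streams.
raw_x0, raw_x1 = 95.441, 342.41

for mediabox in (page1_mediabox, page2_mediabox):
    shifted = (translate_x(raw_x0, mediabox), translate_x(raw_x1, mediabox))
    print(normalize_mediabox(mediabox), shifted)
```

For the second page the shifted x-range comes out at roughly (-325.504, -78.535), matching the LTRect reported above, while both pages end up with the same normalized size.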
This means that, in terms of geometry, correct results are generated if crop() is called with page.layout.bbox instead of page.bbox. But this doesn't mean it's the solution. Note that this requires passing strict=False because the CroppedPage constructor checks the cropping box against page.bbox, not LTPage.bbox. So this:
if strict:
    test_proposed_bbox(crop_bbox, parent_page.bbox)
should probably become this:
if strict:
    test_proposed_bbox(crop_bbox, parent_page.layout.bbox)
Additionally, crop() requires a note that the layout coordinates must be passed. Probably this is not the only required change, as self.bbox = crop_bbox in the constructor would still confuse the /MediaBox attribute's coordinate system with the LTPage coordinate system.
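To illustrate why strict=False is needed in that workaround, here is a minimal sketch of a containment check. `bbox_within` is a hypothetical stand-in written for this example; it is not the implementation of test_proposed_bbox, which may behave differently.

```python
def bbox_within(inner, outer):
    """Return True if inner lies entirely inside outer; boxes are (x0, top, x1, bottom)."""
    ix0, itop, ix1, ibottom = inner
    ox0, otop, ox1, obottom = outer
    return ix0 >= ox0 and itop >= otop and ix1 <= ox1 and ibottom <= obottom


page_bbox = (420.9449, 0.0, 841.8898, 595.2756)  # taken from /MediaBox
layout_bbox = (0.0, 0.0, 420.9449, 595.2756)     # normalized LTPage bbox

# Cropping to the layout bbox fails a check against page.bbox ...
print(bbox_within(layout_bbox, page_bbox))    # False
# ... but would pass a check against the layout's own bbox.
print(bbox_within(layout_bbox, layout_bbox))  # True
```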
Another big thanks for the detailed and thoughtful response, @wodny. This is a helpful clue you shared:
- both streams are identical, only the MediaBox attributes differ:
- for object 6: /MediaBox [0 0 420.9449 595.2756]
- for object 11: /MediaBox [420.9449 0 841.8898 595.2756].
... in conjunction with this from the first block of output in your response:
0 <PDFPage: Resources={'ExtGState': {'a0': {'CA': 1, 'ca': 1}}, 'Font': {'f-0-0': <PDFObjRef:5>}}, MediaBox=[0, 0, 420.9449, 595.2756]> <LTPage(1) 0.000,0.000,420.945,595.276 rotate=0>
<LTRect 95.441,170.135,342.410,425.881>
<LTRect 485.055,170.135,732.024,425.881>
1 <PDFPage: Resources={'ExtGState': {'a0': {'CA': 1, 'ca': 1}}, 'Font': {'f-0-0': <PDFObjRef:5>}}, MediaBox=[420.9449, 0, 841.8898, 595.2756]> <LTPage(2) 0.000,0.000,420.945,595.276 rotate=0>
<LTRect -325.504,170.135,-78.535,425.881>
<LTRect 64.110,170.135,311.079,425.881>
As I understand it, MediaBox should not alter the underlying coordinates of any of the graphical objects on the page, but rather describes a shiftable viewport. From the PDF reference:
And yet, as you point out, pdfminer.six is indeed altering those coordinates. And, indeed, those LTRect coordinates in the output above have different coordinates in pdfminer.six's output even though they come from the same objects.
Reverting pdfminer.six's shift seems to resolve this issue, without any adjustment to how .crop(...) works. (My instinct here is that .crop(...) should require no changes, but still open to additional evidence on that point.)
With the changes in 9025c3f, this code:
import pdfplumber
pdf = pdfplumber.open("pages-cut-x.pdf")
for i, p in enumerate(pdf.pages):
    print(f"--- Page {i + 1} ---")
    print(p.crop(p.bbox).extract_table())
    print("")
... produces this, which seems like the expected output:
--- Page 1 ---
[['FooCol1', 'FooCol2', 'FooCol3'], ['Foo4', 'Foo5', 'Foo6'], ['Foo7', 'Foo8', 'Foo9'], ['Foo10', 'Foo11', 'Foo12'], ['', '', '']]
--- Page 2 ---
[['BarCol1', 'BarCol2', 'BarCol3'], ['Bar4', 'Bar5', 'Bar6'], ['Bar7', 'Bar8', 'Bar9'], ['Bar10', 'Bar11', 'Bar12'], ['', '', '']]
I've also tested it with alterations to the MediaBox's y coordinates, and the fix seems robust to those as well.
The proposed changes are available on the issue-1181 branch. Try it out and let me know if it resolves the issue for you?
As I understand it, MediaBox should not alter the underlying coordinates of any of the graphical objects on the page, but rather describes a shiftable viewport.
Indeed that makes the most sense.
And yet, as you point out, pdfminer.six is indeed altering those coordinates.
And just before that it creates a transformation matrix based on the mediabox.
And, indeed, those LTRect coordinates in the output above have different coordinates in pdfminer.six's output even though they come from the same objects.
It seems it started 15 years ago but is not described as a paradigm change but rather a means to an end (as I interpret it):
2009/08/27: Fixed page rotation handling.
Reverting pdfminer.six's shift seems to resolve this issue, without any adjustment to how .crop(...) works. (My instinct here is that .crop(...) should require no changes, but still open to additional evidence on that point.)
With the changes in 9025c3f, this code: [...] ... produces this, which seems like the expected output [...]
Oh, this is nice (unifying objects coordinates)!
pdfminer does no scaling in the CTM/LTPage adjustment so offsetting should be enough. It also works after adding /Rotate to the page.
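As a rough illustration of why offsetting is enough when there is no scaling, the sketch below adds each page's MediaBox origin back onto the LTRect values printed earlier. The `unshift` helper is hypothetical and only shows the arithmetic; it is not the code from the actual fix, and it ignores rotation.

```python
def unshift(rect, mediabox):
    """Add the MediaBox origin back onto a layout rectangle (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    dx, dy = mediabox[0], mediabox[1]
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


page1_mediabox = (0, 0, 420.9449, 595.2756)
page2_mediabox = (420.9449, 0, 841.8898, 595.2756)

# First LTRect on each page, as printed by the script earlier in the thread.
rect_on_page1 = (95.441, 170.135, 342.410, 425.881)
rect_on_page2 = (-325.504, 170.135, -78.535, 425.881)

restored1 = unshift(rect_on_page1, page1_mediabox)
restored2 = unshift(rect_on_page2, page2_mediabox)

# Up to the rounding in the printed values, both pages now agree.
print([round(v, 3) for v in restored1])
print([round(v, 3) for v in restored2])
```

After the offset, the same drawing instruction yields the same coordinates on both pages, which is what identical content streams would suggest.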
The proposed changes are available on the issue-1181 branch. Try it out and let me know if it resolves the issue for you?
It does resolve the issue. Thank you.
Great, thanks! And thanks for the additional links and checks. The fix is now broadly available in the new v0.11.3 release.
It's not really clear to me that this is a bug in pdfminer.six but rather a particular interpretation of what "device space" means - remember that pdfminer.six has a (leaky) abstraction of a PDFDevice, which is where ultimately LTPage is coming from. For perhaps obvious reasons, device space is not defined by the PDF standard (PDF 1.7 sec 8.3.2.2):
A particular device’s coordinate system is called its device space. The origin of the device space on different devices can fall in different places on the output page; on displays, the origin can vary depending on the window system. Because the paper or other output medium moves through different printers and imagesetters in different directions, the axes of their device spaces may be oriented differently.
You could say that pdfplumber declares that its "device" space is default user space, with Rotation applied, and flipped by 180 degrees, whereas pdfminer.six considers it to be default user space with Rotation applied and the origin of MediaBox translated to [0 0].
Both are "correct" in that any interpretation of device space is correct, as long as it's understood what it means.
Describe the bug
It seems that Page.crop():
strict=True),
For pages which have a bbox not starting at (0, 0) this causes page.crop(page.bbox) to return an empty set of objects. Adding relative=True does not help because it makes it two times worse. This is related to #245.
Note that externally provided PDFs may already be cropped. This is what mutool (MuPDF) does when using the poster function. It copies the page into multiple pages and then adjusts their MediaBox.
Have you tried repairing the PDF?
Repairing the PDF fixes the problem but:
crop().
BTW, README.md doesn't mention repair.
Code to reproduce the problem
In pdfplumber's utils/geometry.py:
Effect:
PDF file
mutool poster -x 2,
Additionally available at my page.
Expected behavior
page.crop(page.bbox) should be more-or-less an identity transformation.
Actual behavior
page.crop(page.bbox) returns an empty page and complains with strict=True when bbox does not start at (0, 0).
Screenshots
Original before cutting (two tables on one page later cut in half):
Environment
Additional context
If it gets accepted as a bug I can propose a patch.
It would probably look something like this: