jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

inconsistent coordinate systems when cropping #1181

Closed wodny closed 3 months ago

wodny commented 4 months ago

Describe the bug

It seems that Page.crop():

For pages which have a bbox not starting at (0, 0) this causes page.crop(page.bbox) to return an empty set of objects. Adding relative=True does not help because it makes it two times worse. This is related to #245.

Note that externally provided PDFs may already be cropped. This is what mutool (MuPDF) does when using the poster function: it copies the page into multiple pages and then adjusts their MediaBox.

Have you tried repairing the PDF?

Repairing the PDF fixes the problem but:

BTW, README.md doesn't mention repair.

Code to reproduce the problem

import pdfplumber as pp

with pp.open("pages-cut-x.pdf") as f:
    p1, p2 = f.pages[:2]

    print("page 1 bbox", p1.bbox, "rects:", len(p1.rects))
    print("page 2 bbox", p2.bbox, "rects:", len(p2.rects))

    print()
    print("page 1 tables")
    for t in p1.find_tables():
        print(tuple(map(int, t.bbox)))

    print()
    print("page 2 tables")
    for t in p2.find_tables():
        print(tuple(map(int, t.bbox)))

    print()
    print("page 1 text")
    print(p1.extract_text_simple())
    print()
    print("page 1 text (cropped)")
    print(p1.crop(p1.bbox).extract_text_simple())

    print()

    print("page 2 text")
    print(p2.extract_text_simple())
    print()
    print("page 2 text (cropped)")
    print(p2.crop(p2.bbox).extract_text_simple() or "--- no text ---")
    print()
    # with strict=True:
    # ValueError: Bounding box (0, 0.0, 420.9449, 595.2756) is not fully within parent page bounding box (420.9449, 0.0, 841.8898, 595.2756)
    print("page 2 text (cropped non-strict by p1 bbox)")
    print(p2.crop(p1.bbox, strict=False).extract_text_simple())
    print()
    print("page 2 text (cropped non-strict by p2 bbox translated to 0,0)")
    print(
        p2.crop(
            (
                0,
                0,
                p2.bbox[2] - p2.bbox[0],
                p2.bbox[3] - p2.bbox[1]
            ),
            strict=False
        ).extract_text_simple()
    )

In pdfplumber's utils/geometry.py:

def clip_obj(obj: T_obj, bbox: T_bbox) -> Optional[T_obj]:
    overlap = get_bbox_overlap(obj_to_bbox(obj), bbox)
    if overlap is None:
        if obj["object_type"] == "rect":
            print(">>> missed obj", f"{obj['x0']:4.0f}, {obj['y0']:4.0f}", bbox)
        return None
    [...]

Effect:

page 1 bbox (0, 0.0, 420.9449, 595.2756) rects: 2
page 2 bbox (420.9449, 0.0, 841.8898, 595.2756) rects: 2

page 1 tables
(95, 169, 342, 425)
(485, 169, 732, 425)

page 2 tables
(-325, 169, -78, 425)
(64, 169, 311, 425)

page 1 text
FooCol1 FooCol2 FooCol3 BarCol1 BarCol2 BarCol3
Foo4 Foo5 Foo6 Bar4 Bar5 Bar6
Foo7 Foo8 Foo9 Bar7 Bar8 Bar9
Foo10 Foo11 Foo12 Bar10 Bar11 Bar12

page 1 text (cropped)
>>> crop bbox (0, 0.0, 420.9449, 595.2756)
>>> missed obj  485,  170 (0, 0.0, 420.9449, 595.2756)
>>> rects 1
FooCol1 FooCol2 FooCol3
Foo4 Foo5 Foo6
Foo7 Foo8 Foo9
Foo10 Foo11 Foo12

page 2 text
FooCol1 FooCol2 FooCol3 BarCol1 BarCol2 BarCol3
Foo4 Foo5 Foo6 Bar4 Bar5 Bar6
Foo7 Foo8 Foo9 Bar7 Bar8 Bar9
Foo10 Foo11 Foo12 Bar10 Bar11 Bar12

page 2 text (cropped)
>>> crop bbox (420.9449, 0.0, 841.8898, 595.2756)
>>> missed obj -326,  170 (420.9449, 0.0, 841.8898, 595.2756)
>>> missed obj   64,  170 (420.9449, 0.0, 841.8898, 595.2756)
>>> rects 0
--- no text ---

page 2 text (cropped non-strict by p1 bbox)
>>> crop bbox (0, 0.0, 420.9449, 595.2756)
>>> missed obj -326,  170 (0, 0.0, 420.9449, 595.2756)
>>> rects 1
BarCol1 BarCol2 BarCol3
Bar4 Bar5 Bar6
Bar7 Bar8 Bar9
Bar10 Bar11 Bar12

page 2 text (cropped non-strict by p2 bbox translated to 0,0)
>>> crop bbox (0, 0, 420.9449, 595.2756)
>>> missed obj -326,  170 (0, 0, 420.9449, 595.2756)
>>> rects 1
BarCol1 BarCol2 BarCol3
Bar4 Bar5 Bar6
Bar7 Bar8 Bar9
Bar10 Bar11 Bar12

PDF file

Additionally available at my page.

Expected behavior

page.crop(page.bbox) should be more-or-less an identity transformation.

Actual behavior

page.crop(page.bbox) returns an empty page and complains with strict=True when bbox does not start at (0, 0).

Screenshots

Original before cutting (two tables on one page later cut in half):

FooCol1   FooCol2   FooCol3 |  BarCol1   BarCol2   BarCol3
                            |
Foo4      Foo5      Foo6    |  Bar4      Bar5      Bar6
                            |
Foo7      Foo8      Foo9    |  Bar7      Bar8      Bar9
                            |
Foo10     Foo11     Foo12   |  Bar10     Bar11     Bar12

Environment

Additional context

If it gets accepted as a bug I can propose a patch.

It would probably look something like this:

class CroppedPage(DerivedPage):
    def __init__([...]):
        [...]
        def _crop_fn(objs: T_obj_list) -> T_obj_list:
            crop_bbox_adj = (
                crop_bbox[0] - o_x0,
                crop_bbox[1] - o_top,
                crop_bbox[2] - o_x0,
                crop_bbox[3] - o_top
            )
            return crop_fn(objs, crop_bbox_adj)

jsvine commented 4 months ago

Thank you for the detailed issue, @wodny. I'm not sure the response below resolves the entirety of what you're seeing, but it seems like a decent place to start.

As I understand it, a core problem you're seeing is this:

import pdfplumber
pdf = pdfplumber.open("pages-cut.pdf")
page = pdf.pages[1]
print(page.crop(page.bbox).extract_text())

... returns a blank string. Indeed, with a normal PDF, that'd be unexpected. But it seems the reason this is happening is that the coordinates of the page's characters are all outside the page's bbox ((420.9449, 0.0, 841.8898, 595.2756)). For instance, taking just the first character, page.chars[0] (omitting some keys for concision):

{'matrix': (8.000022, 0.0, 0.0, 8.000022, -318.393281, 342.49985499999997),
...
 'x0': -318.393281,
...
 'x1': -313.51326758,
...
 'width': 4.880013420000012,
 'height': 8.000022000000001,
 'size': 8.000022000000001,
...
 'text': 'F',
...
 'top': 246.4637276420001,
 'bottom': 254.4637496420001,
...}

Given those coordinates, I would not expect that character to be retained after page.crop(page.bbox), as it is outside the .bbox.
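To make the geometry concrete, here is a minimal sketch of the overlap test using the numbers quoted above. The `bbox_overlap` helper is a hypothetical, simplified stand-in written for this illustration; it is not pdfplumber's own `get_bbox_overlap`.

```python
from typing import Optional, Tuple

# Boxes are (x0, top, x1, bottom), as in the character dict above.
BBox = Tuple[float, float, float, float]


def bbox_overlap(a: BBox, b: BBox) -> Optional[BBox]:
    """Return the intersection of two boxes, or None if they do not overlap.

    Hypothetical helper for illustration only.
    """
    x0, top = max(a[0], b[0]), max(a[1], b[1])
    x1, bottom = min(a[2], b[2]), min(a[3], b[3])
    if x1 <= x0 or bottom <= top:
        return None
    return (x0, top, x1, bottom)


# Page 2's bbox and the first character's box, both taken from the thread.
page_bbox = (420.9449, 0.0, 841.8898, 595.2756)
char_bbox = (-318.393281, 246.4637276420001, -313.51326758, 254.4637496420001)

# The character lies entirely to the left of the page bbox.
print(bbox_overlap(char_bbox, page_bbox))  # None
```

The character's x range ends at about -313.5 while the page bbox starts at 420.9449, so no crop box derived from page.bbox can retain it.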

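Of course, if we look at the PDF itself in a PDF viewer, the characters appear normally. This suggests to me two possibilities, although perhaps I'm overlooking others:

  • The PDF does technically indicate those out-of-bounds positions for the text, but it's a common error that most/all PDF viewers know how to handle

  • pdfminer.six (the dependency that handles coordinate calculations for pdfplumber) has a bug and isn't calculating the coordinates of those characters correctly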

What do you make of this assessment? Does it change your belief that there's a bug in page.crop(...)? (To my eyes, page.crop(...) is working as intended, but it's possible I haven't quite grokked your broader concern.)

wodny commented 4 months ago

Of course, if we look at the PDF itself in a PDF viewer, the characters appear normally. This suggests to me two possibilities, although perhaps I'm overlooking others:

  • The PDF does technically indicate those out-of-bounds positions for the text, but it's a common error that most/all PDF viewers know how to handle

  • pdfminer.six (the dependency that handles coordinate calculations for pdfplumber) has a bug and isn't calculating the coordinates of those characters correctly

I have done some more debugging (including some added to pdfminer itself) and I think it's neither. That is not only because every viewer I tested, each with a different library underneath, renders the PDF correctly without any warnings, and because I hope the MuPDF creators know what they are doing. More importantly, the technical reasoning is as follows.

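Some additional notes about pages-cut-x.pdf:

  • both streams are identical,
  • only the MediaBox attributes differ:

    • for object 6: /MediaBox [0 0 420.9449 595.2756]
    • for object 11: /MediaBox [420.9449 0 841.8898 595.2756].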

I have created a piece of code that creates pages the way pdfplumber does:

#!/usr/bin/env python3

from pprint import pprint

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFPageInterpreter

def get_layout(rsrcmgr, page, i):
    device = PDFPageAggregator(
        rsrcmgr,
        pageno=i
    )
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    interpreter.process_page(page)
    return device.get_result()

with open("pages-cut-x.pdf", "rb") as stream:
    doc = PDFDocument(PDFParser(stream))
    rsrcmgr = PDFResourceManager()
    pages = list(PDFPage.create_pages(doc))
    layouts = [ get_layout(rsrcmgr, page, i) for i, page in enumerate(pages, 1) ]
    for i, layout in enumerate(layouts):
        print(i, pages[i], layout)
        for obj in layout._objs:
            if obj.__class__.__name__ == "LTRect":
                print(obj)
        print()

This gives me the following output:

+++ page mediabox [0, 0, 420, 595] [0, 0, 420, 595]
+++ pts vs raw_pts cur_item <LTPage(1) 0.000,0.000,420.945,595.276 rotate=0>
+++      [[95, 425], [342, 425], [342, 170], [95, 170], [95, 425]]
+++      [[95, 169], [342, 169], [342, 425], [95, 425], [95, 169]]
+++ LTRect bbox (95.441, 425.880591, 342.40999999999997, 170.13459099999994)
+++ pts vs raw_pts cur_item <LTPage(1) 0.000,0.000,420.945,595.276 rotate=0>
+++      [[485, 425], [732, 425], [732, 170], [485, 170], [485, 425]]
+++      [[485, 169], [732, 169], [732, 425], [485, 425], [485, 169]]
+++ LTRect bbox (485.055, 425.880591, 732.024, 170.13459099999994)

+++ page mediabox [420, 0, 841, 595] [0, 0, 420, 595]
+++ pts vs raw_pts cur_item <LTPage(2) 0.000,0.000,420.945,595.276 rotate=0>
+++      [[-325, 425], [-78, 425], [-78, 170], [-325, 170], [-325, 425]]
+++      [[95, 169], [342, 169], [342, 425], [95, 425], [95, 169]]
+++ LTRect bbox (-325.50390000000004, 425.880591, -78.53490000000005, 170.13459099999994)
+++ pts vs raw_pts cur_item <LTPage(2) 0.000,0.000,420.945,595.276 rotate=0>
+++      [[64, 425], [311, 425], [311, 170], [64, 170], [64, 425]]
+++      [[485, 169], [732, 169], [732, 425], [485, 425], [485, 169]]
+++ LTRect bbox (64.11009999999999, 425.880591, 311.0791, 170.13459099999994)
0 <PDFPage: Resources={'ExtGState': {'a0': {'CA': 1, 'ca': 1}}, 'Font': {'f-0-0': <PDFObjRef:5>}}, MediaBox=[0, 0, 420.9449, 595.2756]> <LTPage(1) 0.000,0.000,420.945,595.276 rotate=0>
<LTRect 95.441,170.135,342.410,425.881>
<LTRect 485.055,170.135,732.024,425.881>

1 <PDFPage: Resources={'ExtGState': {'a0': {'CA': 1, 'ca': 1}}, 'Font': {'f-0-0': <PDFObjRef:5>}}, MediaBox=[420.9449, 0, 841.8898, 595.2756]> <LTPage(2) 0.000,0.000,420.945,595.276 rotate=0>
<LTRect -325.504,170.135,-78.535,425.881>
<LTRect 64.110,170.135,311.079,425.881>

LTRects are rendered in the context of an LTPage collected by the PDFPage. Note that LTPage's bbox is normalized (converter.py):

    def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
        (x0, y0, x1, y1) = page.mediabox
        (x0, y0) = apply_matrix_pt(ctm, (x0, y0))
        (x1, y1) = apply_matrix_pt(ctm, (x1, y1))
        mediabox = (0, 0, abs(x0 - x1), abs(y0 - y1))
        self.cur_item = LTPage(self.pageno, mediabox)

while PDFPage just uses the numbers from the object's /MediaBox attribute. So when cropping is executed, LTRects have coordinates according to the normalized LTPage container while page.mediabox/page.bbox is equal to the /MediaBox attribute.
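The mismatch between the two coordinate systems can be reproduced with plain arithmetic. The sketch below assumes an unrotated page with no scaling, so the effect of the CTM reduces to translating the MediaBox origin to (0, 0); both function names are hypothetical and exist only for this illustration.

```python
def normalize_mediabox(mediabox):
    """Mimic the LTPage bbox computed in begin_page for an unrotated page."""
    x0, y0, x1, y1 = mediabox
    return (0, 0, abs(x0 - x1), abs(y0 - y1))


def shift_into_layout(bbox, mediabox):
    """Translate a user-space box so the MediaBox origin lands at (0, 0).

    Assumption: no rotation and no scaling, so only an offset is applied.
    """
    dx, dy = mediabox[0], mediabox[1]
    return (bbox[0] - dx, bbox[1] - dy, bbox[2] - dx, bbox[3] - dy)


# Page 2's /MediaBox and the second rectangle in content-stream coordinates,
# both taken from the debug output above.
mediabox_p2 = (420.9449, 0, 841.8898, 595.2756)
rect_user_space = (485.055, 170.134591, 732.024, 425.880591)

print(normalize_mediabox(mediabox_p2))                  # LTPage-style bbox
print(shift_into_layout(rect_user_space, mediabox_p2))  # x0 is about 64.11
```

The shifted rectangle matches the `LTRect 64.110,...` line in the output, while page.bbox still reports the unshifted /MediaBox values.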

This means that in terms of geometry correct results are generated if crop() is called with page.layout.bbox instead of page.bbox. But this doesn't mean it's the solution. Note that this requires passing strict=False because the CroppedPage constructor checks the cropping box against page.bbox, not LTPage.bbox. So this:

if strict:
    test_proposed_bbox(crop_bbox, parent_page.bbox)

should probably become this:

if strict:
    test_proposed_bbox(crop_bbox, parent_page.layout.bbox)

Additionally, crop() would need a note that layout coordinates must be passed. This is probably not the only change required, because self.bbox = crop_bbox in the constructor would still confuse the /MediaBox coordinate system with the LTPage coordinate system.
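As a rough sketch of why the reference box used by the strict check matters, the snippet below uses a hypothetical `check_crop_bbox` function. It is a simplified containment test written for this example, not pdfplumber's `test_proposed_bbox`, and it only borrows the wording of the error message shown earlier in the thread.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def is_within(inner: BBox, outer: BBox) -> bool:
    """True if `inner` lies entirely inside `outer` (edges may touch)."""
    return (
        inner[0] >= outer[0]
        and inner[1] >= outer[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def check_crop_bbox(crop_bbox: BBox, reference_bbox: BBox) -> None:
    """Hypothetical strict check: raise if the crop box leaves the reference."""
    if not is_within(crop_bbox, reference_bbox):
        raise ValueError(
            f"Bounding box {crop_bbox} is not fully within "
            f"parent page bounding box {reference_bbox}"
        )


page_bbox = (420.9449, 0.0, 841.8898, 595.2756)   # from /MediaBox
layout_bbox = (0.0, 0.0, 420.9449, 595.2756)      # normalized LTPage bbox

# Checked against the layout bbox, the layout-based crop box is accepted.
check_crop_bbox(layout_bbox, layout_bbox)

# Checked against page.bbox, the same crop box is rejected.
try:
    check_crop_bbox(layout_bbox, page_bbox)
except ValueError as exc:
    print(exc)
```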

jsvine commented 4 months ago

Another big thanks for the detailed and thoughtful response, @wodny. This is a helpful clue you shared:

  • both streams are identical,
  • only the MediaBox attributes differ:

    • for object 6: /MediaBox [0 0 420.9449 595.2756]
    • for object 11: /MediaBox [420.9449 0 841.8898 595.2756].

... in conjunction with this from the first block of output in your response:

0 <PDFPage: Resources={'ExtGState': {'a0': {'CA': 1, 'ca': 1}}, 'Font': {'f-0-0': <PDFObjRef:5>}}, MediaBox=[0, 0, 420.9449, 595.2756]> <LTPage(1) 0.000,0.000,420.945,595.276 rotate=0>
<LTRect 95.441,170.135,342.410,425.881>
<LTRect 485.055,170.135,732.024,425.881>

1 <PDFPage: Resources={'ExtGState': {'a0': {'CA': 1, 'ca': 1}}, 'Font': {'f-0-0': <PDFObjRef:5>}}, MediaBox=[420.9449, 0, 841.8898, 595.2756]> <LTPage(2) 0.000,0.000,420.945,595.276 rotate=0>
<LTRect -325.504,170.135,-78.535,425.881>
<LTRect 64.110,170.135,311.079,425.881>

As I understand it, MediaBox should not alter the underlying coordinates of any of the graphical objects on the page — but rather describes a shiftable viewport. From the PDF reference:

[Screenshot: excerpt from the PDF reference]

And yet, as you point out, pdfminer.six is indeed altering those coordinates. And, indeed, those LTRect coordinates in the output above have different coordinates in pdfminer.six's output even though they come from the same objects.

Reverting pdfminer.six's shift seems to resolve this issue, without any adjustment to how .crop(...) works. (My instinct here is that .crop(...) should require no changes, but still open to additional evidence on that point.)

With the changes in 9025c3f, this code:

import pdfplumber
pdf = pdfplumber.open("pages-cut-x.pdf")
for i, p in enumerate(pdf.pages):
    print(f"--- Page {i + 1} ---")
    print(p.crop(p.bbox).extract_table())
    print("")

... produces this, which seems like the expected output:

--- Page 1 ---
[['FooCol1', 'FooCol2', 'FooCol3'], ['Foo4', 'Foo5', 'Foo6'], ['Foo7', 'Foo8', 'Foo9'], ['Foo10', 'Foo11', 'Foo12'], ['', '', '']]

--- Page 2 ---
[['BarCol1', 'BarCol2', 'BarCol3'], ['Bar4', 'Bar5', 'Bar6'], ['Bar7', 'Bar8', 'Bar9'], ['Bar10', 'Bar11', 'Bar12'], ['', '', '']]

I've also tested it with alterations to the MediaBox's y coordinates, and the fix seems robust to those as well.

The proposed changes are available on the issue-1181 branch. Try it out and let me know if it resolves the issue for you?

wodny commented 4 months ago

As I understand it, MediaBox should not alter the underlying coordinates of any of the graphical objects on the page — but rather describes a shiftable viewport.

Indeed that makes the most sense.

And yet, as you point out, pdfminer.six is indeed altering those coordinates.

And just before that it creates a transformation matrix based on the mediabox.

And, indeed, those LTRect coordinates in the output above have different coordinates in pdfminer.six's output even though they come from the same objects.

It seems this started 15 years ago, described not as a paradigm change but rather as a means to an end (as I interpret it):

2009/08/27: Fixed page rotation handling.

Reverting pdfminer.six's shift seems to resolve this issue, without any adjustment to how .crop(...) works. (My instinct here is that .crop(...) should require no changes, but still open to additional evidence on that point.)

With the changes in 9025c3f, this code: [...] ... produces this, which seems like the expected output [...]

Oh, this is nice (unifying objects coordinates)!

pdfminer does no scaling in the CTM/LTPage adjustment so offsetting should be enough. It also works after adding /Rotate to the page.
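The offsetting idea can be checked against the numbers in this thread. This is only a sketch under the assumption stated above (translation only, no scaling); `unshift` and `x_inside` are hypothetical helpers, not the code from the actual commit.

```python
def unshift(bbox, mediabox):
    """Add the MediaBox origin back to a box in pdfminer's layout coordinates."""
    dx, dy = mediabox[0], mediabox[1]
    return (bbox[0] + dx, bbox[1] + dy, bbox[2] + dx, bbox[3] + dy)


def x_inside(bbox, page_bbox):
    """True if the box's horizontal extent lies within the page's."""
    return page_bbox[0] <= bbox[0] and bbox[2] <= page_bbox[2]


# Page 2's /MediaBox and its two LTRects as reported by pdfminer.six above.
mediabox_p2 = (420.9449, 0, 841.8898, 595.2756)
rects_p2 = [
    (-325.504, 170.135, -78.535, 425.881),
    (64.110, 170.135, 311.079, 425.881),
]

for rect in rects_p2:
    restored = unshift(rect, mediabox_p2)
    # Restored x0 values are about 95.44 and 485.05, matching page 1's rects.
    print(restored, "inside page 2:", x_inside(restored, mediabox_p2))
```

After the offset, both pages report the same coordinates for the same objects, and only the second rectangle falls within page 2's MediaBox.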

The proposed changes are available on the issue-1181 branch. Try it out and let me know if it resolves the issue for you?

It does resolve the issue. Thank you.

jsvine commented 3 months ago

Great, thanks! And thanks for the additional links and checks. The fix is now broadly available in the new v0.11.3 release.

dhdaines commented 1 week ago

It's not really clear to me that this is a bug in pdfminer.six but rather a particular interpretation of what "device space" means - remember that pdfminer.six has a (leaky) abstraction of a PDFDevice which is where ultimately LTPage is coming from. For perhaps obvious reasons, device space is not defined by the PDF standard (PDF 1.7 sec 8.3.2.2):

A particular device’s coordinate system is called its device space. The origin of the device space on different devices can fall in different places on the output page; on displays, the origin can vary depending on the window system. Because the paper or other output medium moves through different printers and imagesetters in different directions, the axes of their device spaces may be oriented differently.

You could say that pdfplumber declares that its "device" space is default user space, with Rotation applied, and flipped by 180 degrees, whereas pdfminer.six considers it to be default user space with Rotation applied and the origin of MediaBox translated to [0 0].

Both are "correct" in that any interpretation of device space is correct, as long as it's understood what it means.