jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.48k stars, 658 forks

Add support for structure tree and marked content sections #937

Closed: dhdaines closed this 1 year ago

dhdaines commented 1 year ago

Implements #909 with what is, IMHO, a rather convenient interface for marked content: marked content IDs (MCIDs) are listed in the structure tree, then propagated to the objects on each page.

codecov[bot] commented 1 year ago

Codecov Report

Merging #937 (42ec17e) into develop (f6887b5) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           develop      #937    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           18        19     +1     
  Lines         1588      1716   +128     
==========================================
+ Hits          1588      1716   +128     

Files Changed              Coverage Δ
pdfplumber/page.py         100.00% <100.00%> (ø)
pdfplumber/structure.py    100.00% <100.00%> (ø)

dhdaines commented 1 year ago

Note that you can link the structure tree and text from marked content sections like this:

from collections import deque

def get_text_by_mcid(page):
    # Build a list in which index i holds the text of all chars tagged with MCID i.
    mcids = []
    for c in page.chars:
        mcid = c.get("mcid")
        if mcid is None:
            continue
        # Grow the list until mcids[mcid] exists.
        while len(mcids) <= mcid:
            mcids.append("")
        mcids[mcid] += c["text"]
    return mcids

def get_structure_tree_with_text(page):
    texts = get_text_by_mcid(page)
    st = page.structure_tree
    # Breadth-first walk, replacing each element's MCIDs with the matching text.
    d = deque(st)
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if "mcids" in el:
            el["mcids"] = [texts[mcid] for mcid in el["mcids"] if mcid < len(texts)]
    return st

Not sure whether this would be helpful to have as a method on the Page class. One thing I notice is that MCIDs are not reliably aligned to word breaks: they often change in the middle of a word for no apparent reason.
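To see how often an MCID changes in the middle of a word, something like the following could be used. This is only a rough sketch: find_midword_mcid_changes and the sample data are hypothetical, and it assumes the chars arrive in reading order and use the same "text" and "mcid" keys as the example above.

```python
from itertools import groupby

def find_midword_mcid_changes(chars):
    """Yield (word, mcids) for each whitespace-delimited word that spans
    more than one MCID. `chars` is a list of dicts with a "text" key and
    an optional "mcid" key, assumed to be in reading order."""
    # Split the char stream into alternating runs of whitespace and non-whitespace.
    for is_space, run in groupby(chars, key=lambda c: c["text"].isspace()):
        if is_space:
            continue
        run = list(run)
        # Distinct MCIDs seen inside this word, in order of first appearance.
        mcids = list(dict.fromkeys(c.get("mcid") for c in run))
        if len(mcids) > 1:
            yield "".join(c["text"] for c in run), mcids

# Hypothetical data: the MCID switches from 0 to 1 inside the word "plumber".
sample_chars = (
    [{"text": t, "mcid": 0} for t in "plum"]
    + [{"text": t, "mcid": 1} for t in "ber pdf"]
)
split_words = list(find_midword_mcid_changes(sample_chars))
```

With a real page, passing page.chars in place of sample_chars should work as long as the chars are ordered the way the text reads.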

dhdaines commented 1 year ago

Another helpful example, for instance if you want to get the bounding box of a Table element on a page:

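import itertools
from collections import deque

import pdfplumber

def get_tables(page):
    # Breadth-first walk yielding every Table element in the structure tree.
    st = page.structure_tree
    d = deque(st)
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if el["type"] == "Table":
            yield el

def get_child_mcids(el):
    # Yield the MCIDs of an element and of all of its descendants.
    d = deque([el])
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if "mcids" in el:
            yield from el["mcids"]

# `page` is an already-open pdfplumber page.
# Take the first Table and compute the bounding box of every char and image inside it.
t = next(get_tables(page))
mcids = set(get_child_mcids(t))
tbox = pdfplumber.utils.objects_to_bbox([
    c for c in itertools.chain(page.chars, page.images) if c.get("mcid") in mcids
])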

dhdaines commented 1 year ago

Another note: there is a small problem with this PR. There can be marked content sections that aren't referenced by the structure tree, which is specifically the case for headers and footers. The PR retains their MCIDs but nothing else, so there isn't any way to detect them. I'll add a marked_content_sections property which contains this information.
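Until such a property exists, one partial workaround might be to flag objects whose MCID never appears in the structure tree. The sketch below assumes the same dict shapes as the earlier examples ("children", "mcids", "mcid"); referenced_mcids, unreferenced_objects, and the sample data are hypothetical names, not part of this PR.

```python
from collections import deque

def referenced_mcids(structure_tree):
    """Collect every MCID mentioned anywhere in the structure tree."""
    found = set()
    d = deque(structure_tree)
    while d:
        el = d.popleft()
        d.extend(el.get("children", []))
        found.update(el.get("mcids", []))
    return found

def unreferenced_objects(objects, structure_tree):
    """Return the objects that carry an MCID the structure tree never references."""
    known = referenced_mcids(structure_tree)
    return [
        o for o in objects
        if o.get("mcid") is not None and o["mcid"] not in known
    ]

# Hypothetical data: MCID 2 is on the page but absent from the tree.
sample_tree = [{"type": "Document", "children": [{"type": "P", "mcids": [0, 1]}]}]
sample_objects = [
    {"text": "B", "mcid": 0},
    {"text": "o", "mcid": 1},
    {"text": "7", "mcid": 2},  # e.g. a page number in a footer
    {"text": "x"},             # not inside any marked content section
]
orphans = unreferenced_objects(sample_objects, sample_tree)
```

This only shows that a section is unreferenced. It cannot say what kind of section it is, which is the information the proposed property would carry.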

dhdaines commented 1 year ago

Another note: because this uses pypdfium2 to get the structure tree, it introduces quite a lot of overhead, though only if you choose to access it. You are now reading each page (at least partially) twice, once with pdfminer.six and once with pypdfium2. I will perhaps try to reimplement it with pdfminer.six. The logic of resolving the structure tree is slightly complicated, but the pdf.js implementation is a good guide.
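The "only if you choose to access it" behaviour can be pictured with a lazily computed property. This is a generic sketch rather than the code in this PR: LazyPage and expensive_loader are hypothetical stand-ins, and the loader represents whatever performs the second parse.

```python
from functools import cached_property

class LazyPage:
    """Hypothetical stand-in for a page object; not pdfplumber's Page class."""

    def __init__(self, load_structure_tree):
        # Any zero-argument callable that performs the expensive second parse.
        self._load_structure_tree = load_structure_tree

    @cached_property
    def structure_tree(self):
        # Runs on first access only; the result is then cached on the instance.
        return self._load_structure_tree()

calls = []

def expensive_loader():
    # Record each invocation so the laziness can be observed.
    calls.append(1)
    return [{"type": "Document", "children": []}]

page = LazyPage(expensive_loader)
calls_before_access = len(calls)   # nothing has been parsed yet
tree_first = page.structure_tree   # triggers the one and only parse
tree_second = page.structure_tree  # served from the cache
```

Callers that never touch structure_tree pay nothing, while those that do pay for the second parse once per page.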

jsvine commented 1 year ago

> Another note - because this uses pypdfium2 to get the structure tree it introduces quite a lot of overhead, though only if you choose to access it, since now you are reading (at least partially) each page twice, once with pdfminer.six and once with pypdfium2.

Thanks for flagging. As I understand it, this PR introduces two separate-but-related features:

  1. Adding the mcid attribute to each parsed object, which the PR currently handles entirely through subclassing pdfminer.six's PDFPageAggregator.
  2. Extracting a PDF's structure tree, which depends on pypdfium2. That dependency introduces the overhead you mentioned and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType, but I haven't investigated closely).

Would it be possible to separate this PR into those two distinct features? Is that something you'd be open to?

dhdaines commented 1 year ago

> 2. Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).
>
> Would it be possible to separate this PR into those two distinct features? Is that something you'd be open to?

Absolutely, especially since part 2 should probably be rewritten to use pdfminer.six.

Actually, part 1 ought to be a PR for pdfminer.six itself, but it doesn't seem likely that it could be merged anytime soon.

dhdaines commented 1 year ago

> 2. Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).

Ah, this can be fixed easily. It's just an issue with type-checking constructs introduced in Python 3.9, which slipped in accidentally.
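The thread doesn't say which constructs were involved. As an illustration of this class of problem only, and as an assumption about the cause: built-in generics such as list[dict] and dict[int, str] can be subscripted at runtime from Python 3.9 onwards, while the typing.List and typing.Dict spellings also work on 3.8. The function group_text_by_mcid below is a hypothetical helper written in the 3.8-compatible style.

```python
from typing import Dict, List

# On Python 3.8, annotating with list[dict] / dict[int, str] here would raise
# TypeError when the function is defined; typing.List / typing.Dict do not.
def group_text_by_mcid(chars: List[dict]) -> Dict[int, str]:
    """Concatenate the text of chars that share an MCID."""
    texts: Dict[int, str] = {}
    for c in chars:
        mcid = c.get("mcid")
        if mcid is not None:
            texts[mcid] = texts.get(mcid, "") + c["text"]
    return texts

grouped = group_text_by_mcid(
    [{"text": "a", "mcid": 0}, {"text": "b", "mcid": 0}, {"text": "c"}]
)
```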

dhdaines commented 1 year ago

Closing this PR and making two new ones! (The MCID one is there already: #961.)