Closed dhdaines closed 1 year ago
Merging #937 (42ec17e) into develop (f6887b5) will not change coverage. The diff coverage is 100.00%.
@@ Coverage Diff @@
## develop #937 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 18 19 +1
Lines 1588 1716 +128
==========================================
+ Hits 1588 1716 +128
Files Changed | Coverage Δ | |
---|---|---|
pdfplumber/page.py | 100.00% <100.00%> (ø) | |
pdfplumber/structure.py | 100.00% <100.00%> (ø) | |
Note that you can link the structure tree and text from marked content sections like this:

    from collections import deque

    def get_text_by_mcid(page):
        # Concatenate character text per MCID; the list index is the MCID
        mcids = []
        for c in page.chars:
            mcid = c.get("mcid")
            if mcid is None:
                continue
            # Grow the list until it has a slot for this MCID
            while len(mcids) <= mcid:
                mcids.append("")
            mcids[mcid] += c["text"]
        return mcids

    def get_structure_tree_with_text(page):
        # Replace each element's MCIDs with the text of those sections
        texts = get_text_by_mcid(page)
        st = page.structure_tree
        d = deque(st)
        while d:
            el = d.popleft()
            if "children" in el:
                d.extend(el["children"])
            if "mcids" in el:
                el["mcids"] = [texts[mcid] for mcid in el["mcids"] if mcid < len(texts)]
        return st

Not sure if this would be helpful to have as a method in the Page class? One thing I notice is that MCIDs are not reliably aligned to word breaks: they often change in the middle of a word for no apparent reason.
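To make the word-break observation concrete, here is a minimal sketch that splits a character stream into whitespace-delimited words and lists the MCIDs seen inside each one. It assumes the chars are in reading order and include space characters; the function name `mcids_per_word` and the sample data are hypothetical, not part of this PR.

```python
from itertools import groupby

def mcids_per_word(chars):
    """Split chars at whitespace and list the MCIDs found inside each word."""
    words = []
    # groupby alternates between runs of whitespace and runs of non-whitespace
    for is_space, group in groupby(chars, key=lambda c: c["text"].isspace()):
        if is_space:
            continue
        group = list(group)
        text = "".join(c["text"] for c in group)
        # dict.fromkeys keeps first-seen order while dropping repeats
        mcids = list(dict.fromkeys(c.get("mcid") for c in group))
        words.append((text, mcids))
    return words

# Hypothetical chars: the MCID changes in the middle of "world"
chars = [
    {"text": t, "mcid": m}
    for t, m in zip("hello world", [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2])
]
result = mcids_per_word(chars)
split_words = [word for word, mcids in result if len(mcids) > 1]
```

Any word with more than one MCID in its list is one where a marked content section boundary falls mid-word.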
Another helpful example, for instance if you want to get the bounding box of a Table element on a page:
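
    import itertools
    from collections import deque

    import pdfplumber

    def get_tables(page):
        # Breadth-first search of the structure tree for Table elements
        st = page.structure_tree
        d = deque(st)
        while d:
            el = d.popleft()
            if "children" in el:
                d.extend(el["children"])
            if el["type"] == "Table":
                yield el

    def get_child_mcids(el):
        # Yield every MCID referenced by an element or its descendants
        d = deque([el])
        while d:
            el = d.popleft()
            if "children" in el:
                d.extend(el["children"])
            if "mcids" in el:
                yield from el["mcids"]

    # First Table on the page (raises StopIteration if there is none)
    t = next(get_tables(page))
    mcids = set(get_child_mcids(t))
    tbox = pdfplumber.utils.objects_to_bbox([
        c for c in itertools.chain(page.chars, page.images) if c.get("mcid") in mcids
    ])
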
Another note: there is a small problem with this PR, which is that there can be marked content sections that aren't referenced by the structure tree. This is specifically the case for headers and footers. The PR retains their MCIDs but nothing else, so there isn't any way to detect them. I'll add a marked_content_sections property which contains this information.
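As a rough illustration of the gap described here, the sketch below compares the MCIDs found on a page's objects with those referenced anywhere in the structure tree. It works on plain dicts shaped like the examples above; `referenced_mcids`, `unreferenced_mcids`, and the sample data are hypothetical names for illustration. It only surfaces MCIDs that the tree never mentions and says nothing about what kind of section they belong to, which is the information the proposed property would carry.

```python
from collections import deque

def referenced_mcids(structure_tree):
    """Collect every MCID mentioned anywhere in the structure tree."""
    found = set()
    d = deque(structure_tree)
    while d:
        el = d.popleft()
        d.extend(el.get("children", []))
        found.update(el.get("mcids", []))
    return found

def unreferenced_mcids(chars, structure_tree):
    """MCIDs carried by page objects but absent from the structure tree."""
    on_page = {c["mcid"] for c in chars if c.get("mcid") is not None}
    return on_page - referenced_mcids(structure_tree)

# Hypothetical data: MCID 2 (say, a page number) is not in the tree
tree = [{"type": "Document", "children": [{"type": "P", "mcids": [0, 1]}]}]
chars = [
    {"text": "a", "mcid": 0},
    {"text": "b", "mcid": 1},
    {"text": "7", "mcid": 2},
    {"text": "x"},  # object outside any marked content section
]
orphans = unreferenced_mcids(chars, tree)
```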
Another note: because this uses pypdfium2 to get the structure tree, it introduces quite a lot of overhead, though only if you choose to access it, since you are now reading (at least partially) each page twice, once with pdfminer.six and once with pypdfium2. I will perhaps try to reimplement it with pdfminer.six. The logic of resolving the structure tree is slightly complicated, but the pdf.js implementation is a good guide.
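The "only if you choose to access it" behaviour is the usual lazy-property pattern. A minimal sketch follows, assuming a hypothetical `LazyPage` class and a stand-in loader function; neither is pdfplumber's actual implementation, and the loader merely simulates the expensive second parse.

```python
from functools import cached_property

class LazyPage:
    """Hypothetical stand-in for a page object; not pdfplumber's Page class."""

    def __init__(self, loader):
        # loader is any callable that performs the expensive second parse
        self._loader = loader

    @cached_property
    def structure_tree(self):
        # Runs on first access only; the result is then stored on the instance
        return self._loader()

calls = []

def expensive_load():
    # Stand-in for re-reading the page with a second PDF library
    calls.append(1)
    return [{"type": "Document", "children": []}]

page = LazyPage(expensive_load)
calls_before = len(calls)     # nothing has been parsed yet
first = page.structure_tree   # triggers the load
second = page.structure_tree  # served from the cache
```

Users who never touch `structure_tree` pay nothing, and those who do pay the cost once per page.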
> Another note - because this uses pypdfium2 to get the structure tree it introduces quite a lot of overhead, though only if you choose to access it, since now you are reading (at least partially) each page twice, once with pdfminer.six and once with pypdfium2.
Thanks for flagging. As I understand it, this PR introduces two separate-but-related features:
- Adding an mcid attribute to each parsed object, which the PR currently handles entirely through subclassing pdfminer.six's PDFPageAggregator.
- Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).

Would it be possible to separate this PR into those two distinct features? Is that something you'd be open to?
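For the first feature, the bookkeeping behind an mcid attribute can be sketched without any PDF library: keep a stack of open marked content sections and stamp each object with the innermost MCID in effect. The `MarkedContentTracker` class below is hypothetical. It mimics begin/end-tag style callbacks but is not pdfminer.six's API, and the rule that a nested section without its own MCID inherits the enclosing one is an assumption of this sketch.

```python
class MarkedContentTracker:
    """Hypothetical helper; not the PDFPageAggregator subclass from this PR."""

    def __init__(self):
        # One entry per open marked content section (None if it has no MCID)
        self._stack = []

    def begin_tag(self, tag, props=None):
        # Called when a marked content section opens
        self._stack.append((props or {}).get("MCID"))

    def end_tag(self):
        # Called when the innermost section closes
        if self._stack:
            self._stack.pop()

    @property
    def current_mcid(self):
        # Assumption: the innermost section that carries an MCID wins
        for mcid in reversed(self._stack):
            if mcid is not None:
                return mcid
        return None

    def tag_object(self, obj):
        # Return a copy of the parsed object with an "mcid" key added
        tagged = dict(obj)
        tagged["mcid"] = self.current_mcid
        return tagged

tracker = MarkedContentTracker()
objs = [tracker.tag_object({"text": "h"})]        # outside any section
tracker.begin_tag("P", {"MCID": 3})
objs.append(tracker.tag_object({"text": "a"}))
tracker.begin_tag("Span")                         # nested, no MCID of its own
objs.append(tracker.tag_object({"text": "b"}))
tracker.end_tag()
tracker.end_tag()
objs.append(tracker.tag_object({"text": "f"}))    # outside again
mcid_values = [o["mcid"] for o in objs]
```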
Absolutely, especially since part 2 should probably be rewritten to use pdfminer.six.
Actually part 1 ought to be a PR for pdfminer.six itself, but it doesn't seem likely that it could be merged anytime soon.
> - Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).
Ah, this can be fixed easily: it's just an issue with type-checking constructs introduced in Python 3.9 that slipped in accidentally.
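One well-known member of this family of problems, which is not necessarily the exact construct involved in this PR, is subscripting built-in types in annotations. Writing `list[int]` or `dict[str, int]` in a signature is evaluated when the function is defined and raises a TypeError on Python 3.8, whereas the `typing` equivalents work on both 3.8 and 3.9+. The function `group_by_mcid` below is a hypothetical example written in the 3.8-compatible style.

```python
from typing import Dict, List

def group_by_mcid(chars: List[Dict[str, object]]) -> Dict[int, str]:
    """Concatenate character text per MCID, using 3.8-compatible annotations."""
    out: Dict[int, str] = {}
    for c in chars:
        mcid = c.get("mcid")
        # Skip objects that are not inside a marked content section
        if isinstance(mcid, int):
            out[mcid] = out.get(mcid, "") + str(c["text"])
    return out

# Hypothetical sample data
grouped = group_by_mcid([
    {"text": "a", "mcid": 0},
    {"text": "b", "mcid": 0},
    {"text": "c", "mcid": 1},
    {"text": "d"},
])
```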
Closing this PR and making two new ones! (The MCID one is there already: #961.)
Implements #909 with an IMHO rather convenient interface for marked content - IDs are listed in the structure tree, then propagated to objects in each page.