jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

`pdfplumber.utils` functions should take `Iterable` and not `List` arguments #945

Closed dhdaines closed 11 months ago

dhdaines commented 11 months ago

Currently, `objects_to_bbox` and friends require a list as input, but there's no reason they couldn't work equally well with generator expressions, iterators, or any other iterable.

Well, actually there is a reason: they are written so that they iterate over their input once per attribute, and thus can't accept a one-shot iterable such as a generator. But this doesn't have to be the case. For instance, `merge_bboxes` could be written as:

```python
def merge_bboxes(bboxes: Iterable[T_bbox]) -> T_bbox:
    x0, top, x1, bottom = zip(*bboxes)
    return (min(x0), min(top), max(x1), max(bottom))
```

It's also a bit faster, about 12% in this quick benchmark (note that there is no need to wrap the `map` in a list):

```
In [46]: %timeit pdfplumber.utils.merge_bboxes(list(map(bbox_getter, page.chars)))
464 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [49]: %timeit merge_bboxes(map(bbox_getter, page.chars))
409 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
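
To make the repeated-iteration problem concrete, here is a minimal, self-contained sketch. The name `merge_bboxes_multipass` is a hypothetical stand-in for a "one pass per attribute" implementation, not a copy of the current pdfplumber source, and `T_bbox` is redefined locally as an assumption so the snippet runs without pdfplumber installed.

```python
from typing import Iterable, List, Tuple

# Local stand-in for pdfplumber's bbox type: (x0, top, x1, bottom).
T_bbox = Tuple[float, float, float, float]


def merge_bboxes_multipass(bboxes: List[T_bbox]) -> T_bbox:
    # Hypothetical multi-pass version: walks the input four times,
    # so it only works on re-iterable containers such as lists.
    return (
        min(b[0] for b in bboxes),
        min(b[1] for b in bboxes),
        max(b[2] for b in bboxes),
        max(b[3] for b in bboxes),
    )


def merge_bboxes_singlepass(bboxes: Iterable[T_bbox]) -> T_bbox:
    # zip(*bboxes) consumes the input exactly once and transposes it
    # into four tuples, one per coordinate.
    x0, top, x1, bottom = zip(*bboxes)
    return (min(x0), min(top), max(x1), max(bottom))


boxes = [(10.0, 20.0, 30.0, 40.0), (5.0, 25.0, 35.0, 38.0)]

list_result = merge_bboxes_multipass(boxes)
gen_result = merge_bboxes_singlepass(b for b in boxes)

# A generator is exhausted after the first pass, so the multi-pass
# version fails when its second min() call sees an empty input.
try:
    merge_bboxes_multipass(b for b in boxes)
    multipass_ok = True
except ValueError:
    multipass_ok = False
```

Two caveats on the single-pass form: `zip(*bboxes)` unpacks the whole iterable into call arguments, so it reads the input once but still holds every bbox in memory, and an empty input raises `ValueError` at the tuple unpacking rather than inside `min`.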

I can supply a PR :)

dhdaines commented 11 months ago

See #946