jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars, 659 forks

Page.search results bbox position can be wrong #684

Closed bpugnaire closed 2 years ago

bpugnaire commented 2 years ago

Describe the bug

When drawing the bboxes returned by page.search, the bbox is often shifted too far to the right.

Code to reproduce the problem

p = test_doc.pages[X]
im = p.to_image()
im.draw_rects(p.search(pattern, regex=True))

Expected behavior

The bbox should be around the word or pattern searched.

Actual behavior

It seems that the first match is on point most of the time, but the bboxes of the following matches are shifted horizontally to the right. Experimentally, the offset is somewhere around 20-30.
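The offset described above can be measured rather than eyeballed. The following is a minimal sketch, not pdfplumber's implementation: it assumes the expected bbox of a match is simply the union of the boxes of the matched characters, and that the characters are supplied in reading order as dicts with the same "text", "x0", "x1", "top" and "bottom" keys that pdfplumber uses for chars. The helper names (`make_chars`, `expected_bboxes`, `horizontal_offsets`) are hypothetical and exist only for this illustration.

```python
import re


def make_chars(text, width=5):
    """Build synthetic char dicts on one line (hypothetical test helper)."""
    return [
        {"text": ch, "x0": i * width, "x1": (i + 1) * width, "top": 0, "bottom": 10}
        for i, ch in enumerate(text)
    ]


def expected_bboxes(chars, pattern):
    """Return one bbox dict per regex match, built from the matched chars.

    Assumes `chars` is in reading order and each dict holds one character.
    """
    text = "".join(c["text"] for c in chars)
    results = []
    for m in re.finditer(pattern, text):
        matched = chars[m.start():m.end()]
        if not matched:
            continue  # ignore empty matches
        results.append({
            "text": m.group(0),
            "x0": min(c["x0"] for c in matched),
            "top": min(c["top"] for c in matched),
            "x1": max(c["x1"] for c in matched),
            "bottom": max(c["bottom"] for c in matched),
        })
    return results


def horizontal_offsets(reported, expected):
    """Pairwise difference in x0 between reported and expected bboxes."""
    return [r["x0"] - e["x0"] for r, e in zip(reported, expected)]


chars = make_chars("see fig. 1 and fig. 2")
expected = expected_bboxes(chars, r"fig\.")

# Simulate the reported symptom: first match correct, second shifted right.
reported = [dict(expected[0]), dict(expected[1], x0=expected[1]["x0"] + 25)]
offsets = horizontal_offsets(reported, expected)
```

With a real document, one could pass `page.chars` as `chars` and the output of `page.search(pattern, regex=True)` as `reported`. That comparison is only approximate: `page.chars` is not guaranteed to be in reading order, and it does not contain any whitespace that text extraction infers from layout, so patterns that span several words may not line up.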

Screenshots

image

Here I'm searching for "fig." and the first bbox actually matches the second occurrence of the pattern by pure luck.

Environment

jsvine commented 2 years ago

Hi @bpugnaire, and thanks for filing this issue. It's going to be very hard to diagnose what's happening without having access to the PDF. Are you able to share that?

jsvine commented 2 years ago

Hi @bpugnaire, I think I may have fixed the problem with this commit https://github.com/jsvine/pdfplumber/commit/12feadb8fdbd86fb2f596abe803a1a46d58a58da, now on the develop branch. Can you try that and see if it helps? Or, alternatively, can you share the PDF?

bpugnaire commented 2 years ago

Hello, the commit appears to solve the issue on my side, good job and thank you!

jsvine commented 2 years ago

Thanks for checking!