jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars, 659 forks

Page.search ValueError: min() arg is an empty sequence #683

Closed bpugnaire closed 2 years ago

bpugnaire commented 2 years ago

Describe the bug

When using the page.search method with a regex, the program may raise a ValueError: min() arg is an empty sequence.

Code to reproduce the problem

The issue is not easily reproducible. On some pages, with the same regex pattern, I get the expected behavior (a list of matches, or an empty list if there is no match). On other pages I get the error.

Expected behavior

Get a list of matches, or an empty list if there are no matches.

Actual behavior

Got the ValueError: min() arg is an empty sequence

Environment

Additional context

Traceback (most recent call last):
  File "c:\Users\xxx.virtualenvs\project\lib\site-packages\pdfplumber\page.py", line 316, in search
    return text_layout.search(pattern, regex=regex, case=case)
  File "c:\Users\xxx.virtualenvs\project\lib\site-packages\pdfplumber\utils.py", line 528, in search
    return list(map(match_to_dict, gen))
  File "c:\Users\xxx.virtualenvs\project\lib\site-packages\pdfplumber\utils.py", line 499, in match_to_dict
    x0, top, x1, bottom = objects_to_bbox(chars)
  File "c:\Users\xx.virtualenvs\project\lib\site-packages\pdfplumber\utils.py", line 207, in objects_to_bbox
    min(map(itemgetter("x0"), objects)),
ValueError: min() arg is an empty sequence
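The last frame of the traceback shows min() being called on an empty sequence of objects. The sketch below reproduces only that failure mode in isolation. Note that objects_to_bbox here is a simplified stand-in written for illustration, not pdfplumber's actual implementation, and bbox_or_none is a hypothetical guard, not necessarily how the library fixed it.

```python
from operator import itemgetter


def objects_to_bbox(objects):
    """Simplified stand-in (an assumption, not pdfplumber's code):
    return the smallest (x0, top, x1, bottom) box covering all objects."""
    return (
        min(map(itemgetter("x0"), objects)),
        min(map(itemgetter("top"), objects)),
        max(map(itemgetter("x1"), objects)),
        max(map(itemgetter("bottom"), objects)),
    )


def bbox_or_none(objects):
    """Hypothetical guard: return None instead of raising when there
    are no objects to measure."""
    objects = list(objects)
    if not objects:
        return None
    return objects_to_bbox(objects)


# Two made-up characters sitting side by side on one line.
chars = [
    {"x0": 10.0, "top": 5.0, "x1": 16.0, "bottom": 15.0},
    {"x0": 16.0, "top": 5.0, "x1": 22.0, "bottom": 15.0},
]

print(bbox_or_none(chars))  # bounding box covering both characters

try:
    objects_to_bbox([])  # min() has nothing to compare
except ValueError as exc:
    print("ValueError:", exc)

print(bbox_or_none([]))  # the guard returns None rather than raising
```

If this is what happens inside Page.search, it would mean that a regex match ended up with no characters attached to it, but the thread does not confirm the root cause.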

bpugnaire commented 2 years ago

Note: I only encountered the bug when using regex=True.

samkit-jain commented 2 years ago

Hi @bpugnaire Appreciate your interest in the library and thanks for raising this issue. Could you please provide the sample code that you used to reproduce this issue? It would help us investigate and fix. If possible, the PDF that resulted in this behaviour as well.

bpugnaire commented 2 years ago

match_group = '(Figure|Fig.|fig.|Tab.|tab.|Tabl.)'
search_result = page.search("(?<!(\(|\[))" + match_group, regex=True)

The regex is meant to match "xxx Figure 1xxx" but not "xxx [Figure1 xxx".
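The intended behavior of the pattern can be checked outside pdfplumber with Python's re module. This is a minimal sketch that assumes the lookbehind was meant to read (?<!(\(|\[)), since the backslashes appear to have been lost when the comment was rendered. find_labels is a hypothetical helper name, and the alternation is copied from the comment unchanged, so its unescaped dots still match any character.

```python
import re

# Alternation copied from the comment above; the dots are left unescaped.
MATCH_GROUP = "(Figure|Fig.|fig.|Tab.|tab.|Tabl.)"

# Negative lookbehind: reject a label directly preceded by "(" or "[".
# The escaped brackets are an assumption about the original pattern.
PATTERN = re.compile(r"(?<!(\(|\[))" + MATCH_GROUP)


def find_labels(text):
    """Return the full text of every match of PATTERN in `text`."""
    return [m.group(0) for m in PATTERN.finditer(text)]


print(find_labels("xxx Figure 1xxx"))   # label not preceded by a bracket
print(find_labels("xxx [Figure1 xxx"))  # label preceded by "["
```

With plain strings the pattern behaves as described: the first call finds "Figure" and the second finds nothing. This does not reproduce the ValueError, which arises inside pdfplumber's own match handling.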

jsvine commented 2 years ago

Thanks @bpugnaire. Are you able to share the PDF? That would make it easier to diagnose the situation.

jsvine commented 2 years ago

This is fixed in the above commit, and available as of v0.7.1.