jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.31k stars 647 forks source link

Handle `ValueError` exception when searching for text using regex #687

Closed samkit-jain closed 2 years ago

samkit-jain commented 2 years ago

This PR attempts to fix the issue #683 reported by @bpugnaire. The issue was happening because

chars = [c for (text, c) in subset if c is not None]

would return an empty array because of which the subsequent

x0, top, x1, bottom = objects_to_bbox(chars)

call would throw a ValueError exception.

I was able to reproduce the issue on the file issue-71-duplicate-chars-2.pdf using the regex \d+. The solution implied is a simple one with try-except handling. However, I am unsure if this is the correct solution because it would result in the output having objects with x0, x1, ... being null which may cause further processing issues as they are expected to be numbers. Another solution would be to not include these objects in the final answer but that may result in incomplete data being returned. Yet to debug why this issue happened in the first place, that is, a solution that would result in the chars list not being empty.

codecov[bot] commented 2 years ago

Codecov Report

Merging #687 (ad5c2bb) into develop (fd3f533) will not change coverage. The diff coverage is 100.00%.

:exclamation: Current head ad5c2bb differs from pull request most recent head 762ce03. Consider uploading reports for the commit 762ce03 to get more accurate results

@@            Coverage Diff            @@
##           develop      #687   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           12        12           
  Lines         1393      1396    +3     
=========================================
+ Hits          1393      1396    +3     
Impacted Files Coverage Δ
pdfplumber/utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update fd3f533...762ce03. Read the comment docs.

jsvine commented 2 years ago

Many thanks for finding a PDF that allows us to reproduce the problem, and for writing the test, @samkit-jain! As you suspected, the try-except approach treated the symptoms of a deeper issue, rather than the cause. But your PR really helped to investigate the problem, which ended up being in LayoutEngine.calculate(...), which had been assuming that len(char["text"]) would always equal 1, which is not true for ligatures. (The test PDF contains the ligature.) This created a mismatch between the string that .search(...) was using and the character-order table (LayoutEngine.layout_tuples). The fix turned out to be relatively simple, which is to iterate over all letters in char["text"]: https://github.com/jsvine/pdfplumber/commit/12feadb8fdbd86fb2f596abe803a1a46d58a58da

Worth noting: This fix uses the same char object data for both the f and i in the fi ligature (as an example), rather than, e.g., splitting recalculating the bounding box and other information as if the ligature had been split in half. Given the somewhat limited purpose of that step, I think that's probably OK, but a side effect will be that search-matches beginning or ending with just one letter in the ligature will have a bounding box that spans both.

Thanks again!