pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Error on .find_tables() #3191

Closed JorjMcKie closed 7 months ago

JorjMcKie commented 7 months ago

Discussed in https://github.com/pymupdf/PyMuPDF/discussions/3190

Originally posted by **bjmvercelli** February 20, 2024

Hello, hope you guys are doing great. I'm getting an error in version 1.23.24 (latest) when using the **find_tables()** method, more specifically in the **extract_text()** call. The following code is taken from `table.py` (lines 606 and 607). The error happens when `extract_words(chars)` returns an empty list.

```py
words = extractor.extract_words(chars)
rotation = words[0]["rotation"]  # rotation cannot change within a cell
```

I do not believe that there's a problem in `extract_words()`. I think this is an edge case triggered by my [PDF](https://github.com/pymupdf/PyMuPDF/files/14345112/uel.pdf), and if that's the case, we could fix it by checking the length of `words`:

```py
words = extractor.extract_words(chars)
if len(words) == 0:
    return ""
rotation = words[0]["rotation"]  # rotation cannot change within a cell
```

You can reproduce it [here](https://colab.research.google.com/drive/19Ji4Ie-HEpyFx9fAfZqc0TeB6VBAjnAN?usp=sharing).
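
For readers who want to see the failure mode in isolation, here is a minimal sketch that runs without PyMuPDF. The `extract_words` function below is a hypothetical stand-in, not the library's implementation, and the char dicts (`text`, `rotation`) are an assumed shape chosen only for illustration. It shows that indexing `words[0]` raises `IndexError` for a cell with no words, and that the proposed length check avoids it.

```py
def extract_words(chars):
    """Hypothetical stand-in: group char dicts into word dicts, splitting on whitespace."""
    groups, current = [], []
    for ch in chars:
        if ch["text"].isspace():
            if current:
                groups.append(current)
                current = []
        else:
            current.append(ch)
    if current:
        groups.append(current)
    return [
        {"text": "".join(c["text"] for c in g), "rotation": g[0]["rotation"]}
        for g in groups
    ]


def cell_text_unguarded(chars):
    """Mirrors the failing pattern: assumes at least one word exists."""
    words = extract_words(chars)
    rotation = words[0]["rotation"]  # raises IndexError when words is empty
    return " ".join(w["text"] for w in words)


def cell_text_guarded(chars):
    """Mirrors the proposed fix: return an empty string for an empty cell."""
    words = extract_words(chars)
    if len(words) == 0:
        return ""
    rotation = words[0]["rotation"]  # safe: words has at least one entry
    return " ".join(w["text"] for w in words)
```

The `rotation` value is read but not used here; it is kept only so the sketch mirrors the two lines quoted from `table.py`. A cell containing only whitespace characters behaves like an empty cell in this sketch, since it yields no words either.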
julian-smith-artifex-com commented 7 months ago

Fixed in 1.23.25.