pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

trouble in page.find_tables #3592

Closed yuiant closed 1 week ago

yuiant commented 1 week ago

Description of the bug

I have a document with a table on every page. I want to collect a list of Table objects across the whole document, so I used the code below:

import fitz
doc = fitz.open('my_pdf.pdf')
my_tables = []
for page in doc:
    page_tables = page.find_tables().tables
    my_tables += page_tables

However, when I execute my_tables[0].extract(), the content comes from the last page rather than the first. It seems that each call to page.find_tables in the loop changes the content of the Table objects already stored in my_tables. I looked over the source code; maybe the "global" keyword causes this trouble?

How to reproduce the bug

import fitz
doc = fitz.open('my_pdf.pdf')
my_tables = []
for page in doc:
    page_tables = page.find_tables().tables
    my_tables += page_tables

print(my_tables[0].extract())

PyMuPDF version

1.24.5

Operating system

MacOS

Python version

3.9

JorjMcKie commented 1 week ago

Table finder results depend on their page and do not survive the deletion of that page. You must save table contents in some page-independent way, for instance as extractions (table.extract()[:]) or as DataFrames (table.to_pandas()).
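
A minimal sketch of the first suggestion, applied to the loop from the report: extract each table's rows while its page is still the current one, and keep only plain Python lists. The helper name collect_table_rows and the (page index, rows) tuple layout are illustrative assumptions, not part of PyMuPDF; the only library calls relied on are page.find_tables(), its .tables attribute, and table.extract(), as used above.

```python
def collect_table_rows(doc):
    """Return a list of (page_index, rows) tuples, one per detected table.

    `doc` is assumed to be an opened PyMuPDF document, for example the
    result of fitz.open("my_pdf.pdf"). Any iterable of page-like objects
    that offer find_tables() works the same way.
    """
    saved = []
    for page_index, page in enumerate(doc):
        for table in page.find_tables().tables:
            # Extract now, while this page is still the current one.
            # Copying every row yields plain lists that no longer refer
            # to the page or to the Table object.
            rows = [list(row) for row in table.extract()]
            saved.append((page_index, rows))
    return saved
```

With PyMuPDF installed, collect_table_rows(fitz.open('my_pdf.pdf'))[0] should then hold the first page's index and rows, because nothing stored in the result refers back to a Table object.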