pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
4.49k stars 443 forks

find_tables OOM #3607

Closed shmilysyq closed 6 days ago

shmilysyq commented 1 week ago

Description of the bug

When I run page.find_tables() asynchronously on batches of 5 pages, and all 5 pages contain figures, memory usage keeps increasing as further pages with figures are processed, until the process runs out of memory (OOM).

How to reproduce the bug

async def _parsing_pdf(self, file: Path):
    tasks = []
    futures = []
    start_time = time.time()
    document = fitz.open(file)
    for page in document:
        page_number = page.number
        tasks.append(self._process_page(page, file))
        if len(tasks) >= 5:
            completed_futures = await asyncio.gather(*tasks)
            tasks.clear()
            futures.extend(completed_futures)
    completed_futures = await asyncio.gather(*tasks)

async def _process_page(self, page, file_path):
    page_index = page.number
    image_list = page.get_images()
    table_finder = page.find_tables(strategy='lines_strict')
    table_list = table_finder.tables
    if len(image_list) == 0 and len(table_list) == 0:
        logger.info(f"page {page_index+1} has no image and table")

PyMuPDF version

1.23.x or earlier

Operating system

MacOS

Python version

3.11

JorjMcKie commented 6 days ago

PyMuPDF does not support Python's multithreading - please see the documentation!

We therefore do not accept issues involving this feature.
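
For readers who land here with the same setup: the pattern usually suggested instead of sharing page objects across concurrent tasks is multiprocessing, where every worker process opens the file itself. The following is only a minimal sketch under that assumption, not a confirmed fix for the memory growth reported above. The names chunk_page_numbers, count_tables and count_tables_in_file are hypothetical helpers made up for illustration; only fitz.open, page_count, find_tables and .tables are PyMuPDF calls.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def chunk_page_numbers(page_count: int, chunk_size: int) -> list[list[int]]:
    """Split 0-based page numbers into consecutive chunks (hypothetical helper)."""
    numbers = list(range(page_count))
    return [numbers[i:i + chunk_size] for i in range(0, page_count, chunk_size)]


def count_tables(path: str, page_numbers: list[int]) -> dict[int, int]:
    """Worker: opens its own document, so no PyMuPDF object crosses a process boundary."""
    import fitz  # PyMuPDF, imported inside the worker process

    counts = {}
    with fitz.open(path) as document:
        for number in page_numbers:
            page = document[number]
            finder = page.find_tables(strategy="lines_strict")
            counts[number] = len(finder.tables)
    return counts


def count_tables_in_file(path: Path, chunk_size: int = 5, workers: int = 2) -> dict[int, int]:
    """Distribute page chunks over worker processes and merge the per-page counts."""
    import fitz

    # Only the page count is read in the parent process.
    with fitz.open(path) as document:
        page_count = document.page_count

    chunks = chunk_page_numbers(page_count, chunk_size)
    results: dict[int, int] = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each task receives a plain path string and a list of ints, both picklable.
        for partial in pool.map(count_tables, [str(path)] * len(chunks), chunks):
            results.update(partial)
    return results
```

On macOS the default start method is spawn, so a call such as count_tables_in_file(Path("example.pdf")) (placeholder file name) should sit under an `if __name__ == "__main__":` guard. Processing the pages in a plain sequential loop is the simpler alternative when parallelism is not actually needed.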