Closed shmilysyq closed 6 days ago
When I run page.find_tables() asynchronously on batches of 5 pages, and all 5 pages contain figures, memory usage keeps growing as further pages with figures are processed, until the process runs out of memory (OOM).
    import asyncio
    import logging
    import time
    from pathlib import Path

    import fitz  # PyMuPDF

    logger = logging.getLogger(__name__)

    # Both functions are methods excerpted from a larger class.
    async def _parsing_pdf(self, file: Path):
        tasks = []
        futures = []
        start_time = time.time()
        document = fitz.open(file)
        for page in document:
            page_number = page.number
            tasks.append(self._process_page(page, file))
            # Run the pages in batches of 5
            if len(tasks) >= 5:
                completed_futures = await asyncio.gather(*tasks)
                tasks.clear()
                futures.extend(completed_futures)
        # Process the remaining pages (fewer than 5)
        completed_futures = await asyncio.gather(*tasks)

    async def _process_page(self, page, file_path):
        page_index = page.number
        image_list = page.get_images()
        table_finder = page.find_tables(strategy='lines_strict')
        table_list = table_finder.tables
        if len(image_list) == 0 and len(table_list) == 0:
            logger.info(f"page {page_index+1} has no image and table")
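For readers following the snippet above, here is a minimal sketch of the same batching pattern using only the standard library. `process_page` is a hypothetical stand-in for `_process_page`, and its arithmetic is a placeholder for the PyMuPDF calls; neither comes from the report. It illustrates one property of `asyncio.gather`: a coroutine with no `await` inside runs to completion before the next one starts, so the pages in a batch are handled one after another on the event loop thread. Unlike the snippet above, the sketch also collects the results of the final batch.

```python
import asyncio

order = []  # records when each page's work starts and ends


async def process_page(page_number: int) -> int:
    # Hypothetical stand-in for _process_page: synchronous work only,
    # with no await inside the coroutine body.
    order.append(("start", page_number))
    result = page_number * 2  # placeholder for get_images() / find_tables()
    order.append(("end", page_number))
    return result


async def process_in_batches(page_numbers, batch_size: int = 5):
    results = []
    tasks = []
    for n in page_numbers:
        tasks.append(process_page(n))
        if len(tasks) >= batch_size:
            results.extend(await asyncio.gather(*tasks))
            tasks.clear()
    # Flush the final, possibly smaller, batch and keep its results too.
    if tasks:
        results.extend(await asyncio.gather(*tasks))
    return results


results = asyncio.run(process_in_batches(range(7)))
```

After the run, `order` alternates strictly between a page's start and its end, which suggests that this pattern by itself does not overlap the work of different pages.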
PyMuPDF does not support Python's multithreading - please see the documentation!
We therefore do not accept issues involving this feature.
PyMuPDF version
1.23.x or earlier
Operating system
MacOS
Python version
3.11