Open nataliia-obraztsova opened 4 months ago
Adding fitz.TOOLS.store_shrink(100) after pix = None actually helped a lot. Here is a link to an older issue that I missed at first: https://github.com/pymupdf/PyMuPDF/issues/130 I still see some gradual increase, so I'll leave the issue open for now.
Can you please provide printouts with numbers updated after the mentioned adjustments?
In general, if a permanently low memory footprint is desired (for whatever reason), the store should be shrunk generously. This is because of a number of reasons:
Below is the memory profile after the adjustments. Interestingly, while processing file f0, fitz.TOOLS.store_shrink(100) in line 47 seems to have made no difference: memory usage increased by only about 7 MiB, but it did not shrink back to the initial value. While processing file f1, fitz.TOOLS.store_shrink(100) in line 47 reduced memory usage a lot, but not all of it: an additional 20.12 MB remained. After that, usage seems to plateau.
P.S. I have upgraded PyMuPDF to 1.24.7
Memory usage before function: 53.28 MB
34 53.5 MiB 53.5 MiB 1 @profile
35 def render_page_to_image(file_name):
36 53.7 MiB 0.1 MiB 1 file_stream = read_file(file_name)
37 56.0 MiB 2.4 MiB 1 doc = fitz.open(stream=file_stream, filetype="pdf")
38 56.0 MiB 0.0 MiB 1 try:
39 56.0 MiB 0.0 MiB 1 number_of_pages = doc.page_count
40 67.4 MiB 0.0 MiB 4 for i in range(number_of_pages):
41 67.4 MiB 0.2 MiB 3 page = doc.load_page(i)
42
43 # Render the page to a pixmap (an image)
44 67.4 MiB 7.0 MiB 3 pix = page.get_pixmap()
45 67.4 MiB 3.5 MiB 3 img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46 67.4 MiB 0.0 MiB 3 pix = None
47 67.4 MiB 0.0 MiB 3 fitz.TOOLS.store_shrink(100)
48 # Convert the PIL Image to a bytes-like object
49 67.4 MiB 0.0 MiB 3 img_byte_buff = BytesIO()
50 67.4 MiB 0.6 MiB 3 img.save(img_byte_buff, format='JPEG')
51 67.4 MiB 0.0 MiB 3 img_byte_arr = img_byte_buff.getvalue()
52
53 # Encode the image bytes in base64 and decode to UTF-8 string
54 67.4 MiB 0.0 MiB 3 rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55
56 except Exception as e:
57 raise Exception(e.args)
58 finally:
59 67.4 MiB 0.0 MiB 1 doc.close()
60 67.4 MiB 0.0 MiB 1 fitz.TOOLS.store_shrink(100)
Memory usage after function: 60.41 MB Memory usage difference total: 7.13 MB
Memory usage before function: 60.41 MB
34 60.4 MiB 60.4 MiB 1 @profile
35 def render_page_to_image(file_name):
36 65.7 MiB 5.2 MiB 1 file_stream = read_file(file_name)
37 65.7 MiB 0.0 MiB 1 doc = fitz.open(stream=file_stream, filetype="pdf")
38 65.7 MiB 0.0 MiB 1 try:
39 65.7 MiB 0.0 MiB 1 number_of_pages = doc.page_count
40 100.4 MiB -70.7 MiB 33 for i in range(number_of_pages):
41 100.4 MiB -56.1 MiB 32 page = doc.load_page(i)
42
43 # Render the page to a pixmap (an image)
44 145.3 MiB 194.0 MiB 32 pix = page.get_pixmap()
45 145.3 MiB -289.6 MiB 32 img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46 145.3 MiB -289.6 MiB 32 pix = None
47 100.4 MiB -519.4 MiB 32 fitz.TOOLS.store_shrink(100)
48 # Convert the PIL Image to a bytes-like object
49 100.4 MiB -70.7 MiB 32 img_byte_buff = BytesIO()
50 100.4 MiB -70.7 MiB 32 img.save(img_byte_buff, format='JPEG')
51 100.4 MiB -70.7 MiB 32 img_byte_arr = img_byte_buff.getvalue()
52
53 # Encode the image bytes in base64 and decode to UTF-8 string
54 100.4 MiB -70.7 MiB 32 rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55
56 except Exception as e:
57 raise Exception(e.args)
58 finally:
59 85.8 MiB -14.6 MiB 1 doc.close()
60 85.8 MiB 0.0 MiB 1 fitz.TOOLS.store_shrink(100)
Memory usage after function: 80.53 MB Memory usage difference total: 20.12 MB
Memory usage before function: 80.53 MB
34 80.5 MiB 80.5 MiB 1 @profile
35 def render_page_to_image(file_name):
36 80.5 MiB 0.0 MiB 1 file_stream = read_file(file_name)
37 80.5 MiB 0.0 MiB 1 doc = fitz.open(stream=file_stream, filetype="pdf")
38 80.5 MiB 0.0 MiB 1 try:
39 80.5 MiB 0.0 MiB 1 number_of_pages = doc.page_count
40 80.5 MiB 0.0 MiB 4 for i in range(number_of_pages):
41 80.5 MiB 0.0 MiB 3 page = doc.load_page(i)
42
43 # Render the page to a pixmap (an image)
44 80.5 MiB 0.0 MiB 3 pix = page.get_pixmap()
45 80.5 MiB 0.0 MiB 3 img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46 80.5 MiB 0.0 MiB 3 pix = None
47 80.5 MiB 0.0 MiB 3 fitz.TOOLS.store_shrink(100)
48 # Convert the PIL Image to a bytes-like object
49 80.5 MiB 0.0 MiB 3 img_byte_buff = BytesIO()
50 80.5 MiB 0.0 MiB 3 img.save(img_byte_buff, format='JPEG')
51 80.5 MiB 0.0 MiB 3 img_byte_arr = img_byte_buff.getvalue()
52
53 # Encode the image bytes in base64 and decode to UTF-8 string
54 80.5 MiB 0.0 MiB 3 rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55
56 except Exception as e:
57 raise Exception(e.args)
58 finally:
59 80.5 MiB 0.0 MiB 1 doc.close()
60 80.5 MiB 0.0 MiB 1 fitz.TOOLS.store_shrink(100)
Memory usage after function: 80.53 MB Memory usage difference total: 0.00 MB
Memory usage before function: 80.53 MB
34 80.5 MiB 80.5 MiB 1 @profile
35 def render_page_to_image(file_name):
36 80.5 MiB 0.0 MiB 1 file_stream = read_file(file_name)
37 80.5 MiB 0.0 MiB 1 doc = fitz.open(stream=file_stream, filetype="pdf")
38 80.5 MiB 0.0 MiB 1 try:
39 80.5 MiB 0.0 MiB 1 number_of_pages = doc.page_count
40 80.5 MiB 0.0 MiB 4 for i in range(number_of_pages):
41 80.5 MiB 0.0 MiB 3 page = doc.load_page(i)
42
43 # Render the page to a pixmap (an image)
44 80.5 MiB 0.0 MiB 3 pix = page.get_pixmap()
45 80.5 MiB 0.0 MiB 3 img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46 80.5 MiB 0.0 MiB 3 pix = None
47 80.5 MiB 0.0 MiB 3 fitz.TOOLS.store_shrink(100)
48 # Convert the PIL Image to a bytes-like object
49 80.5 MiB 0.0 MiB 3 img_byte_buff = BytesIO()
50 80.5 MiB 0.0 MiB 3 img.save(img_byte_buff, format='JPEG')
51 80.5 MiB 0.0 MiB 3 img_byte_arr = img_byte_buff.getvalue()
52
53 # Encode the image bytes in base64 and decode to UTF-8 string
54 80.5 MiB 0.0 MiB 3 rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55
56 except Exception as e:
57 raise Exception(e.args)
58 finally:
59 80.5 MiB 0.0 MiB 1 doc.close()
60 80.5 MiB 0.0 MiB 1 fitz.TOOLS.store_shrink(100)
Memory usage after function: 80.53 MB Memory usage difference total: 0.00 MB
I encountered the same issue! Memory leak! I wrote a service that uses PyMuPDF to parse PDFs. Despite calling fitz.TOOLS.store_shrink(100) each time, the service crashes due to a memory leak after running for a while.
try:
    with fitz.Document(stream=data, filetype="pdf") as doc:
        ...
except Exception as e:
    logging...
finally:
    fitz.TOOLS.store_shrink(100)
    gc.collect()
other code:
zoom_x = request.imgsz / page_width
zoom_y = request.imgsz / page_height
zoom = min(zoom_x, zoom_y)
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat, colorspace="rgb", alpha=False)
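The zoom factor computed above decides how many pixels each pixmap has, and therefore roughly how much memory a single get_pixmap() call needs. The following is only a rough sketch of that arithmetic: fit_zoom and estimate_pixmap_bytes are hypothetical helper names, the estimate assumes an RGB pixmap without alpha (3 bytes per pixel), and the pixel dimensions are simply rounded, which may differ slightly from what MuPDF actually allocates.

```python
def fit_zoom(page_width: float, page_height: float, imgsz: int) -> float:
    """Uniform zoom that scales the page to fit inside an imgsz x imgsz square."""
    return min(imgsz / page_width, imgsz / page_height)


def estimate_pixmap_bytes(page_width: float, page_height: float,
                          imgsz: int, channels: int = 3) -> int:
    """Approximate size of the pixel buffer for one rendered page.

    channels=3 corresponds to RGB without alpha (an assumption matching
    colorspace="rgb", alpha=False in the snippet above).
    """
    zoom = fit_zoom(page_width, page_height, imgsz)
    width_px = round(page_width * zoom)
    height_px = round(page_height * zoom)
    return width_px * height_px * channels


# Example: a 100 x 200 pt page rendered to fit a 400 px square.
zoom = fit_zoom(100, 200, 400)
size = estimate_pixmap_bytes(100, 200, 400)
print(zoom, size)
```

Because the zoom is uniform, the longer page side ends up at imgsz pixels, so the buffer size is bounded by imgsz * imgsz * channels regardless of the page format.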
Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
I'm looking for this PDF. I'll share it once I find it.
Please do not mix different things in the same report! If you find that example please open a separate issue.
It seems that your "issue" goes back to the fact that Page.get_image_infos() places no restriction on the image bboxes and will include any image, even one that only intersects the MediaBox without being fully contained in it. Text extractions, by contrast, restrict their results (text or image) to objects contained in the MediaBox. If for whatever reason you need the two to agree, make sure to set the clip to the same value in both cases.
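One way to apply the same restriction to the image results is to filter the returned bboxes against the clip rectangle yourself. This is only a sketch under stated assumptions: is_contained and filter_contained are hypothetical helpers, rectangles are plain (x0, y0, x1, y1) tuples, each item is assumed to be a dict with a "bbox" entry of that form, and the sample data is made up for illustration.

```python
def is_contained(bbox, clip):
    """Return True if bbox lies fully inside clip; both are (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    cx0, cy0, cx1, cy1 = clip
    return cx0 <= x0 and cy0 <= y0 and x1 <= cx1 and y1 <= cy1


def filter_contained(image_infos, clip):
    """Keep only the items whose "bbox" is fully contained in clip."""
    return [info for info in image_infos if is_contained(info["bbox"], clip)]


# Illustrative data only: a Letter-sized MediaBox and two image entries.
mediabox = (0, 0, 612, 792)
infos = [
    {"number": 1, "bbox": (50, 50, 200, 200)},    # fully inside the page
    {"number": 2, "bbox": (600, 700, 700, 900)},  # only intersects the page
]
visible = filter_contained(infos, mediabox)
print([info["number"] for info in visible])
```

Using the same clip tuple for both the image filter and the text extraction keeps the two result sets comparable.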
Thank you very much, I will give it a try.
Description of the bug
When processing larger PDF files, the page.get_pixmap() method significantly increases memory usage and does not release it properly after completion. This results in a high memory footprint that persists until an even larger file is processed. This behavior can be observed in the memory profiling data provided below.
I implemented the operation as a function that is called in a loop, once per file. I set pix = None for each page and call doc.close() and fitz.TOOLS.store_shrink(100) for each document, as suggested in a similar issue (https://github.com/pymupdf/PyMuPDF/issues/1430). One can see that a significant increase in memory usage occurred while processing file f1 and that the high memory footprint persisted while processing later files.
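Per-file before/after figures of this kind can be collected with a small wrapper around the processing function. The following is a minimal sketch and not necessarily the script used for this report: rss_mb and run_with_memory_report are hypothetical names, and reading /proc/self/statm assumes Linux. It measures the resident set size of the whole process because memory allocated inside the MuPDF C library is not visible to Python-level tools such as tracemalloc.

```python
import os


def rss_mb() -> float:
    """Current resident set size of this process, in units of 1024*1024 bytes.

    Linux only: the second field of /proc/self/statm is the number of
    resident pages.
    """
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


def run_with_memory_report(func, *args):
    """Call func(*args), print memory usage before and after, return both."""
    before = rss_mb()
    print(f"Memory usage before function: {before:.2f} MB")
    result = func(*args)
    after = rss_mb()
    print(f"Memory usage after function: {after:.2f} MB")
    print(f"Memory usage difference total: {after - before:.2f} MB")
    return result, after - before


# Stand-in workload; in the scenario described here this would be the
# rendering function, called once per file.
result, diff = run_with_memory_report(lambda n: bytearray(n), 1024)
```

A difference that stays positive after the function returns does not by itself prove a leak, since the allocator may keep freed pages for reuse instead of returning them to the operating system.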
If there is a method I could call to release the memory please let me know.
Relevant closed issue https://github.com/pymupdf/PyMuPDF/issues/1430.
processing file f0
Memory usage before function: 34.70 MB
Line # Mem usage Increment Occurrences Line Contents
Memory usage after function: 39.10 MB Memory usage difference total: 4.41 MB
processing file f1
Memory usage before function: 39.10 MB
Line # Mem usage Increment Occurrences Line Contents
Memory usage after function: 301.36 MB Memory usage difference total: 262.26 MB
processing file f2
Memory usage before function: 301.36 MB
Line # Mem usage Increment Occurrences Line Contents
Memory usage after function: 301.36 MB Memory usage difference total: 0.00 MB
processing file f3
Memory usage before function: 301.36 MB
Line # Mem usage Increment Occurrences Line Contents
Memory usage after function: 301.36 MB Memory usage difference total: 0.00 MB
How to reproduce the bug
PyMuPDF version
1.23.x or earlier
Operating system
Linux
Python version
3.11