pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Memory Retention with fitz.page.get_pixmap() #3625

Open nataliia-obraztsova opened 4 months ago

nataliia-obraztsova commented 4 months ago

Description of the bug

When processing larger PDF files, page.get_pixmap() significantly increases memory usage and does not release it after completion. The result is a high memory footprint that persists until an even larger file is processed. This behavior can be seen in the memory profiling data below.

I implemented the operation as a function that is called in a loop, once per file. I set pix = None for each page and call doc.close() and fitz.TOOLS.store_shrink(100) for each document, as suggested in a similar issue: https://github.com/pymupdf/PyMuPDF/issues/1430. A significant increase in memory usage occurred while processing file f1, and the high memory footprint persisted while processing later files.

If there is a method I could call to release the memory please let me know.

Relevant closed issue https://github.com/pymupdf/PyMuPDF/issues/1430.
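
The "Memory usage before/after function" figures in the logs below can be gathered with a small wrapper like the following. This is only a sketch, not necessarily the script used for this report: it assumes Linux (it reads /proc/self/statm), and the names rss_mb and report_memory are hypothetical.

```python
import os


def rss_mb():
    """Return the current resident set size in MB, or None if unavailable."""
    try:
        with open("/proc/self/statm") as f:
            # Second field of statm is the number of resident pages.
            resident_pages = int(f.read().split()[1])
    except (OSError, IndexError, ValueError):
        return None  # not on Linux, or /proc is not mounted
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


def report_memory(func, *args, **kwargs):
    """Call func and print the resident memory before and after the call."""
    before = rss_mb()
    result = func(*args, **kwargs)
    after = rss_mb()
    if before is not None and after is not None:
        print(f"Memory usage before function: {before:.2f} MB")
        print(f"Memory usage after function: {after:.2f} MB")
        print(f"Memory usage difference total: {after - before:.2f} MB")
    return result
```

Calling report_memory(render_page_to_image, file_name) would print the three figures around one document. Resident memory is what the operating system sees, so it includes memory allocated by MuPDF's C code, which Python-level tools such as tracemalloc do not track.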

processing file f0

Memory usage before function: 34.70 MB

Line # Mem usage Increment Occurrences Line Contents

34     34.9 MiB     34.9 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     35.1 MiB      0.1 MiB           1       file_stream = read_file(file_name)
37     35.7 MiB      0.6 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     35.7 MiB      0.0 MiB           1       try:
39     35.7 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     46.6 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     46.6 MiB      0.2 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     46.6 MiB      6.4 MiB           3               pix = page.get_pixmap()
45     46.6 MiB      4.8 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     46.6 MiB     -1.3 MiB           3               pix = None
47                                                     # Convert the PIL Image to a bytes-like object
48     46.6 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
49     46.6 MiB      0.8 MiB           3               img.save(img_byte_buff, format='JPEG')
50     46.6 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
51                                         
52                                                     # Encode the image bytes in base64 and decode to UTF-8 string
53     46.6 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
54                                         
55                                             except Exception as e:
56                                                 raise Exception(e.args)
57                                             finally:
58     46.6 MiB      0.0 MiB           1           doc.close()
59     46.6 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 39.10 MB
Memory usage difference total: 4.41 MB

processing file f1

Memory usage before function: 39.10 MB

Line # Mem usage Increment Occurrences Line Contents

34     39.1 MiB     39.1 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     44.4 MiB      5.2 MiB           1       file_stream = read_file(file_name)
37     44.4 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     44.4 MiB      0.0 MiB           1       try:
39     44.4 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    343.5 MiB    -11.1 MiB          33           for i in range(number_of_pages):
41    343.3 MiB    -11.1 MiB          32               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    343.5 MiB    288.0 MiB          32               pix = page.get_pixmap()
45    343.5 MiB    -11.1 MiB          32               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    343.5 MiB    -11.1 MiB          32               pix = None
47                                                     # Convert the PIL Image to a bytes-like object
48    343.5 MiB    -11.1 MiB          32               img_byte_buff = BytesIO()
49    343.5 MiB    -11.1 MiB          32               img.save(img_byte_buff, format='JPEG')
50    343.5 MiB    -11.1 MiB          32               img_byte_arr = img_byte_buff.getvalue()
51                                         
52                                                     # Encode the image bytes in base64 and decode to UTF-8 string
53    343.5 MiB    -11.1 MiB          32               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
54                                         
55                                             except Exception as e:
56                                                 raise Exception(e.args)
57                                             finally:
58    306.7 MiB    -36.8 MiB           1           doc.close()
59    306.7 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 301.36 MB
Memory usage difference total: 262.26 MB

processing file f2

Memory usage before function: 301.36 MB

Line # Mem usage Increment Occurrences Line Contents

34    301.4 MiB    301.4 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36    301.4 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37    301.4 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38    301.4 MiB      0.0 MiB           1       try:
39    301.4 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    301.4 MiB      0.0 MiB           4           for i in range(number_of_pages):
41    301.4 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    301.4 MiB      0.0 MiB           3               pix = page.get_pixmap()
45    301.4 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    301.4 MiB      0.0 MiB           3               pix = None
47                                                     # Convert the PIL Image to a bytes-like object
48    301.4 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
49    301.4 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
50    301.4 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
51                                         
52                                                     # Encode the image bytes in base64 and decode to UTF-8 string
53    301.4 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
54                                         
55                                             except Exception as e:
56                                                 raise Exception(e.args)
57                                             finally:
58    301.4 MiB      0.0 MiB           1           doc.close()
59    301.4 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 301.36 MB
Memory usage difference total: 0.00 MB

processing file f3

Memory usage before function: 301.36 MB

Line # Mem usage Increment Occurrences Line Contents

34    301.4 MiB    301.4 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36    301.4 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37    301.4 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38    301.4 MiB      0.0 MiB           1       try:
39    301.4 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    301.4 MiB      0.0 MiB           4           for i in range(number_of_pages):
41    301.4 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    301.4 MiB      0.0 MiB           3               pix = page.get_pixmap()
45    301.4 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    301.4 MiB      0.0 MiB           3               pix = None
47                                                     # Convert the PIL Image to a bytes-like object
48    301.4 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
49    301.4 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
50    301.4 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
51                                         
52                                                     # Encode the image bytes in base64 and decode to UTF-8 string
53    301.4 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
54                                         
55                                             except Exception as e:
56                                                 raise Exception(e.args)
57                                             finally:
58    301.4 MiB      0.0 MiB           1           doc.close()
59    301.4 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 301.36 MB
Memory usage difference total: 0.00 MB

How to reproduce the bug

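import base64
from io import BytesIO

import fitz  # PyMuPDF
from PIL import Image


def read_file(file_name):
    try:
        # The with-block closes the file even if reading fails; the original
        # finally-clause raised NameError when open() itself failed.
        with open(file_name, 'rb') as file:
            return BytesIO(file.read())
    except Exception as e:
        raise Exception(f"There was an error processing the file(s) {e.args}")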

def render_page_to_image(file_name):
    file_stream = read_file(file_name)
    doc = fitz.open(stream=file_stream, filetype="pdf")
    try:
        number_of_pages = doc.page_count
        for i in range(number_of_pages):
            page = doc.load_page(i)

            # Render the page to a pixmap (an image)
            pix = page.get_pixmap()
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            pix = None
            # Convert the PIL Image to a bytes-like object
            img_byte_buff = BytesIO()
            img.save(img_byte_buff, format='JPEG')
            img_byte_arr = img_byte_buff.getvalue()

            # Encode the image bytes in base64 and decode to UTF-8 string
            rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')

    except Exception as e:
        raise Exception(e.args)
    finally:
        doc.close()
        fitz.TOOLS.store_shrink(100)

for i in range(4):
    # f-string so that {i} is actually substituted into the file name
    file_name = f'xxx.pdf_{i}.pdf'
    render_page_to_image(file_name)

PyMuPDF version

1.23.x or earlier

Operating system

Linux

Python version

3.11

nataliia-obraztsova commented 4 months ago

Adding fitz.TOOLS.store_shrink(100) after pix = None actually helped a lot. Here is a link to an older issue that I missed at first: https://github.com/pymupdf/PyMuPDF/issues/130. I still see some gradual increase, so I'll leave the issue open for now.

JorjMcKie commented 4 months ago

Can you please provide printouts with numbers updated after the mentioned adjustments?

In general, if a permanently low memory footprint is desired (for whatever reason), the store should be shrunk generously. This is for a number of reasons:

  1. MuPDF's strategy is to keep things in memory, especially objects that tend to be large, like images and fonts.
  2. Deleting Python objects is only one side of the coin: the underlying C object in MuPDF is not necessarily removed as well in each case.
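
As a rough illustration of the two points above, the sketch below lets Python finalize unreferenced objects first and then shrinks MuPDF's store, reporting the store size before and after. It is a sketch under assumptions: the helper name release_mupdf_store is hypothetical, and it relies on fitz.TOOLS.store_size and fitz.TOOLS.store_shrink() behaving as documented for the installed PyMuPDF version.

```python
import gc


def release_mupdf_store(tools=None, percent=100):
    """Shrink MuPDF's store and return its size before and after.

    `tools` defaults to fitz.TOOLS; passing it in keeps the helper easy
    to test without PyMuPDF installed.
    """
    if tools is None:
        import fitz  # PyMuPDF

        tools = fitz.TOOLS
    gc.collect()  # let Python finalize unreferenced pixmaps and pages first
    before = tools.store_size
    tools.store_shrink(percent)  # 100 asks MuPDF to empty the store
    after = tools.store_size
    return before, after
```

It would be called after setting pix = None, and again after doc.close() once a document is finished. A smaller store does not guarantee that the resident memory of the process drops by the same amount, since the allocator may hold on to freed pages.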

nataliia-obraztsova commented 4 months ago

Below you can see the memory profile after the adjustments. Interestingly, while processing file f0, fitz.TOOLS.store_shrink(100) in line 47 seems to have made no difference; memory usage increased by only about 7 MiB but did not shrink back to the initial value. While processing file f1, fitz.TOOLS.store_shrink(100) in line 47 reduced memory usage a lot, but still not all of it: an additional 20.12 MB remained. After that, usage seems to plateau.

P.S. I have upgraded PyMuPDF to 1.24.7

memory profiling after adjustments

processing file f0

Memory usage before function: 53.28 MB

Line # Mem usage Increment Occurrences Line Contents

34     53.5 MiB     53.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     53.7 MiB      0.1 MiB           1       file_stream = read_file(file_name)
37     56.0 MiB      2.4 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     56.0 MiB      0.0 MiB           1       try:
39     56.0 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     67.4 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     67.4 MiB      0.2 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     67.4 MiB      7.0 MiB           3               pix = page.get_pixmap()
45     67.4 MiB      3.5 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     67.4 MiB      0.0 MiB           3               pix = None
47     67.4 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     67.4 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     67.4 MiB      0.6 MiB           3               img.save(img_byte_buff, format='JPEG')
51     67.4 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     67.4 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     67.4 MiB      0.0 MiB           1           doc.close()
60     67.4 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 60.41 MB
Memory usage difference total: 7.13 MB

processing file f1

Memory usage before function: 60.41 MB

Line # Mem usage Increment Occurrences Line Contents

34     60.4 MiB     60.4 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     65.7 MiB      5.2 MiB           1       file_stream = read_file(file_name)
37     65.7 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     65.7 MiB      0.0 MiB           1       try:
39     65.7 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    100.4 MiB    -70.7 MiB          33           for i in range(number_of_pages):
41    100.4 MiB    -56.1 MiB          32               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    145.3 MiB    194.0 MiB          32               pix = page.get_pixmap()
45    145.3 MiB   -289.6 MiB          32               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    145.3 MiB   -289.6 MiB          32               pix = None
47    100.4 MiB   -519.4 MiB          32               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49    100.4 MiB    -70.7 MiB          32               img_byte_buff = BytesIO()
50    100.4 MiB    -70.7 MiB          32               img.save(img_byte_buff, format='JPEG')
51    100.4 MiB    -70.7 MiB          32               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54    100.4 MiB    -70.7 MiB          32               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     85.8 MiB    -14.6 MiB           1           doc.close()
60     85.8 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 20.12 MB

processing file f2

Memory usage before function: 80.53 MB

Line # Mem usage Increment Occurrences Line Contents

34     80.5 MiB     80.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     80.5 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37     80.5 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     80.5 MiB      0.0 MiB           1       try:
39     80.5 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     80.5 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     80.5 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     80.5 MiB      0.0 MiB           3               pix = page.get_pixmap()
45     80.5 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     80.5 MiB      0.0 MiB           3               pix = None
47     80.5 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     80.5 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     80.5 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
51     80.5 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     80.5 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     80.5 MiB      0.0 MiB           1           doc.close()
60     80.5 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 0.00 MB

processing file f3

Memory usage before function: 80.53 MB

Line # Mem usage Increment Occurrences Line Contents

34     80.5 MiB     80.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     80.5 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37     80.5 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     80.5 MiB      0.0 MiB           1       try:
39     80.5 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     80.5 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     80.5 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     80.5 MiB      0.0 MiB           3               pix = page.get_pixmap()
45     80.5 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     80.5 MiB      0.0 MiB           3               pix = None
47     80.5 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     80.5 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     80.5 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
51     80.5 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     80.5 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     80.5 MiB      0.0 MiB           1           doc.close()
60     80.5 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 0.00 MB

yoliax commented 4 months ago

I encountered the same issue: a memory leak! I wrote a service that uses PyMuPDF to parse PDFs. Despite calling fitz.TOOLS.store_shrink(100) each time, the service crashes because of the memory leak after running for a while.

try:
    with fitz.Document(stream=data, filetype="pdf") as doc:
        ...
except Exception as e:
    logging...
finally:
    fitz.TOOLS.store_shrink(100)
    gc.collect()

other code:

zoom_x = request.imgsz / page_width
zoom_y = request.imgsz / page_height
zoom = min(zoom_x, zoom_y)

mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat, colorspace="rgb", alpha=False)
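
For a sense of scale, the size of one pixmap can be estimated before rendering. The sketch below reuses the zoom computation above; fit_zoom and estimate_pixmap_bytes are hypothetical helper names, and the estimate assumes one byte per component and three components (RGB, no alpha), ignoring exactly how MuPDF rounds the dimensions.

```python
import math


def fit_zoom(page_width, page_height, imgsz):
    """Zoom factor that fits the page inside an imgsz x imgsz square."""
    return min(imgsz / page_width, imgsz / page_height)


def estimate_pixmap_bytes(page_width, page_height, zoom, components=3):
    """Approximate sample buffer size of a pixmap rendered at `zoom`."""
    width = math.ceil(page_width * zoom)
    height = math.ceil(page_height * zoom)
    return width * height * components


# An A4 page is about 595 x 842 points.
zoom = fit_zoom(595, 842, imgsz=1024)
size_mb = estimate_pixmap_bytes(595, 842, zoom) / 1e6
print(f"zoom={zoom:.3f}, about {size_mb:.1f} MB per pixmap")
```

Because the byte count grows with the square of the zoom factor, doubling the zoom roughly quadruples the memory each get_pixmap() call needs.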

yoliax commented 4 months ago

Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?

I'm looking for this PDF. I'll share it once I find it.

JorjMcKie commented 4 months ago

> Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
>
> I'm looking for this PDF. I'll share it once I find it.

Please do not mix different things in the same report! If you find that example please open a separate issue.

JorjMcKie commented 4 months ago

> Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
>
> I'm looking for this PDF. I'll share it once I find it.

It seems that your "issue" comes down to the fact that Page.get_image_info() places no restriction on the image bboxes and will include any image, even if it only intersects the MediaBox (and is not fully contained). Text extractions, by contrast, restrict results (text or image) to objects contained in the MediaBox. If you need the two to agree for whatever reason, make sure to set the clip to the same value in both cases.
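
A minimal sketch of that adjustment, assuming each entry returned by the image-info call is a dict with a "bbox" key holding (x0, y0, x1, y1), and that clip is given in the same coordinates. The helper names are hypothetical, and plain tuples stand in for fitz.Rect objects.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies completely inside rectangle `outer`."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def images_within(image_infos, clip):
    """Keep only the image entries whose bbox is fully contained in clip."""
    return [info for info in image_infos if contains(clip, info["bbox"])]


# Stand-in data; real entries would come from the page's image info.
infos = [
    {"bbox": (10, 10, 100, 100)},   # fully inside the page
    {"bbox": (-50, 700, 60, 900)},  # only overlaps the page edge
]
page_rect = (0, 0, 595, 842)
print(len(images_within(infos, page_rect)))
```

Passing the same rectangle as the clip of the text extraction call should then make both results cover the same area.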

yoliax commented 4 months ago

>> Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page? I'm looking for this PDF. I'll share it once I find it.
>
> It seems that your "issue" comes down to the fact that Page.get_image_info() places no restriction on the image bboxes and will include any image, even if it only intersects the MediaBox (and is not fully contained). Text extractions, by contrast, restrict results (text or image) to objects contained in the MediaBox. If you need the two to agree for whatever reason, make sure to set the clip to the same value in both cases.

Thank you very much, I will give it a try.