Memory leak when using docx.Document() with io.Bytes

Mathacc commented 4 months ago

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    37     65.7 MiB     65.7 MiB           1   @profile
    38                                         def process_docx(docx_stream: IO[bytes]) -> list[str]:
    39                                             """
    40                                             Process a DOCX document and extract text and images.
    41
    42                                             Args:
    43                                                 docx_stream (IO[bytes]): The input DOCX file stream.
    44
    45                                             Returns:
    46                                                 list[str]: A list of extracted text and images from the document.
    47                                             """
    48     65.8 MiB      0.0 MiB           1       zip_file = zipfile.ZipFile(docx_stream)
    49     71.3 MiB      5.6 MiB           1       document = docx.Document(docx_stream)
    50                                             #docx_stream.close()
    51
    52     71.3 MiB      0.0 MiB           1       image_rels = {}
    53     71.3 MiB      0.0 MiB          28       for rel in document.part.rels.values():
    54     71.3 MiB      0.0 MiB          27           if isinstance(rel._target, docx.parts.image.ImagePart):
    55     71.3 MiB      0.0 MiB           1               image_rels[rel.rId] = rel._target.partname
    56
    57     71.3 MiB      0.0 MiB           1       image_paragraphs = []
    58     71.3 MiB      0.0 MiB           1       text_pages = []
    59     71.3 MiB      0.0 MiB           1       current_page_text = ''
    60     75.2 MiB      0.1 MiB         129       for element in document.iter_inner_content():
    61     75.2 MiB      0.0 MiB         128           if isinstance(element, docx.text.paragraph.Paragraph):
    62     75.2 MiB      0.0 MiB         125               current_page_text += element.text + '\n'
    63     75.2 MiB      0.1 MiB         125               if 'graphicData' in element._p.xml or 'Graphic' in element._p.xml:

The profile shows memory growing from 65.7 MiB to 75.2 MiB during the call (5.6 MiB of that at docx.Document()), and it is never released, even after the function returns. Any solutions?

PS: I also tested with a file path instead of a stream, and it has the same issue.

scanny commented 4 months ago

Not sure how to interpret your figure. Can you point out more specifically what you're seeing there and what you expect it to be?

Mathacc commented 4 months ago

Sorry, it was not releasing the memory on its own, but once I added gc.collect() after calling the function, the memory was freed.
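
For anyone landing here, the workaround can be sketched roughly as below. This is only a minimal sketch: call_then_collect is a hypothetical helper name (not part of python-docx), and the commented usage assumes the process_docx function from the profile above and a placeholder file name.

```python
import gc
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def call_then_collect(func: Callable[..., T], *args: Any, **kwargs: Any) -> T:
    """Call func, then force a full garbage-collection pass.

    Hypothetical helper for illustration; not part of python-docx.
    """
    try:
        return func(*args, **kwargs)
    finally:
        # By this point func's local variables are gone, so anything that
        # is only kept alive by reference cycles can be reclaimed.
        gc.collect()


# Assumed usage with the function from the profile above
# ("example.docx" is a placeholder path):
# with open("example.docx", "rb") as stream:
#     pages = call_then_collect(process_docx, stream)
```

Calling gc.collect() after every document adds some CPU cost, so in a batch job it may be enough to call it every N documents instead.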

Yaadon commented 1 month ago

I have the same problem, and calling gc.collect() solves it.
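
A possible explanation, which nobody in this thread has confirmed for python-docx, is that the document objects end up in reference cycles. CPython frees ordinary objects as soon as their reference count drops to zero, but objects that reference each other in a cycle wait for the cyclic garbage collector. The stdlib-only sketch below shows that behaviour with a made-up Node class standing in for a parent/child object tree; it does not use python-docx at all.

```python
import gc
import weakref


class Node:
    """Made-up stand-in for an object tree with parent back-references."""

    def __init__(self):
        self.parent = None
        self.children = []


def make_cycle():
    parent = Node()
    child = Node()
    parent.children.append(child)
    child.parent = parent  # the back-reference closes the cycle
    # Return only a weak reference so the caller does not keep parent alive.
    return weakref.ref(parent)


gc.disable()  # keep the automatic collector from running mid-demo
try:
    ref = make_cycle()
    alive_before = ref() is not None  # cycle keeps both objects alive
    gc.collect()
    alive_after = ref() is not None   # cycle collector has freed them
finally:
    gc.enable()

print(alive_before, alive_after)  # prints: True False
```

If this is what happens here, it would be delayed collection rather than a true leak: the automatic collector would eventually reclaim the objects, and gc.collect() just makes that happen right away.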

python-openxml / python-docx

Memory leak when using docx.Document() with io.Bytes #1364