masaccio / numbers-parser

Python module for parsing Apple Numbers .numbers files
MIT License

numbers-parser leaks(?) memory. #67

Closed NAThompson closed 9 months ago

NAThompson commented 9 months ago

Describe the bug

Memory consumption is excessive.

To Reproduce

pip install memray
python3 -m memray run -o output.bin test_numbers.py
python3 -m memray flamegraph output.bin
open output.html

Python file:

import sys
from numbers_parser import Document
import gc

for i in range(32):
    doc = Document("test.numbers")
    print(i)
    del doc         # does not help
    gc.collect()    # does not help

print("[return] to exit")
sys.stdin.readline()

Expected behavior

Less memory consumption.

Attachments

(Attached; I had to rename it with a .txt extension to get it past GitHub.)

test.numbers.txt

Additional information

$ pip show numbers-parser
Name: numbers-parser
Version: 4.4.4
Summary: Read and write Apple Numbers spreadsheets
Home-page: https://github.com/masaccio/numbers-parser
masaccio commented 9 months ago

Initial analysis says this is functools.lru_cache holding onto references. Since __del__ won't reliably be called to keep memory usage low (it certainly isn't in my testing), would a method such as Document.release() that forces memory to be freed by clearing the cache be useful for you? I did look at a weakref version of caching, but it's horribly inefficient, making performance 2-3x slower for everyone. The alternative would be to cache explicitly in each class method without using lru_cache.
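
For context, here is a minimal sketch of why functools.lru_cache on a method pins instances in memory, and how a release()-style call that clears the cache lets them be collected. The Table class and its cell()/release() methods are hypothetical stand-ins, not numbers-parser's actual internals:

import functools
import weakref

class Table:
    """Hypothetical stand-in for a numbers-parser internal class."""

    def __init__(self, data):
        self._data = data  # large parsed payload

    @functools.lru_cache(maxsize=None)
    def cell(self, row, col):
        # The cache lives on the class-level wrapper and its keys include
        # `self`, so every instance that has been used as an argument stays
        # referenced until its entries are evicted -- even after the caller
        # drops its own reference.
        return self._data[row][col]

    def release(self):
        # lru_cache only exposes cache_clear(), which drops *all* entries
        # (for every instance); a per-instance clear needs a custom cache.
        Table.cell.cache_clear()

t = Table([[0, 1], [2, 3]])
t.cell(0, 1)
probe = weakref.ref(t)
del t
print(probe() is None)   # False: the lru_cache still holds the instance
Table.cell.cache_clear()
print(probe() is None)   # True: clearing the cache lets it be collected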

NAThompson commented 9 months ago

@masaccio : I think such a method would be useful. However, I worry you'll get this same bug report many times and have to explain it over and over again!

That said, my interest is more in ecosystem improvement than in any particular solution. So a more interesting question to me: do you think you have enough evidence to submit a bug against functools.lru_cache?

masaccio commented 9 months ago

I decided to just rewrite functools.lru_cache to give me control over the cache references. This code:

from numbers_parser import Document
from psutil import Process

process = Process()
rss_base = process.memory_info().rss

for i in range(32):
    doc = Document("test.numbers")
    print("memory:", int((process.memory_info().rss - rss_base) / (1024 * 1024)), "MB")

stabilises at around 200 MB of usage on my Mac. Calling gc.collect() in the loop doesn't change that much, so there may still be some leaks, but certainly far fewer than before.
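
For illustration, one way to "rewrite lru_cache to give control over the cache references" is to keep the memo dict on the instance itself, so cached values are released as soon as the owning object is garbage collected. This is only a sketch of the idea (instance_cache and the Table class are hypothetical names, not the code that shipped in 4.4.5):

import functools

def instance_cache(func):
    """Memoise a method in a per-instance dict instead of a shared,
    module-level cache, so entries die with the object."""

    @functools.wraps(func)
    def wrapper(self, *args):
        cache = self.__dict__.setdefault("_method_cache", {})
        key = (func.__name__, args)
        if key not in cache:
            cache[key] = func(self, *args)
        return cache[key]

    return wrapper

class Table:
    def __init__(self, data):
        self._data = data

    @instance_cache
    def cell(self, row, col):
        # Cached per instance: dropping the Table drops the cache too.
        return self._data[row][col]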

masaccio commented 9 months ago

That said:

from pympler.tracker import SummaryTracker
from numbers_parser import Document

def func():
    doc = Document("test.numbers")

tracker = SummaryTracker()
for i in range(5):
    func()
tracker.print_diff()

indicates that there aren't many references left hanging around:

                   types |   # objects |   total size
======================== | =========== | ============
                    list |        4685 |    406.86 KB
                     str |        4693 |    332.30 KB
                     int |         954 |     26.11 KB
                    dict |          15 |     21.27 KB
                    code |           5 |      9.64 KB
                    type |           5 |      6.80 KB
             abc.ABCMeta |           0 |    944     B
                   tuple |          13 |    712     B
   weakref.ReferenceType |           5 |    360     B
       function (decode) |           2 |    288     B
       function (encode) |           2 |    288     B
                  method |           3 |    192     B
   function (store_info) |           1 |    144     B
  function (getregentry) |           1 |    144     B
   encodings.cp437.Codec |           2 |     96     B

NAThompson commented 9 months ago

@masaccio : I have verified that on 4.4.4 memory use increases monotonically, whereas on 4.4.5 it stays bounded.

Thanks so much!