python / cpython

The Python programming language
https://www.python.org

Json decode memory leak #117717

Closed FrancescoManfredi closed 6 months ago

FrancescoManfredi commented 6 months ago

Bug report

Bug description:

While decoding with the json module, a (comparatively) huge amount of memory is allocated and never deallocated. The following example shows that up to 291 MiB of memory remains allocated after decoding only 782.21 KiB of actual data.
Furthermore, an explicit call to gc.collect() does not find any unreachable objects.

To reproduce the behavior (the script writes its output to a file whose path is given in the OUTPUT_FILE environment variable):

import json
import sys
import tracemalloc
import os

json_data = """{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}"""

output_lines = []

tracemalloc.start()

output_lines.append("Decoding json data...")
all_copies = []  # keep every decoded object alive so nothing can be freed
for i in range(100000):
    all_copies.append(json.loads(json_data))

output_lines.append("FINAL MALLOC STATS:")
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:3]:
    output_lines.append(str(stat))
output_lines.append("---")

output_lines.append(f"Decoded data is {sys.getsizeof(all_copies)/1024:.2f}KiB")

with open(os.getenv("OUTPUT_FILE"), "w") as f:
    f.writelines([l+"\n" for l in output_lines])

Output on 3.8:

Decoding json data...
FINAL MALLOC STATS:
/usr/local/lib/python3.8/json/decoder.py:353: size=291 MiB, count=3799966, average=80 B
./reproduce.py:36: size=805 KiB, count=1, average=805 KiB
/usr/local/lib/python3.8/json/decoder.py:337: size=512 B, count=1, average=512 B
---
Decoded data is 805.13KiB

3.9:

Decoding json data...
FINAL MALLOC STATS:
/usr/local/lib/python3.9/json/decoder.py:353: size=291 MiB, count=3799958, average=80 B
/usr/src/app/./reproduce.py:36: size=782 KiB, count=1, average=782 KiB
/usr/local/lib/python3.9/json/decoder.py:337: size=512 B, count=1, average=512 B
---
Decoded data is 782.21KiB

3.10:

Decoding json data...
FINAL MALLOC STATS:
/usr/local/lib/python3.10/json/decoder.py:353: size=291 MiB, count=3799959, average=80 B
/usr/src/app/./reproduce.py:36: size=782 KiB, count=1, average=782 KiB
/usr/local/lib/python3.10/json/__init__.py:299: size=512 B, count=2, average=256 B
---
Decoded data is 782.21KiB

3.11:

Decoding json data...
FINAL MALLOC STATS:
/usr/local/lib/python3.11/json/decoder.py:353: size=259 MiB, count=3799963, average=72 B
/usr/src/app/./reproduce.py:36: size=782 KiB, count=1, average=782 KiB
/usr/local/lib/python3.11/tracemalloc.py:558: size=56 B, count=1, average=56 B
---
Decoded data is 782.21KiB

3.12:

Decoding json data...
FINAL MALLOC STATS:
/usr/local/lib/python3.12/json/decoder.py:353: size=241 MiB, count=3799954, average=66 B
/usr/src/app/./reproduce.py:36: size=782 KiB, count=1, average=782 KiB
/usr/local/lib/python3.12/tracemalloc.py:558: size=56 B, count=1, average=56 B
---
Decoded data is 782.21KiB
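A quick back-of-envelope check (my own arithmetic, not from the reports above) suggests these numbers are consistent with live data rather than a leak: roughly 3.8 million tracked allocations over 100000 loads() calls is about 38 objects per decoded document, and 291 MiB spread over 100000 copies is about 3 KiB per copy of this nested structure, which is plausible for a dict tree with ~20 strings:

```python
# Figures taken from the 3.8 tracemalloc output above.
allocations = 3_799_966      # "count" reported for json/decoder.py:353
total_bytes = 291 * 1024**2  # "size" reported for json/decoder.py:353
n_decodes = 100_000          # loop iterations in the repro script

per_decode_objects = allocations / n_decodes
per_decode_bytes = total_bytes / n_decodes
print(f"{per_decode_objects:.0f} objects, "
      f"{per_decode_bytes / 1024:.1f} KiB per decoded copy")
```

So each loads() call produces a few dozen small objects totaling a few KiB, and keeping all 100000 results alive in all_copies accounts for the full 291 MiB.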

CPython versions tested on:

3.8, 3.9, 3.10, 3.11, 3.12

Operating systems tested on:

Linux

FrancescoManfredi commented 6 months ago

I think I was wrong. My data actually is that big; I just did not know that sys.getsizeof() does not recursively look inside containers. I guess this can be closed. Sorry :)
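For anyone hitting the same confusion: sys.getsizeof() reports only the container's own footprint (the list's array of pointers here), not the objects it references. A rough recursive sizer makes the difference visible; deep_sizeof below is a hypothetical helper of my own, a sketch that only handles the container types json.loads produces, not a complete solution:

```python
import json
import sys

def deep_sizeof(obj, seen=None):
    """Approximate total size of obj, including the objects it references.
    Covers dict/list/tuple/set; leaves (str, int, float, ...) are sized
    by sys.getsizeof directly. Shared objects are counted once via `seen`."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

doc = json.loads('{"a": [1, 2, 3], "b": {"c": "text"}}')
print(sys.getsizeof(doc))  # shallow: just the outer dict's own storage
print(deep_sizeof(doc))    # includes keys, values, and nested containers
```

Applied to one decoded copy of the glossary document, this kind of deep measurement lands in the KiB range per copy, which multiplied by 100000 matches what tracemalloc reports.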