lk-geimfari / mimesis

Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.
https://mimesis.name
MIT License

Potential memory leak in Field #1190

Closed · pedromb closed this issue 2 years ago

pedromb commented 2 years ago

Bug report

What's wrong

When creating a Field inside a loop, the memory doesn't seem to be released after the loop completes. I'm not sure if this is expected behaviour, but I would assume not.

Here is a reproducible example that highlights the issue, using memory_profiler to measure memory usage and matplotlib to plot the memory usage graph.

from mimesis import Field
from memory_profiler import memory_usage
import matplotlib.pyplot as plt

def schema(f):
    return {
        "id": f("uuid"),
        "name": f("full_name"),
        "email": f("person.email"),
        "timestamp": f("timestamp", posix=False),
        "car_model": f("car"),
        "address": {"full_address": f("address"), "city": f("city"), "zip_code": f("zip_code")},
    }

def test_field_inside_loop():
    # Creates a new Field on every iteration; memory keeps growing.
    for _ in range(1000):
        f = Field("en")
        schema(f)

def test_field_outside_loop():
    # Reuses a single Field across all iterations; memory stays flat.
    f = Field("en")
    for _ in range(1000):
        schema(f)

def plot(function_to_plot):
    # Sample memory usage every 10 ms while the function runs, then plot it.
    mem_usage = memory_usage(function_to_plot, interval=0.01)
    plt.plot(mem_usage)
    plt.show()

plot(test_field_outside_loop)
plot(test_field_inside_loop)

Memory keeps piling up when Field is created inside the loop.

How it should be

Memory should be released after the field is used within a loop.

System information

macOS Monterey 12.2.1 (M1, 2020)
Python 3.8.8
mimesis 5.1.0

I'm pretty sure it doesn't depend on the system; I had the same issue on a Linux (Ubuntu 20.04) machine.

vbuaraujo commented 2 years ago

From looking at the code, it seems the problem is the use of functools.lru_cache in https://github.com/lk-geimfari/mimesis/blob/master/mimesis/providers/base.py#L126. If I understand correctly, this is meant to load the JSON data only once, but the cache is keyed on all function arguments, including self, which is different for every instance. So the cache doesn't really do anything useful (each instance loads the data again), and it also keeps the data from every instance around indefinitely.
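To make the failure mode concrete, here is a minimal, self-contained sketch (a hypothetical Provider class, not the actual mimesis code) of why lru_cache on an instance method behaves this way: self is part of the cache key, so every new instance adds a new entry, and the cache keeps each instance's data alive even after the instance itself goes out of scope.

from functools import lru_cache

class Provider:
    def __init__(self, locale):
        self.locale = locale

    @lru_cache(maxsize=None)
    def _load_data(self):
        # Stand-in for parsing a locale JSON file.
        return {"locale": self.locale, "payload": list(range(100_000))}

for _ in range(1000):
    p = Provider("en")
    p._load_data()
# Every iteration creates a distinct `self`, so the cache never hits,
# and all 1000 payloads remain referenced by the cache after the loop.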

A quick solution is to just remove lru_cache from this function. It won't cache anything, but the current cache is not useful anyway, and it will free memory after usage.

A possibly better solution is to move all parts of that function that don't depend on self to a staticmethod and cache the staticmethod instead.
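For illustration, a rough sketch of that refactor (hypothetical names, not the actual mimesis implementation): the expensive, locale-keyed load moves into a staticmethod, and only that is cached, so self is no longer part of the cache key.

from functools import lru_cache

class Provider:
    def __init__(self, locale):
        self.locale = locale
        # Per-instance state stays here; the heavy load is shared via the
        # cached staticmethod below.
        self._data = self._read_global_file(locale)

    @staticmethod
    @lru_cache(maxsize=None)
    def _read_global_file(locale):
        # Stand-in for parsing the locale JSON file; cached on `locale`
        # only, so all instances with the same locale share one copy.
        return {"locale": locale, "payload": list(range(100_000))}

Note the decorator order: lru_cache wraps the plain function and staticmethod is stacked on top, which keeps the call working on Python 3.8 and keys the cache on locale alone, so memory is bounded by the number of distinct locales rather than the number of instances.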

lk-geimfari commented 2 years ago

@vbuaraujo Correct, but instantiating Field inside the loop is completely unnecessary, so I'm not sure we should do anything about this problem.

lk-geimfari commented 2 years ago

Well, I've fixed this issue. I've removed lru_cache since it works even faster without it. I'll release 6.0.0 soon.