jazzband / jsonmodels

jsonmodels is library to make it easier for you to deal with structures that are converted to, or read from JSON.
http://jsonmodels.readthedocs.org/en/latest/
BSD 3-Clause "New" or "Revised" License
334 stars 51 forks source link

Storage of data in field leaks memory #64

Closed grainednoise closed 8 years ago

grainednoise commented 8 years ago

Bug

Any class derived from jsonmodels.models.Base with at least one field in it (e.g.: jsonmodels.fields.StringField) will leak all instances of the class it is used in, provided the field is assigned a value at least once.

General

The following code will demonstrate the problem. Just let it run, and watch the Python process consume more and more memory.

from jsonmodels.fields import StringField
from jsonmodels.models import Base

class Test(Base):
    name = StringField()

while True:
    t = Test(name="leak")

Problem description

The problem is that all instances of BaseField have a field called _memory in it containing a dict of all (instance, value) pairs in it. This is, in my opinion, a poor design decision for several reasons, other than the resulting memory leak itself:

Getting back to the original bug: the obvious way of trying to alleviate the symptoms is by using a WeakKeyDictionary here, but that would not address the deeper problem, although it could be used as a stopgap measure.

In my opinion, the best approach would be to use MetaClasses here in order to know what name will be used for the field. Knowing that name will make it possible to either create unique attribute names, or to use those names directly as keys in a dict on the instance itself.

beregond commented 8 years ago

Hey @grainednoise, thanks for bug report with precise description. I'll surely look into this closer.

beregond commented 8 years ago

@grainednoise so it was easy fix after all, I've made this your way with weakref. I'll make bugfix release soon.

Also to address some of your objections:

All instance state should be in the instance itself, not only to avoid surprising the user, but also to avoid problems with serialization (pickling, RPC libraries).

Well, I think descriptors rocks - it was one of main assumptions and consequences - well, you can always access the data, so nothing for user to be surprised and you can always serialize entity to JSON - there is no magic when you'd like to do pickling or send through RPC, but it was not meant to, just use serialization method. Besides no unnecessary magic (which means you need explicite cast instance to JSON before sending it to api) is nice and I think 'python way'.

Conversely, having state in a data descriptor is usually not a good design pattern. This point might warrant a discussion all of its own, but to name a practical problem: how are you going to inspect that state directly during debugging?

Descriptors are transparent, so I don't see any problem. Having state in descriptor is not a bad decision (especially after this fix ;) ) because - why not? Thanks to descriptors we can instantiate field while making declaration in code - so we can instantly validate initial constraints and raise error in line of definition (which is useful for debugging). Also this doesn't require magic in metaclass neither in init to instantiate all fields each time (so less memory).

Not all objects are suited to be used as keys in dicts. In general, you should avoid using mutable objects as keys here. Implementing something as seemingly innocuous as an eq on an instance could result in that instance seemingly disappearing from the dict. Now, that'll make for a fun debugging session...

This would require overloading methods to return different id and/or hash - and if you are doing that there is always risk, you need to know what you are doing, especially if you are messing with objects that inherits from 3rd party lib classes. On the other hand this makes some really interesting opportunities when dealing with instances with some identity assigned, some "semiborg" pattern could be used. But anyway if you are doing such things - I can't do anything to prevent someone from breaking the code :)

In my opinion, the best approach would be to use MetaClasses here in order to know what name will be used for the field. Knowing that name will make it possible to either create unique attribute names, or to use those names directly as keys in a dict on the instance itself.

This would create memory leak problem again and would make unnecessary bond between field and name - for now I didn't find any case where field would require it's name, so why forcing such constraint without good reason?

Anyway thanks for help :)

beregond commented 8 years ago

@grainednoise fixed in 2.1.2, thanks :)