msiemens / tinydb

TinyDB is a lightweight document oriented database optimized for your happiness :)
https://tinydb.readthedocs.org
MIT License
6.82k stars 549 forks source link

Refactoring Elements #105

Closed neRok00 closed 8 years ago

neRok00 commented 8 years ago

I was writing a package to build a simple graph on top of TinyDB, and came into a problem similar to #97. After reading the TinyDB source code, it isn't really possible (or at least comes with a huge performance penalty).

At the moment, whilst you can replace the Table class used by the TinyDB class, you have no control over the Element or StorageProxy classes. Instances of StorageProxy are created by the TinyDB class, and passed to the Table class. The StorageProxy then creates Element instances, with no way of changing it.

If I were to extend Table and replace the _read method with my own in order to use my own objects, there would be quite a performance penalty. The reason being, JSONStorage by way of using the json package has created dict objects for every element, then StorageProxy has gone and replaced each of those dicts with an Element object, and then I would replace them again with my own. Additionally, whilst I could pass my object to the json module with the object_hook argument, this object would then be replaced by the StorageProxy anyway, defeating the purpose.

The way I see it, the only purpose of the Element class is to give the element the eid attribute. I think this can be acheived in 2 ways;

  1. It would be possible to create a custom json decoder, based upon json/simplejson. I have looked over the source code for those packages, and it would be possible to maintain a 'breadcrumb' list as the elements were decoded, and thus the eid would be known as the element is created. In this instance, the element object to create could be defined by the object_hook argument. The problem with this approach is that you couldn't use a different json package, as they would not have the capability of determining the eid as they decoded. For this reason, I don't think this is a good option.
  2. Make the eid a field on the element. Then, any json package will work, and the object_hook argument that a lot of json packages seem to implement would work too.

I think option 2 opens up a lot of possibilies. The 'default' Element used by TinyDB could just be the dict as per json, or a slightly extended version with __getattr__ that allows the eid to be returned as a property. An optional Element could allow all fields to be accessed as properties. And of course, people can pass in their own objects.

There's a lot more specific implementation details I could go into, but I'll let you respond to this first.

msiemens commented 8 years ago

Just for context: what information do you want to store on the element in addition to the eid? Or do you want it to to be something other than a dict?

Regarding customizing the Element class: I think we should retain the element.eid field at the very least for reasons of backwards compatibility, but also I think it's the most elegant way of accessing an element's id without potential conflicts with a actual eid dict field. That would be an argument against option 2.

In my opinion the cleanest way to customize the Element class would have a class attribute for the TinyDB or Table instance like ELEMENT_CLASS which would be used when creating element instances. This both maintains backwards compatibility in the general case and allows for the customization you described. As a trade-off this would require one more variable lookup for every element, but it should be neglible.

What do you think?

eugene-eeo commented 8 years ago

@neRok- a side note- breadcrumb lists can be very (space) expensive. Also a simple incrementing scheme over the dictionaries might not preserve the EIDs. And we'll also have to deal with nested dictionaries, etc. (a stack is one way to go but yea, it gets complicated) since object_hook is called for every dictionary (?). So the simplest way is the current one or your alternative suggestion.

@msiemens class attributes will be the way to go, IMHO. One can subclass from Element and add their own metadata properties. If we're talking about performance, a very good way to start is constant hoisting to reduce variable/attribute look-ups:

element_class = self.element_class
for key, value in elements:
    cache[key] = element_class(value)
msiemens commented 8 years ago

If we're talking about performance, a very good way to start is constant hoisting to reduce variable/attribute look-ups:

I know, but I don't think we need to optimize this as in the usual application this won't be the bottleneck in any noticable way anyway.

neRok00 commented 8 years ago

@msiemens, I wanted to build a simple graph, so basically elements will have bi-directional links to other elements. If the link is removed from one element, then it would also need to removed from the other, and this would be performed by the elements themselves 'automatically'. With the current arrangement, each element would 'transform' from dict > Element > GraphElement. That's 3 objects that have to be created per element just to get the one I want, when really if the code were refactorred it could go straight from json to my element.

If the goal is to maintain the eid attribute, then there seems to 3 ways to do so;

  1. Determined as part of the json parsing. I really don't like this option for reasons previously mentioned, but it needs to be on the table. The problems @eugene-eeo mentions could be overcome. It seems simplejson maintains a list of tuples of (key, value) pairs, so getting the right id is not a problem. And because we know the layout of the json file is a dict of tables, which are dicts of elements, by maintaining the breadcrumb list we would know when we are creating an Element, or a nested dict inside an element.
  2. Append it to the object retrospectively, as is currently the case. At the very least, this could be improved so that instead of creating every Element everytime that StorageProxy.read() is hit, perhaps the table dict could have its __getitem__ method overriden to do an isinstance() check?
  3. Have it as a field of the element. To solve the problem of conflicts, it could stored as something like __eid__ in the elements dict?

So here's where I'm going to throw a curveball, and suggest that the use of option 2 or 3 should be determined by the Storage class, and not the database itself. In fact, the way I see it currently, Element, StorageProxy and Table are all specific to the JSONStorage class, and are in fact a hindrance to implementing other storages. So to simply put it, I feel the Storage class should be responsible for returning an Element that has an eid attribute, and how it goes about this should be determined by the storage.

Unfortunately, StorageProxy is so intertwined with the Element class, that refactoring elements becomes a much bigger task (hence I didn't dump all this in the first post, in case you weren't open to the idea). So now we need to refactor everything!

Aside from creating the elements, StorageProxy is also responsible for sending the entirety of a table as dict to the storage on updates. This is also a problem for storages that might be able to write only certain portions of the database at once. In fact, you have already had a discussion on this topic in the following issue: #90

My idea is that instead of the database/table objects telling the storage when to read and write, they should tell the storage what to do (as in get this, update this, delete this - basically a REST interface). In practice, this would basically mean integrating the tinyrecord package into the core (as it achieves basically the same goals).

I think with this refactor, the basic API functionality that exists now will be the same, but it gives so much more flexibility with storages and elements. To put it simply, the TinyDB database class should tell the storage class what to do, not how to do it - and the storage class should give back element classes (and table classes?) with certain things expected of them (ie, an eid attribute).

To be brutally honest, if you really look at what TinyDB currently does, it pretty much loads a json file for me, and writes to it on occasion. I could do that myself with a few lines of code :/ I've seen in other issues that you don't want to integrate too much into core, but I think to get a truly flexible system, it needs certain minimum capabilites.

eugene-eeo commented 8 years ago

@neRok- a little off topic here, is the graph very big? Should it be able to fit into memory? Self promotion here, if space/performance is an issue then you should look into an embedded graph database. One such solution using SQLite is graphlite.

If you can fit the entire dataset into RAM then you should look into batching up operations using transactions on TinyDB to reduce IO. At least with a reduction in IO operations we reduce object allocation. One such library for that is tinyrecord which will give you atomicity guarantees.

neRok00 commented 8 years ago

I'll definitely have to take a closer look at your project for some ideas. My end goal is to create a new geneology 'app', and naturally graphs are a great fit for family trees. I have been using a proper graph database up until recently, but I've decided to move back to something simpler - and a json file will achieve this with the added benefit that anyone can read the file without drama. Hence I started looking at this project.

Regarding the size, I don't think it's unusual for people to have gedcom files (the current text based format) in the 50MB range. These files could easily contain 1,000 people, which could have 15 facts each, and each fact could have 1 source. So right there is 31,000 records, with at least the same number of edges.

eugene-eeo commented 8 years ago

I see. I've come across such problems before concerning mapping simple JSON maps to graph data structure. The main takeaway for me is:

But yea if you're working with one single JSON object it's safe to assume that all your data should fit in memory. Another alternative for visualisation is to use graphviz. That's a better fit for graph data anyways, but yea it doesnt invalidate the need for us to look into refactoring the codebase.

msiemens commented 8 years ago

First off, I totally agree that the current design of the TinyDB internals is quite messy. To my defense: the current architecture of Tables, Elements and StorageProxys grew organically over time. And even though it's far from elegant I think so far they served their purpose well enough.

Now regarding a possible refactoring:

If the goal is to maintain the eid attribute, then there seems to 3 ways to do so; [...] Append it to the object retrospectively, as is currently the case. At the very least, this could be improved so that instead of creating every Element everytime that StorageProxy.read() is hit, perhaps the table dict could have its __getitem__ method overriden to do an isinstance() check?

As long as we don't introduce some kind of caching we have to create Element objects for every item every time the database is read from the storage. Another option would be shifting the responsibility for creating the Element objects to the storage which (1) would be a breaking change and (2) would require "smart" storages opposed to the current design where storages are rather dumb (more on that later).

Have it as a field of the element. To solve the problem of conflicts, it could stored as something like __eid__ in the elements dict?

As far as I can remember I considered this but chose to not follow this path as it basically introduces redundant information because we already store tables as {id: {element ...}, id: {element ...}}. Also, this too would be a breaking change as the way to access the element id would be an item access, not an attribute access.

In fact, the way I see it currently, Element, StorageProxy and Table are all specific to the JSONStorage class, and are in fact a hindrance to implementing other storages.

I don't see how Element, StorageProxy and Table are specific to the JSONStorage. If they are, that's not what I intended. Could you elaborate in which way(s) these classes depend on implementation details of JSONStorage?

My idea is that instead of the database/table objects telling the storage when to read and write, they should tell the storage what to do (as in get this, update this, delete this - basically a REST interface). In practice, this would basically mean integrating the tinyrecord package into the core (as it achieves basically the same goals).

As far as I can see we have to choose between two competing designs here: smart storages (your proposal) and dumb storages (the status quo). Now I'm sure there are good arguments for both approaches. The reason I like the dumb storages way is that implementing a new storage is very easy. You just have to be able to store/load a dict and you can do that in whatever way you like. In the end I don't think one design is inherently superior to the other but I think that the trade-offs taken with the dumb storages are best in terms of usability.

To be brutally honest, if you really look at what TinyDB currently does, it pretty much loads a json file for me, and writes to it on occasion. I could do that myself with a few lines of code :/ I've seen in other issues that you don't want to integrate too much into core, but I think to get a truly flexible system, it needs certain minimum capabilites.

When I wrote the initial version of TinyDB the big innovation if you will was the query interface. In the end TinyDB is just that: a dict with a handy query interface. With that in mind I'm not sure whether or not TinyDB is the best base for handling graph structures.

A quick remark on breaking changes: Some of your proposals would require breaking changes which would affect a big share of users. As the popularity of TinyDB grows (>1000 GitHub stars) I'm becoming more and more conservative in that regard. That's not to say that breaking changes are ruled out but I feel like they should bring improvements to the majority of the userbase to be considered worth the upgrade.


Now back to the original question of customizing the Element class: I still believe the most straight-orward with the least amount of breakage for the users is introducing a class attribute for the TinyDB or Table instance like ELEMENT_CLASS. It's certainly not the best solution with regard to performance but performance is quite explicitly not a design goal of TinyDB.

msiemens commented 8 years ago

a little off topic here, is the graph very big? Should it be able to fit into memory? Self promotion here, if space/performance is an issue then you should look into an embedded graph database. One such solution using SQLite is graphite.

@eugene-eeo Quick note: your link to graphlite is broken, you're missing an L ;)

neRok00 commented 8 years ago

Okay, I'll start another issue regarding dumb vs smart storages, and perhaps we'll keep this one limited to discussing elements with the presumption that they will be used with dumb storages.

I think there definitely needs to be a kwarg on the TinyDB class for changing the Element during the database initialisation.

I also think it is desirable to get rid of the '2 step' element creation (dict > element). Unfortunately you can't pass Element to the json parser via object_hook, because nested dicts would also look like elements.

But perhaps a simple class like class AttributeDict(dict): pass can be sent via object_hook. Then the StorageProxy can just set the eid to the object, rather than creating a new object, eg for key, val in iteritems(raw_data): val.eid = key.

If this procedure were used, perhaps the element kwarg on the TinyDB class should be optional? If it is passed, then a new object is created, otherwise the above procedure takes place. The downside to the above procedure is that without a specific element class specified, the elements will appear as AttributeDict, not Element, affecting any isinstance checks people might rely on. I'm not sure if that is a concern though, because people should just pass an element kwarg if they want to guarantee that.

Also regarding the Element.eid property, it can currently be changed. Maybe this should be made immutable some how? Building on the above, perhaps something as simple as the following?

class AttributeDict(dict):
    eid = property(lambda self: self._eid)

# And later...
for key, val in iteritems(raw_data): val._eid = key

An alternative solution to the '2 step' problem might be for Element to be a proxy to the dict returned by the json parser. This way the creation of each element instance would be far 'lighter'. This could look like the following (ignoring the immutable eid property for now);

class Element(object):
    def __init__(self, eid, fields):
        raise NotImplementedError

class ProxyElement(Element, SomeGoodProxyClassFromAPackage):
    def __init__(self, eid, fields):
        self.eid = eid
        self.proxy = fields

By default. ProxyElement could be used by TinyDB class, and if users want to implement their own element class, they could do it like the following;

class MyElement(Element):
    def __init__(self, eid, fields):
        self.eid = eid
        for key, val in iteritems(fields):
            setattr(self, key, val)

Another improvement I can think of for elements is to store a reference to the table on the element, so users don't end up with random elements and not know which table they belong to.

I actually think StorageProxy can be removed entirely, and the functionallity moved into the Table class. Currently, tables don't actually know their own name! So Table.__init__() could accept and use the name arg, just like StorageProxy does. Additionally, Table._read() would effectively be replaced with StorageProxy.read(), and the table would handle creating elements (the table would also pass itself to the element at this point). This could then be leveraged to implement the lazy creation of Elements previously discussed.

The same would occur with the write and purge functions of StorageProxy - roll them into the table itself.

msiemens commented 8 years ago

I think there definitely needs to be a kwarg on the TinyDB class for changing the Element during the database initialisation.

Seems like we all agree on this change. Would you feel comforable with setting up a Pull Request?

I also think it is desirable to get rid of the '2 step' element creation (dict > element). Unfortunately you can't pass Element to the json parser via object_hook, because nested dicts would also look like elements.

The downside to this is that it only works with the JSONStorage. If we'd chose that route, then every Storage which doesn't have a mechanism similar to object_hook (e.g. the MemoryStorage we provide) would have to implement it by itself which in the end would result in more or less the same iteration based approach we already have.

An alternative solution to the '2 step' problem might be for Element to be a proxy to the dict returned by the json parser. This way the creation of each element instance would be far 'lighter'. This could look like the following (ignoring the immutable eid property for now);

I don't see how that's different from what we already do as the Element class is a dict proxy with only adding the eid attribute. Or did I misunderstand you here?

I actually think StorageProxy can be removed entirely, and the functionallity moved into the Table class. Currently, tables don't actually know their own name! So Table.init() could accept and use the name arg, just like StorageProxy does. Additionally, Table._read() would effectively be replaced with StorageProxy.read(), and the table would handle creating elements (the table would also pass itself to the element at this point). This could then be leveraged to implement the lazy creation of Elements previously discussed.

Well, the reason for the existence of StorageProxy is that I first started out with the TinyDB class and then later added Tables which used the TinyDB _read/_write methods to get access to the storage. But this required the tables to always keep a reference to the master TinyDB instance. To solve this commit b79ce35addbdcd2d368e710d7149d009eafd7834 introduced the StorageProxy class which is created by the TinyDB instance and passed to the table. The StorageProxy then is responsible for extracting the table contents from the database dict and writing the data back to the right place. What I like about this approach is that it makes it hard to mess with tables in a way that would result in data being written to a wrong place.

msiemens commented 8 years ago

@neRok00 Are you still interested in customizing the Element class?

neRok00 commented 8 years ago

Afraid not, as I have no incentive to as TinyDB is not suitable in it's current guise for my use case.

msiemens commented 8 years ago

Okay, I'll close this issue then. If anyone wants a customizable Element class, I'd gladly accept pull requests.