newtdb / db

Newt DB is a Python object-oriented database with JSONB-based access and search in PostgreSQL
http://www.newtdb.org
MIT License

Relational sets #14

Open jimfulton opened 7 years ago

jimfulton commented 7 years ago

Lean on PostgreSQL to implement sets.

Objects grow _newt__parents__, an array of containers. When a container fetches its contents, it does so by querying for objects that have it as one of their parents. Of course, the GC has to be aware of this: an object with no references to it is still non-garbage if any of its parents is non-garbage.
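To make the GC implication concrete, here is a minimal sketch (hypothetical, not Newt's actual GC code) of reachability where an object is live either through ordinary references or through the _newt__parents__ relation, i.e. a child is live whenever any of its containers is live:

```python
# Sketch of GC reachability with parent-based containment.
# An oid is live if a root reaches it via normal references OR
# via containment (a live container keeps its children alive).

def live_objects(roots, refs, parents):
    """roots: set of root oids.
    refs: oid -> set of oids it references directly.
    parents: oid -> set of container oids (the _newt__parents__ array)."""
    # Invert the parents relation: container oid -> its children.
    children = {}
    for child, containers in parents.items():
        for c in containers:
            children.setdefault(c, set()).add(child)
    live = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in live:
            continue
        live.add(oid)
        stack.extend(refs.get(oid, ()))      # ordinary references
        stack.extend(children.get(oid, ()))  # contained objects
    return live
```

So an object with no inbound references survives collection as long as one of its parents survives; an object whose only parents are garbage is itself garbage.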

Containers "store" their contents through these references. Containers have a length. Container conflicts are always resolvable, because they behave like lengths.

A variation on this is approximately ordered collections. A collection has min and max positions that behave like zope.minmax values. When an object is added at the front, the min position is decremented and used as the object's position; similarly at the back, with the max position incremented. The position becomes part of the parent pointers. Positions are not unique.
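A minimal in-memory sketch of that scheme (the class and its list of (position, oid) pairs are stand-ins for what would really be position-bearing parent pointers queried with an ORDER BY):

```python
# Approximately ordered collection: min/max position counters claim the
# next position outward on each front/back insert. Positions need not
# be unique; iteration just sorts by position.

class ApproxOrdered:
    def __init__(self):
        self.min = 0        # would resolve conflicts like zope.minmax.Minimum
        self.max = 0        # would resolve conflicts like zope.minmax.Maximum
        self._items = []    # stand-in for (position, oid) parent pointers

    def add_front(self, obj):
        self.min -= 1
        self._items.append((self.min, obj))

    def add_back(self, obj):
        self.max += 1
        self._items.append((self.max, obj))

    def __iter__(self):
        # In Newt this would be an ORDER BY over the position column.
        return iter(obj for _, obj in sorted(self._items, key=lambda p: p[0]))
```

Because min and max resolve like zope.minmax values, concurrent front and back inserts don't conflict; two concurrent inserts at the same end may share a position, which is why the ordering is only approximate.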

jimfulton commented 7 years ago

I guess I should mention what itches this is scratching.

jamadden commented 7 years ago

Half (or less) baked idea: I wonder if this might just be a fundamentally differently managed type of object. Instead of treating it as a Python object with a pickle and this weird ( 😄 ) attribute and indexing behaviour, what if it was treated more like a SQLAlchemy-style object backed by a table?

Container OID | Contained Object OID
--------------|---------------------

A Connection subclass would recognize a Persistent subclass as this type of object and just go through a different code path (direct SQL) when it came time to load and save it. A sufficiently smart implementation could treat iteration as a series of SQL queries over a very large table if needed. Ordering and uniqueness would be determined by the class of the object and hence the table layout and indexes it was stored in. Interaction with the persistent object cache would be the trickiest part, but that could be seen as a benefit.
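The mapping-table idea can be sketched with sqlite3 standing in for PostgreSQL (table and function names here are illustrative, not a proposed API): containment is rows in a two-column table, loading is a query, and uniqueness comes from the table's indexes.

```python
# Table-backed container sketch, using sqlite3 as a stand-in for
# PostgreSQL. Containment is rows in (container_oid, contained_oid);
# iteration is a query; the primary key enforces uniqueness.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE containment (
        container_oid INTEGER,
        contained_oid INTEGER,
        PRIMARY KEY (container_oid, contained_oid)
    )
""")

def add(container, contained):
    conn.execute("INSERT INTO containment VALUES (?, ?)",
                 (container, contained))

def contents(container):
    # "Loading" the container is just a query; a huge container could
    # iterate the cursor in batches instead of materializing a list.
    return [row[0] for row in conn.execute(
        "SELECT contained_oid FROM containment WHERE container_oid = ? "
        "ORDER BY contained_oid", (container,))]

add(1, 42)
add(1, 43)
add(2, 42)   # an object can live in more than one container
```

Ordering here comes from the ORDER BY clause; a different container class could map to a table with a position column and different indexes.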

jimfulton commented 7 years ago

On Thu, Mar 16, 2017 at 10:46 AM, Jason Madden notifications@github.com wrote:

Half (or less) baked idea: I wonder if this might just be a fundamentally differently managed type of object. Instead of treating it as a Python object with a pickle and this weird ( 😄 ) attribute and indexing behaviour, what if it was treated more like a SQLAlchemy-style object backed by a table?

That's more or less what I'm suggesting. The mention of the length and min and max were decoys to an extent, but a key thing I'd want to keep is invalidation. The internal implementation would get items via SQL queries. Results would be cached in _v_ (volatile) attributes, so when the container was modified we'd know to refetch data. There are lots of variations on this theme.

BTW, I recently realized that the Newt search APIs should grow options to return the living, not just ghosts, which would mitigate the lack of prefetch in RelStorage in many cases.

Container OID | Contained Object OID
--------------|---------------------

A Connection subclass would recognize a Persistent subclass as this type of object and just go through a different code path (direct SQL) when it came time to load and save it. A sufficiently smart implementation could treat iteration as a series of SQL queries over a very large table if needed. Ordering and uniqueness would be determined by the class of the object and hence the table layout and indexes it was stored in. Interaction with the persistent object cache would be the trickiest part, but that could be seen as a benefit.

I'm not sure I follow this, but I think this can all be managed by the container object as easily.

Perhaps I'm missing some benefit of handling this at the connection level.

pauleveritt commented 7 years ago

That point about returning living instead of ghosts would be useful for our project. I don't know whether you mean building them from the JSONB or having the pickles come back in the result. Either way, it would be nice to avoid potentially 20 more SQL requests to get the data needed to fill a query "batch".

Apologies if this is thread-jacking, but one complaint about SQL-backed traversal is a different SQL query for each hop in the URL. Various schemes (e.g. the Kotti project in Pyramid) have custom traversers which do magical SQLAlchemy stuff to generate only one query. Is this problem a subclass of anything this ticket is scratching? I doubt it... I suspect it's a different topic.

jimfulton commented 7 years ago

On Thu, Mar 16, 2017 at 11:32 AM, Paul Everitt notifications@github.com wrote:

That point about returning living instead of ghosts would be useful for our project. I don't know if you necessarily mean from the JSONB or having the pickles come back in the result.

Having full pickles come back (optionally).

Either way, it would be nice to avoid potentially 20 more SQL requests to get the data needed to fill a query "batch".

Yup, although I anticipate that implementing prefetch for RelStorage would be a win in the case where multiple objects are being prefetched.

An advantage of prefetch over what I mention here is that prefetch can be smart enough to only fetch objects it doesn't have state for.
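That advantage can be sketched as follows (a hypothetical helper, not RelStorage's actual prefetch API): only oids whose state isn't already cached need a round trip, and they are fetched in one batch rather than one query each.

```python
# Cache-aware prefetch sketch: skip oids with cached state, fetch the
# rest in a single batched query, then serve everything from the cache.

def prefetch(oids, cache, load_batch):
    """oids: oids about to be used.
    cache: oid -> state already held (e.g. the pickle cache).
    load_batch: callable fetching many oids' states in one query."""
    missing = [oid for oid in oids if oid not in cache]
    if missing:
        cache.update(load_batch(missing))
    return [cache[oid] for oid in oids]
```

So for a batch of 20 results where 15 states are already cached, this issues one query for the remaining 5 instead of 20 individual loads.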

Apologies if this is thread-jacking, but one complaint about SQL-backed traversal is a different SQL query for each hop in the URL. Various schemes (e.g. the Kotti project in Pyramid) have custom traversers which do magical SQLAlchemy stuff to generate only one query. Is this problem a subclass of anything this ticket is scratching? I doubt it... I suspect it's a different topic.

Yes, this is off topic. But I started it.

jamadden commented 7 years ago

I'm not sure I follow this, but I think this can all be managed by the container object as easily.

Perhaps I'm missing some benefit of handling this at the connection level.

In practice there may not be much of a difference. It just seemed like it might be less...invasive? or a better separation of concerns?...to have the connection handle this more directly, since it already knows about the pickle cache and has the database connection.