plone / guillotina

Python AsyncIO data API to manage billions of resources
https://guillotina.readthedocs.io/en/latest/
Other
188 stars 51 forks source link

Postgresql driver improvements #991

Open jordic opened 4 years ago

jordic commented 4 years ago

Serve this as a discussion for some proposals to the postgresql datastore driver:

  1. After some profiling, we found that most of the time is spend on traversal, (it's clear, it needs to fetch every segment on the path, do the security assertions for it, and continue for the next one. Usually, frequent paths are on internal cache, but, thinking a bit on it, it could be good idea to just retreive it in a single round trip (to the db) and do the security assertions then. After playing a bit with our schema, found a way, of retreiving paths in an efficient way:

https://gist.github.com/jordic/47502dd92e4b4b32a5b7751bb60533cc

  1. Make IWriter and IReader (for the datastore) serializable without using pickles. That's the only thing I personally don't like around guillotina, using a python only marshaller, makes things less open.

  2. A bit related to the first one, extract security information to a custom column. (could be a json field), something like this will allow us to improve the speed of traversal, and also, could be used on the postgres search feature.

  3. Use the pg_notify functionality to trigger the invalidation notices on objects, and propagate them to internal instance caches.

masipcat commented 4 years ago

Use the pg_notify functionality to trigger the invalidation notices on objects, and propagate them to internal instance caches.

I think this should be implemented outside the pg storage, like the RedisDriver https://github.com/plone/guillotina/blob/master/guillotina/contrib/redis/driver.py. I can't see where's defined the "IPubSubDriver" interface, but should be something like:

async def subscribe(self, channel_name: str, rid: str, callback: Callable[[str], None]):
async def unsubscribe(self, channel_name: str, req_id: str):
async def publish(self, channel_name: str, rid: str, data: Any):
vangheem commented 4 years ago
  1. we could consider parent objects has "shadow" like objects in this kind of traversal scheme. If we are serializing permissions to pg somehow, maybe the query to get all parents should not get the "pickle" data and just the zoid, id, parent_id and security. It would break __parent__ APIs so we'd need to change how parent pointers were used.

  2. I agree. We could experiment with others. (there is even json pickle if we want to be lazy)

  3. agreed. Should it be column? Or some sort of table relation? Should we just bite the bullet and just require pg at this point and not try to be "pluggable"?

  4. pg_notify is neat that we don't need another system. I guess this can be "plugged" into like the other pub/sub implementations. It would be nice to have a pubsub implementation that was using pg.

masipcat commented 4 years ago

About the first point @jordic , I suggest to start adding a method to the storage interface to load the objects given a path. With this change, we can reimplement navigate_to in constant-time. When we validate the performance improvement maybe we can change the traversal too