jazzband / django-ddp

Django/PostgreSQL implementation of the Meteor server.
MIT License
167 stars 29 forks source link

MySQL support #24

Closed fmalina closed 8 years ago

fmalina commented 8 years ago

Please collect ideas and links under this issue here. I've heard of Facebook's fork of MySQL with listen/notify support. Any links? PostgreSQL docs say that listen/notify are not standard SQL. Is there a way to implement live in a database independent manner.

tysonclugg commented 8 years ago

The bigger part of the issue is not that PostgreSQL provides LISTEN and NOTIFY, but that those SQL extensions are backed by psycopg2 which provides asynchronous connection support allowing django-ddp to avoid polling PostgreSQL for changes.

I've just had a very quick glance through https://dev.mysql.com/doc/connector-python/en/ but I couldn't find any options for async support.

fmalina commented 8 years ago

MySQL Connector is not my adapter of choice, that would be mysqlclient https://pypi.python.org/pypi/mysqlclient (a fork of MySQL-python, C ext) or PyMySQL https://pypi.python.org/pypi/PyMySQL (pure Python).

MySQL has rather good replication and MongoDB replication log (OpLog tailing) is what Meteor uses for livequery. Clean solution will likely come from understanding what the replication mechanisms have in common and where they differ.

Here's a meteor ticket regarding SQL support https://trello.com/c/Gf6YxFp2/42-sql-support, it's their biggest one. 2nd one with most votes has only half as much.

tysonclugg commented 8 years ago

tl;dr: Issue closed until MySQL bug #19597: listen, unlisten, notify is fixed.

The bigger part of this issue is that PEP 249 -- Python Database API Specification v2.0 doesn't even mention the work "asynchronous". More specifically, Python doesn't specify how database drivers should provide asynchronous support as part of PEP-249.

But here's the thing: As far as I know, there is no standard database API for Python which supports asynchronous operation. And it seems like there should be one. Maybe it's time for PEP-249 to be extended for an asynchronous API. Otherwise every database implementor is going to end up doing their own thing, and it sounds like there is a need for this sort of thing.

Now pay attention - the next paragraph explains the core mechanics of how django-ddp uses PostgreSQL asynchronous notification and psycopg2 asynchronous support to dispatch and receive data between front-end web servers

I've managed to create django-ddp because psycopg2 provides asynchronous support, and luckily Django has been using psycopg2 via the django.db.backends.postgresql_psycopg2 database engine for a very long time. First we make an asynchronous connection to PostgreSQL using psycopg2 and subsequently use the connection to execute the LISTEN command (non-standard SQL) and then wait for data to arrive which is sent via the NOTIFY command (more non-standard SQL).

As all of the psycopg2 asynchronous support and the LISTEN and NOTIFY SQL commands are non-standard (no async spec in the Python Database API Specification and no standard SQL for asynchronous event dispatch) I've been fortunate to get this far with PostgreSQL. It doesn't really bother me that django-ddp is PostgreSQL only though -- I've been using PostgreSQL since version 6.4 shortly after it was released in late 1998 (I started using MySQL about 3 years later). I started using Django (with PostgreSQL and psycopg2) in January 2007.

On a related side note, you've heard that PostgreSQL is Web Scale right? This is a talk by the first DBA for Skype, he really knows his stuff (not so good at putting together presentations, but the technical content is amazing). Jump to 10:55 in the video for the performance comparison between memcache, Redis, PostgreSQL and MongoDB.

All of the above doesn't really help for anything other than PostgreSQL though, so now let's look at what we need for MySQL to achieve this in an efficient manner with the smallest number of moving parts that we can get away with. To make myself clear, I don't want to introduce any more components (eg: crossbar.io, reddis, ...) - django-ddp is designed for massive concurrency when scaling out to serve thousands of clients, while remaining relatively simple and robust.

First, we'll need the equivalent of LISTEN and NOTIFY in the MySQL server itself. MySQL has a feature request for exactly this in MySQL bug #19597: listen, unlisten, notify which was opened in May 2006 (just over 10 years ago). Given the current situation, I'd hazard a guess that MySQL will never get LISTEN and NOTIFY -- their developers are probably too distracted by newer, shinier feature requests. :disappointed:

OK, given that MySQL doesn't have LISTEN/NOTIFY, maybe your suggestion of "tailing the oplog" is the only way we're going to get push notifications after transactions have been committed and changes are written to disk. Aesop has a nice description of how the "MySQL change event producer" works which seems quite sane, but sadly fails to be consistently realtime due to limitations of the log mining approach:

The log mining approach has limitations if data is distributed across database tables (when updates are not part of a single transaction) where-in it is hard to correlate multiple updates to a single change. It also has limitations when data is distributed across types of data stores - for e.g. between an RDBMS and a Document database. Aesop addresses this problem by implementing a "Pull" producer that uses an application-provided "Iterator" API to periodically scan the entire datastore and detect changes between scan cycles. This implementation is based off Netflix Zeno.

So who cares if notifications don't go out straight away because of some weird corner cases, can't we just press on regardless? Well we could, but a few problems arise:

  1. The corner cases are actually very common. For example, any Django model using ManyToManyField will cause data to be distributed across database tables.
  2. The work-around of running a "Pull" producer means that some changes will be missed. And sooner or later someone will notice.
  3. The pull producer will run into issues when databases get big, as it will consume more resources and take longer amounts of time to traverse all the tables when checking for changes. Even the base load that the constant polling puts on the server and network infrastructure makes me cringe - those CPU cycles and IO operations would be better used running your application rather than being wasted. This wastage not only negatively affects our application, but it has a very real impact on our planet -- wasted energy and resources have a direct impact on global warming!

I don't want to disparage you, I'd love if MySQL support could be added to django-ddp in a manner that doesn't violate what I'm trying to achieve -- a reactive, realtime, relational and robust web service backend that is also responsible with regard to the environment.

I'm closing this issue with wontfix until the situation changes (eg: progress on LISTEN/NOTIFY support) - you're welcome to continue the discussion with further comments as you see fit -- who knows, maybe you can convince me to re-open this...