neurocracy / omnipedia

The Omnipedia web app.
https://omnipedia.app
GNU General Public License v2.0

The quest for a more intelligent queue system that supports prioritizing, parallel processing, etc. #3

Open Ambient-Impact opened 1 year ago

Ambient-Impact commented 1 year ago

We currently use a relatively simple ReactPHP script that just runs several queues on App Platform. That works okay for things that aren't time-sensitive, but it will become more of a problem once we need to do anything time-sensitive via the queue, for example sending email asynchronously like in #2.

Additionally, we'll probably want to limit how many items each queue run processes and how long it runs. This is not as simple as it sounds: the Warmer module's Drush command (which allows running specific Warmer plug-ins) does not provide options for time and item limits. The Drush queue:run command does provide item and time limits, but the Warmer module only seems to expose a single warmer queue, so if items have already been added by other Warmer plug-ins, we have no way to prioritize the more urgent ones and have to run them all.
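To make the item and time limit idea concrete, here's a rough sketch in Python of planning bounded `drush queue:run` invocations, most urgent queue first. The `QueueRun` class, the priorities, and the queue names `mail` and `warmer` are all hypothetical placeholders; only `queue:run` and its `--items-limit` / `--time-limit` options come from Drush. Note that this does nothing about the single Warmer queue problem described above; it only orders and bounds whole queues.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class QueueRun:
    """One bounded run of a Drupal queue. All values are illustrative."""
    name: str         # Drupal queue machine name (hypothetical examples below)
    priority: int     # lower number = more urgent
    items_limit: int  # maximum number of items to process in this run
    time_limit: int   # maximum number of seconds to spend in this run


def build_command(run: QueueRun, drush: str = "drush") -> list[str]:
    """Build the argv for a single bounded `drush queue:run` invocation."""
    return [
        drush,
        "queue:run",
        run.name,
        f"--items-limit={run.items_limit}",
        f"--time-limit={run.time_limit}",
    ]


def plan(runs: list[QueueRun]) -> list[list[str]]:
    """Return one command per run, ordered most urgent first."""
    return [build_command(run) for run in sorted(runs, key=lambda r: r.priority)]


if __name__ == "__main__":
    # Hypothetical queues: email is urgent and small, warming is bulk work.
    for command in plan([
        QueueRun("warmer", priority=10, items_limit=50, time_limit=120),
        QueueRun("mail", priority=1, items_limit=20, time_limit=30),
    ]):
        print(" ".join(command))
```

The sketch only builds the commands rather than executing them, so whatever schedules the runs (ReactPHP, cron, or something else) stays in charge of actually invoking Drush.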

Symfony Messenger

The current goal is to port everything to Symfony Messenger + Drupal: Realtime Queues and Cron module which is maintained by @dpi / dpi on Drupal.org; he's written a series of blog posts on implementing the Drupal module and how to make use of it, with the following being especially relevant to us:

As an added bonus, it seems to support taking over processing of Drupal core's @QueueWorker so that we can run various queues (such as the Warmer module's) without them having to have explicit Symfony Messenger integration.

Additional links

ReactPHP

Despite its name, this has nothing to do with the React JavaScript framework; it's a PHP framework for running longer tasks asynchronously and in parallel, and it supports very useful things like Promises. Matt Glaman has been using this to run multiple, parallel background tasks invoked via Drush on DigitalOcean's App Platform (!):

Other ReactPHP queues

Advanced Queue module

This seems to have a nice UX compared to the standard Drupal queue stuff and does support prioritizing queues, can be invoked via Drush commands, and is used by Matt Glaman in the posts above.

ergonlogic commented 1 year ago

TL;DR distributed systems are harder than they look at first.

I have some prior experience with using PHP-based queuing in Aegir. Bear in mind this was years ago, on older versions of PHP, so YMMV.

That said, PHP is (or at least was) bad at long-running processes. Most PHP applications are web-based, and so follow a request-response model, where the PHP process is invoked by a CGI gateway (eg. php-fpm). In this model, the process usually dies once it has sent its response, and each subsequent request spawns a fresh process. As a result, any system resources used by the process are freed-up on an ongoing basis, by the very nature of this architecture.

Long-running processes, such as daemons, don't benefit from this resource release mechanism. As a result it is (or was) relatively easy for a PHP-based daemon to suffer from memory leaks, run out of file descriptors, etc. In Aegir's native queued, we had to regularly fork a new process, leading to unfortunate log messages such as "waiting for children to die." :roll_eyes:

Because the tasks Aegir was running were also very resource intensive (eg. backups streaming site files and a database dump into a gzipped tarball), we'd find ourselves suffering from I/O contention that sometimes led to timeouts on the front-end. In fact, there was a period where we'd regularly crash the server by trying to run multiple such processes in parallel. We ended up having to ensure that such tasks were processed serially, so as not to overwhelm the server. We also ran the queue daemon under nice and ionice, so that the web server's processes would get priority access to memory and CPU cycles.

This madness eventually led to Hosting Skynet, which implemented a simple daemon in Python that polled the task queue (a database table), and then ran the appropriate Drush command to trigger the task in a sub-process. We still ran the daemon under nice/ionice, but Python doesn't suffer from memory leaks and recycles its file descriptors.
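For illustration only (this is not Skynet's actual code), the general shape of "take tasks off a queue and run each one in a sub-process, strictly one at a time" might look something like the Python sketch below. The in-memory `deque` stands in for the database table, `run_serially` is a hypothetical name, and the optional `prefix` is where something like `nice`/`ionice` would go; the specific priority values shown in the comment are an assumption.

```python
import subprocess
import sys
from collections import deque

# Assumed low-priority wrapper; adjust or drop if nice/ionice are unavailable.
LOW_PRIORITY_PREFIX = ["nice", "-n", "19", "ionice", "-c", "3"]


def run_serially(tasks, prefix=()):
    """Drain `tasks`, running each command in its own sub-process.

    Tasks run one after another, never in parallel, so a heavy task cannot
    compete with another heavy task for I/O. Returns the exit codes in order.
    """
    exit_codes = []
    while tasks:
        command = list(prefix) + list(tasks.popleft())
        # Each task gets a fresh process, so whatever memory or file
        # descriptors it used are released when it exits.
        completed = subprocess.run(command, check=False)
        exit_codes.append(completed.returncode)
    return exit_codes


if __name__ == "__main__":
    # Stand-in tasks; a real daemon would build Drush commands from rows in
    # its task table and call run_serially() in a loop with a sleep between polls.
    pending = deque([
        [sys.executable, "-c", "print('task one')"],
        [sys.executable, "-c", "print('task two')"],
    ])
    print(run_serially(pending))
```

Calling it as `run_serially(pending, prefix=LOW_PRIORITY_PREFIX)` would run each task at low CPU and I/O priority, on systems where both tools are installed.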

The success of Skynet eventually led to the full re-architecting of the queue system in Aegir5 on top of Celery/RabbitMQ. We re-used that foundation in Rugged's multi-worker package release workflow. While this system is working quite well, it has been challenging to test efficiently without running into race conditions, etc.

Anyway, most of the above probably isn't relevant to Neurocracy, per se. But maybe it can act as a cautionary tale about some of the challenges with distributed systems. However, William Brander does a much better job than I ever could in this video: Top 5 techniques for building the worst microservice system ever

Ambient-Impact commented 1 year ago

Oh yeah, even three or four years ago, PHP was not practical for running daemons and other long-running processes. I suspect it's improved a lot over the last few years in terms of memory leaks and other issues, but it must have been a nightmare in the Aegir3 days. You can get a sense of where it's at nowadays from the links in the issue summary to ReactPHP and Matt Glaman's blog posts (he's at Acquia).

Here's the last 24 hours of our background process worker, which runs in a separate container from the web containers. Other than some dumb issues I've caused by naively scheduling queue runs without checking whether one is already in progress, so that it ends up running multiple expensive queues in parallel (very likely where you see the CPU and memory graphs spike really badly), it actually seems to run okay for our content warming and cron:

Screenshot (2023-07-16): omnipedia-production on DigitalOcean App Platform

That kind of tripping over itself is what I want to fix today so it only ever runs one expensive queue at a time because whoops.
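One common way to get "only one expensive run at a time" is a non-blocking lock file that every scheduled run has to grab first. A minimal Python sketch of that idea follows; it is not the actual fix, the `single_run_lock` name and the lock path are made up, and it assumes a Unix-like system with all runs scheduled inside the same container so they see the same file.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_run_lock(path):
    """Yield True if we got the lock, False if another run already holds it."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes this fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical lock path; any path writable by the worker would do.
    with single_run_lock("/tmp/expensive-queue.lock") as acquired:
        if acquired:
            print("no other run in progress, safe to start the expensive queue")
        else:
            print("another run is in progress, skipping this one")
```

Because the kernel drops the lock when the holding process exits, a crashed run does not leave a stale lock behind the way a plain "does the file exist" check would.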

Ambient-Impact commented 1 year ago

Alright, so I kind of just went all out on this and hopefully solved a number of issues. I'd like to eventually refactor this into something more reusable and configurable.