laravel / horizon

Dashboard and code-driven configuration for Laravel queues.
https://laravel.com/docs/horizon
MIT License

Database connections doubling when upgrading to 5.18 #1480

Closed SamuelWillis closed 1 month ago

SamuelWillis commented 1 month ago

Horizon Version

>= 5.18

Laravel Version

10.48.16

PHP Version

8.3.7

Redis Driver

Predis

Redis Version

Predis 2.2 & Redis 6.2.14

Database Driver & Version

No response

Description

Upgrading to Horizon 5.18 or any later version doubles the number of database connections and memory usage.

Reverting Horizon to version 5.17 brings the number of database connections and memory usage back down.

After looking through the changelog and the associated changes, I have been unable to find anything that could cause this, aside from this autoscaling tweak, which could be causing odd scaling behavior.

Here is a snippet from our configuration:

            'default-worker' => [
                'connection' => 'redis',
                'queue' => [
                    'default',
                    'quick',
                    'poll',
                    'quick-embedding',
                    'poll-embedding',
                    'assets',
                    'notification',
                    'metrics',
                ],
                'balance' => 'auto',
                'minProcesses' =>  12,
                'maxProcesses' =>  18,
                'tries' => 1,
            ],
            // This config would be impacted by the commit linked above
            'metrics-worker' => [
                'connection' => 'redis',
                'queue' => [
                    'metrics',
                    'quick',
                    'default',
                    'poll',
                    'assets',
                    'notification',
                ],
                'balance' => 'auto',
                'minProcesses' => 1,
                'maxProcesses' => 2,
                'tries' => 5,
            ],

For a bit more context, we are running Horizon as a k8s pod with multiple replicas.

Steps To Reproduce

I haven't been able to nail down concrete reproduction steps, but we have seen this consistently when attempting to upgrade from version 5.17 to any version at or above 5.18.

driesvints commented 1 month ago

cc @PrinsFrank, do you perhaps know what's going on here?

taylorotwell commented 1 month ago

We would need concrete confirmation of what change causes it.

PrinsFrank commented 1 month ago

I think I see the issue here:

This is not clearly defined anywhere, but we experienced the same problem running on k8s with many queues. I'll add some documentation for this.

There was previously an issue where, if the number of queues multiplied by minProcesses was higher than the maxProcesses key, the autoscaler would scale a queue down to 0 when it was silent for a while and never scale it back up. See #1289

Now, because you have 8 queues and minProcesses is set to 12, the configuration says it should start 96 processes. But your maxProcesses is set to 18, so instead it shares those 18 processes equally across the queues, resulting in 2 processes per queue, whereas in your previous scenario each queue would scale back to either 1 or 0 depending on the load.
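To make the arithmetic concrete, here is a rough sketch of that allocation (illustration only, not Horizon's actual autoscaler code; the values are taken from the 'default-worker' supervisor above):

    <?php
    // Illustration only: how the per-queue process count ends up at 2.
    $queues       = 8;   // queues listed for the 'default-worker' supervisor
    $minProcesses = 12;  // per-queue minimum from the config
    $maxProcesses = 18;  // supervisor-wide ceiling from the config

    $requested = $queues * $minProcesses;         // 8 * 12 = 96 processes requested
    $allowed   = min($requested, $maxProcesses);  // capped at 18
    $perQueue  = intdiv($allowed, $queues);       // roughly 2 processes kept per queue

    echo "requested={$requested}, allowed={$allowed}, perQueue={$perQueue}\n";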

The memory footprint per process increases in proportion to how many non-deferred service providers and singletons you have, so an increase in services will increase the memory consumption.

What will solve your issue here: decrease 'minProcesses' to a sensible value. If you want to always have 1 process per queue, set minProcesses to 1. Or, if you want to scale more efficiently based on queue load and available resources, group queues together based on job runtime and priority. This is the road we went down: instead of having n queues where most are always empty but still reserve a process, make sure that all the high-priority jobs end up in one queue with a lot of scalability, and put the other jobs in a low-priority queue.
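As a sketch of both approaches (the supervisor names, queue groupings, and numbers below are illustrative assumptions, not a drop-in replacement for the reporter's config):

    // Option 1: keep one supervisor, but only reserve 1 process per queue at rest.
    'default-worker' => [
        'connection'   => 'redis',
        'queue'        => [/* same eight queues as above */],
        'balance'      => 'auto',
        'minProcesses' => 1,   // 1 process per queue when idle
        'maxProcesses' => 18,  // shared ceiling across all queues
        'tries'        => 1,
    ],

    // Option 2: group queues by priority so idle queues don't reserve processes.
    'high-priority-worker' => [
        'connection'   => 'redis',
        'queue'        => ['quick', 'default'],
        'balance'      => 'auto',
        'minProcesses' => 1,
        'maxProcesses' => 14,
        'tries'        => 1,
    ],
    'low-priority-worker' => [
        'connection'   => 'redis',
        'queue'        => ['poll', 'assets', 'notification', 'metrics'],
        'balance'      => 'auto',
        'minProcesses' => 1,
        'maxProcesses' => 4,
        'tries'        => 1,
    ],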

If you want to scale even further: the JSON endpoints return all of this info in a parsable format. You could run one supervisor per pod, parse the output of the JSON endpoint, and spin up further pods when the number of actual processes hits the maximum.
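A rough sketch of that idea follows; the endpoint path (/horizon/api/masters here) and the payload shape are assumptions that should be verified against your Horizon version before relying on them:

    <?php
    // Assumed endpoint and payload shape; verify against your Horizon version.
    $masters = json_decode(
        file_get_contents('https://your-app.example/horizon/api/masters'),
        true
    );

    $maxProcesses = 18; // should match the supervisor's maxProcesses setting

    foreach ($masters as $master) {
        foreach ($master['supervisors'] ?? [] as $supervisor) {
            // Assumes 'processes' is a map of "connection:queue" => process count.
            $running = array_sum($supervisor['processes'] ?? []);

            if ($running >= $maxProcesses) {
                // Signal the orchestrator (e.g. an HPA custom metric or a
                // kubectl scale call) that another Horizon pod is needed.
                echo "{$supervisor['name']} saturated: {$running}/{$maxProcesses}\n";
            }
        }
    }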

Let me know if I can clarify anything else!

SamuelWillis commented 1 month ago

Thank you for the great explanation @PrinsFrank.

I was under the impression that minProcesses and maxProcesses were totals, defining the lower and upper process-count limits for the whole worker.

We've already grouped our queues by priority, which is nice, so I think all we will need to do here is adjust our configuration to get the right amount of scaling for each worker.

driesvints commented 1 month ago

Thanks all for your help on this one 👍