laravel / horizon

Dashboard and code-driven configuration for Laravel queues.
https://laravel.com/docs/horizon
MIT License

Jobs seem to be picked up by multiple workers #315

Closed · azrulamir closed this issue 5 years ago

azrulamir commented 6 years ago

I am running Horizon in a horizontally scaled setup, with multiple servers running the same instance. I found that once in a while a job is processed multiple times and somehow passes my constraint checking.

I suspect this is caused by the job being processed by multiple workers at the same time, which is how it manages to pass the constraint check I have in the job's handle() function.

So my question is: is Horizon fully compatible with a horizontally scaled setup?

halaei commented 6 years ago

As far as I know, it is impossible for a job to be picked up by 2 workers at the same time. There is a chance that a job takes longer than config('queue.connections.redis.retry_after') seconds and gets retried by another worker.

> So my question is: is Horizon fully compatible with a horizontally scaled setup?

I don't know, but can you describe your setup? Does each instance have its own Redis server? If so, how are they connected to each other?

azrulamir commented 6 years ago

> There is a chance that a job takes longer than config('queue.connections.redis.retry_after') seconds and gets retried by another worker.

The recent jobs section in the Horizon dashboard doesn't show any jobs taking long to process; all of them run in under 1 second. After reading issue #158, I increased the retry_after setting to 900, but the issue is still there.

> I don't know, but can you describe your setup? Does each instance have its own Redis server? If so, how are they connected to each other?

Each instance connects to a single, central redis-server instance. I can see from the Horizon dashboard that it detects all of the running instances.

My application is a simple voting API that collects vote requests from clients and processes them via the Laravel queue and Horizon. Each job has validation logic inside its handle() function, but with the duplicates happening, a single request is somehow processed multiple times simultaneously. I'd appreciate your help.

Donny5300 commented 6 years ago

Did you configure the settings correctly? For example:

'environments' => [
    'local' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue'      => ['default'],
            'balance'    => false,
            'processes'  => 3,
            'tries'      => 3,
        ],
    ],
],

https://laravel.com/docs/5.6/horizon

See the Configuration section.

azrulamir commented 6 years ago

Below is my current configuration:

'production' => [
    'Voting-Engine' => [
        'connection' => 'redis',
        'queue'      => ['default'],
        'balance'    => 'false',
        'processes'  => 1,
        'tries'      => 1,
    ],
],

halaei commented 6 years ago

So you have multiple PHP servers connecting to a single Redis server. Here are 2 possible causes that come to mind:

  1. The servers must be synced in time, e.g. via NTP. Otherwise, jobs reserved by one server might be considered expired by another.
  2. If the servers are synced, there is a chance that you are detecting duplicates for a reason other than jobs being picked up by multiple workers. For example, maybe your database transactions are not at the correct isolation level and you push duplicate jobs onto the queue.

Other than that, it seems to me that picking up a job twice is impossible.

toopay commented 6 years ago

@azrulamir What is your retry_after value in config/queue.php? If you set a big number, you may also need to set timeout in config/horizon.php to just a few seconds below that number.

I have proposed exposing the timeout value, along with middleware, in #319.
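
For reference, a minimal sketch of that relationship, with illustrative values (the supervisor name and numbers are assumptions, not from this thread):

// config/queue.php
'connections' => [
    'redis' => [
        'driver'     => 'redis',
        'connection' => 'default',
        'queue'      => 'default',
        // A job reserved longer than this many seconds is considered
        // expired and may be picked up again by another worker.
        'retry_after' => 600,
    ],
],

// config/horizon.php
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue'      => ['default'],
            // Keep the worker timeout a few seconds below retry_after,
            // so a slow job is killed before it can be re-reserved.
            'timeout' => 590,
        ],
    ],
],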

emielmolenaar commented 6 years ago

Are you sure you are not simply dispatching the job multiple times, for example by events or observers?

andreladocruz commented 6 years ago

Why not set retry_after in the horizon.php configuration file?

This is my configuration today:

'production' => [
    'default' => [
        'connection' => 'redis',
        'queue'      => ['default'],
        'balance'    => 'simple',
        'processes'  => 1,
        'tries'      => 3,
    ],
    'marketplace' => [
        'connection' => 'redis',
        'queue'      => ['marketplace'],
        'balance'    => 'simple',
        'processes'  => 1,
        'tries'      => 3,
    ],
    'autoresponder' => [
        'connection' => 'redis',
        'queue'      => ['autoresponder'],
        'balance'    => 'simple',
        'processes'  => 1,
        'tries'      => 3,
    ],
    'clicks' => [
        'connection' => 'redis',
        'queue'      => ['clicks'],
        'balance'    => 'simple',
        'processes'  => 2,
        'tries'      => 3,
    ],
],

The retry_after should be configurable by queue, right?

driesvints commented 5 years ago

As explained by others above, the problem is probably multiple apps using the same Redis queue. It should be impossible for Horizon to process the same job twice. If you still believe this is a problem, please provide more details and preferably the configuration of your app(s).

dpde commented 5 years ago

What about having multiple worker servers that process the same queue? I believe I have observed jobs being processed multiple times.

My setup: 1 webserver, 1 redis server, 3 Worker servers that process the same queue.

sergiq commented 5 years ago

Maybe this is closed, but since I started using Horizon (a week ago), from time to time some jobs are processed twice at the same time. I know a screenshot isn't evidence, but I can't reproduce it (it only appears from time to time). [screenshot of the duplicated job executions]

I have 2 backends with 1 worker each, listening to the same queue.

By the way, with the normal Laravel workers (which I used for more than a year) I never had this problem.

halaei commented 5 years ago

@dpde @sergiq Are any of the following helpful:

  1. Make sure different applications that are supposed to be isolated don't share the same Redis database (see the sketch after this list).
  2. There is a chance that a job takes longer than config('queue.connections.redis.retry_after') seconds and gets retried by another worker.
  3. [If you have multiple physical servers] The servers must be synced in time, e.g. via NTP. Otherwise, jobs reserved by one server might be considered expired by another.
  4. There is a chance that you are detecting duplicates for a reason other than jobs being picked up by multiple workers. For example, maybe you push duplicate jobs onto the queue because your database transactions are not at the correct isolation level.
  5. Horizon and the Laravel Redis queue driver don't support Redis clustering.
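
On point 1, a minimal sketch of isolating two apps that share one Redis server, assuming the default config layout (the prefix and database numbers are illustrative):

// config/database.php
'redis' => [
    'client' => 'phpredis',

    'options' => [
        // A per-app key prefix keeps the apps' keys apart.
        'prefix' => env('REDIS_PREFIX', 'app-one:'),
    ],

    'default' => [
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'port' => env('REDIS_PORT', 6379),
        // Each app gets its own logical database: 0, 1, 2, ...
        'database' => env('REDIS_DB', 0),
    ],
],
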
sergiq commented 5 years ago

Thanks @halaei for your answer, but:

  1. There is only one database, with one app.
  2. The job is processed at the same time (I have both timestamps from when the job starts, and both are within the same second).
  3. Both servers have NTP, of course.
  4. No, the job event only writes one log entry (one job).
  5. Yes, I know.

Also, this never happened with the normal Laravel workers (in more than 1 year).

halaei commented 5 years ago

@sergiq Can you explain your screenshot? What are the rows and columns, and how does it show jobs being picked up twice?

sergiq commented 5 years ago

The screenshot is a small part of the job information (uuid). The picture shows when the job was executed (the first field is just an auto-increment, and the last one is who performed the action).

For now I have removed Horizon and gone back to the plain Laravel queue in the production environment. I am going to dig deeper into what's going on in the code and why.

halaei commented 5 years ago

The Horizon Redis queue driver is just a wrapper around the Laravel Redis queue driver; it doesn't actually touch the queues, but fires some events for monitoring. It also changes the job IDs to incrementing integers. I don't see how these changes could possibly cause jobs to be handled twice. To me, the screenshot doesn't rule out 2 distinct jobs being simultaneously handled by 2 workers. It would be very helpful if you could provide more information.

sergiq commented 5 years ago

Hi @halaei, thanks for your response :) As I mentioned, a team is going to set up the same environment in dev as we have in production and try to pin down when and why this happens. This morning I switched the workers in production to Laravel's own, and everything is working OK (but it's too early to be sure). Let's see if we get the same behavior we've had over the last year, with 0 duplicates.

trip-somers commented 5 years ago

This is almost definitely happening to me but only on my Homestead instance and not on my production server. I'm going to try updating Homestead, since it's only happening there.

For anyone interested, here's my list of symptoms:

I know this isn't the best way to "prove" that two workers are picking up the same job at the same time, but there's not a lot else that makes any kind of sense.

As stated above, this is only happening on my Homestead, and it only started about 2 weeks ago after I upgraded from Laravel 5.6 to Laravel 5.8 (and Horizon from 2.x to 3.2.x). My Homestead and production servers are both running off the same composer lock file with Laravel 5.8.18 and Horizon 3.2.1. My Homestead is a little older, so I am going to attempt to upgrade that.

trip-somers commented 5 years ago

Updating Homestead seems to have corrected the issue for me, whatever it was.

bkuhl commented 3 years ago

I'm also seeing this issue in my production environment. The queue jobs have additional checks in place to ensure a job isn't processed more than once, but when 2 workers pick up the same job at the same time, the database record meant to ensure the job doesn't get processed twice doesn't help, because it hasn't yet been written by either job. This isn't a new issue and has been happening for a long time.
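
That kind of check-then-write guard is inherently racy under concurrency. A minimal sketch of an atomic-lock alternative inside a job's handle() method (the lock key, timeout, and $this->vote property are illustrative assumptions, not from this app):

use Illuminate\Support\Facades\Cache;

public function handle(): void
{
    // Two workers can both pass a plain database check before either
    // writes the guard record; an atomic lock closes that window.
    $lock = Cache::lock('process-vote:'.$this->vote->id, 30);

    if (! $lock->get()) {
        return; // another worker already holds this job's lock
    }

    try {
        // ... process exactly once ...
    } finally {
        $lock->release();
    }
}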

Horizon v5.7.3

Here's some of my queue config:


'balance'   => 'auto',
'processes' => 7,
'tries'     => 0, // indefinitely

// 10 seconds under the queue's retry_after to avoid overlap
'timeout'   => (Carbon::SECONDS_PER_MINUTE * 10) - 10, // just under 10 mins

graemlourens commented 3 years ago

@bkuhl We run 300 jobs/second on our stack and have never had this issue. It depends heavily on how you have set up retry_after on the connection, timeout on your jobs, etc.

What is your retry_after setting on the queue connection to Redis? It HAS to be slightly longer than the longest-running job on the queue, and also slightly longer than the timeout that is set.

bkuhl commented 3 years ago

Our queues operate in bursts and may be loaded with 500k jobs and 11 concurrent workers. We only experience this issue when the queue initially gets loaded with jobs; once there are plenty of jobs on the queue, it doesn't happen. The average execution time of the jobs being duplicated is 2-3 seconds.

The connection's retry_after is set to 10 mins at the moment, which is 10 seconds longer than the timeout.

graemlourens commented 3 years ago

@bkuhl How sure are you that you're not actually dispatching certain jobs multiple times by mistake? Are the job IDs the same or different?

bkuhl commented 3 years ago

That's a good question, and one that I should have investigated more before posting here. I'll have to do some digging on that and get back to you; I'm only assuming that's the case.

graemlourens commented 3 years ago

@bkuhl Sure, get back if you have any 'hard evidence' (which is sometimes extremely hard to obtain!). Happy to help. We do a lot of logging; for example, you can register a callback with Queue::before() and log the ID of the job somewhere, to see whether workers are actually picking up the same job with the same ID (which I doubt, to be honest).
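
A minimal sketch of the Queue::before() logging described above (the log channel and message format are illustrative):

// In a service provider's boot() method, e.g. AppServiceProvider.
use Illuminate\Queue\Events\JobProcessing;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Queue;

Queue::before(function (JobProcessing $event) {
    // Log every job ID just before it is processed; the same ID
    // appearing twice would mean a job really was picked up twice.
    Log::info('Processing job', [
        'id'       => $event->job->getJobId(),
        'name'     => $event->job->resolveName(),
        'attempts' => $event->job->attempts(),
    ]);
});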

bkuhl commented 3 years ago

I can confirm that some "duplicate jobs" on my end have a unique ID for each job, suggesting it's not the workers picking up duplicates, but an issue on the pushing side.

trip-somers commented 3 years ago

If you figure out what's causing the extra pushes, please come back and let us (at least me) know what it was. I've never been able to figure it out, but I would suspect it has something to do with the communication between Laravel and Redis.

Has anyone in this thread experienced this with a different queue provider?

graemlourens commented 3 years ago

@bkuhl We're one step further! :) I'd also be interested to hear if and when you figure out what is causing it. We have never experienced this, and have sometimes bulk-pushed up to 500'000 jobs in one go (in parallel), which never led to duplication that we know of.

bkuhl commented 3 years ago

I'm going to remove my other comments to keep this thread clean, but I've confirmed it was a logic issue on my end :-\

trip-somers commented 3 years ago

@bkuhl After reading your now-deleted comment yesterday, I gotta know what the logic issue was, lol

bkuhl commented 3 years ago

Haha. With Laravel 8 I upgraded these jobs to use batches; I was supplying some jobs when the batch was created, but also re-adding those same jobs via $batch->add() under certain conditions, leading to a handful of jobs being added twice. Definitely a 🤦 moment, but entirely my fault.
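
The bug pattern described here, in miniature (ProcessVote and $someCondition are illustrative stand-ins, not from the actual app):

use App\Jobs\ProcessVote;
use Illuminate\Support\Facades\Bus;

$jobs = collect($votes)
    ->map(fn ($vote) => new ProcessVote($vote))
    ->all();

// Jobs are added once when the batch is created...
$batch = Bus::batch($jobs)->dispatch();

if ($someCondition) {
    // ...and mistakenly added again here, so each runs twice.
    $batch->add($jobs);
}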

trip-somers commented 3 years ago

Dangit. You were my big hope for finally diagnosing why this happens to some people. Thanks for explaining.

bkuhl commented 3 years ago

Ah, sorry. I'm still monitoring things for the next few days, so maybe something will come to light. Here are my troubleshooting steps in case they help folks...

I noticed the UUIDs on my jobs were unique. At first I thought it was retries, but all attempts were 1, so I was able to rule that out. Second, I wondered where the UUIDs were coming from, so I added logging where the UUID was generated. That let me confirm that multiple jobs were indeed being fired by my app's logic. To figure out when and how, I modified this code to conditionally throw exceptions and expose the stack trace (you could also use xdebug or more extensive logging), which let me see what was causing the events.
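
A sketch of capturing the push side in the same spirit, assuming Laravel 8's JobQueued event (the listener placement and message format are illustrative):

// In AppServiceProvider::boot() — log every push, so duplicate
// dispatches show up at their source rather than at the worker.
use Illuminate\Queue\Events\JobQueued;
use Illuminate\Support\Facades\Event;
use Illuminate\Support\Facades\Log;

Event::listen(function (JobQueued $event) {
    Log::info('Job queued', [
        'id'  => $event->id,
        'job' => get_class($event->job),
    ]);
});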

adddz commented 3 years ago

Is this issue still happening? I'm about to move to a multi-server setup with multiple worker servers. Once I run Horizon on all the worker servers, how will I be able to access the /horizon UI web page when the app servers are behind a load balancer? The worker servers don't accept any HTTP traffic; they just run jobs.

bkuhl commented 3 years ago

No, I mentioned in my last comment this was an issue in my app's logic.

adddz commented 3 years ago

> No, I mentioned in my last comment this was an issue in my app's logic.

Yes, I've read that. But how are you able to access the Horizon URL when Horizon is running only on the worker servers?

bkuhl commented 3 years ago

I'm happy to help, but this discussion is completely off-topic for this thread, so you should post it on Stack Overflow.