laravel / horizon

Dashboard and code-driven configuration for Laravel queues.
https://laravel.com/docs/horizon
MIT License

laravel horizon jobs stuck in pending #1280

Closed khalidgxg closed 5 months ago

khalidgxg commented 1 year ago

Horizon Version

5.15

Laravel Version

10.09.0

PHP Version

8.1

Redis Driver

Predis

Redis Version

2.1.2

Database Driver & Version

No response

Description

Hi all,

I am writing to report a bug related to the job processing in Laravel Horizon. Specifically, I have encountered a situation where a job appears to be stuck in the "pending" status despite having a "completed_at" timestamp, and another job is stuck in the "reserved" status without progressing further.

Here are the details of the problematic jobs:

In the first case, the job has a "completed_at" timestamp indicating its successful completion, but it remains in the "pending" status. On the other hand, the second job is stuck in the "reserved" status without progressing further.

Could you please investigate this issue and provide guidance on how to resolve it? It seems that the job statuses are not being updated correctly, causing confusion in monitoring and processing.

Thank you for your attention to this matter. I look forward to your response and assistance in resolving this bug.

Best regards, khalid

Steps To Reproduce

  1. Set up a Laravel application with Laravel Horizon installed. Make sure you have the necessary dependencies and configurations in place.
  2. Create a job that exhibits the issue. For example, you can create a custom job class StuckJob that performs some task, such as writing to a log file or making an API request.
  3. Configure your application to use Laravel Horizon as the queue driver. Ensure that the queue connections and supervisors are properly set up.
  4. Push multiple instances of the StuckJob to the queue using the Laravel Horizon job queueing mechanism. You can use the dispatch() function or Horizon-specific methods like Horizon::queue() to push the jobs.
  5. Monitor the Horizon dashboard to observe the job processing. Keep an eye on the status of the jobs you pushed.
  6. Check if any of the jobs get stuck in the "pending" status despite having a "completed_at" timestamp. Note down the relevant job ID, connection, queue, and payload details.
  7. Repeat the process with another job to observe if any jobs get stuck in the "reserved" status without progressing further. Again, note down the relevant job ID, connection, queue, and payload details.
  8. Take note of the Laravel Horizon version you are using in your application.

driesvints commented 1 year ago

Hey there,

Can you first please try one of the support channels below? If you can actually identify this as a bug, feel free to open up a new issue with a link to the original one and we'll gladly help you out.

Thanks!

khalidgxg commented 1 year ago

ok thanks

fabriciojs commented 1 year ago

Got the same thing happening, latest Laravel & Horizon versions.

Jobs keep accumulating on the Pending list in the dashboard even though they were processed correctly and the queues are empty.

I could not find more accurate reports or solutions specific to this situation.

@khalidgxg did you learn what was causing it for you?

And @driesvints, I assume that if someone can provide a repo with Laravel + Horizon that consistently reproduces the issue as reported, it would then qualify for you to pursue as a bug, right? I might try to put something together.

cccdz commented 1 year ago

@driesvints @fabriciojs https://github.com/laravel/horizon/issues/1034

driesvints commented 1 year ago

Hey all. This should have been fixed already in 4.x, does that not work for you? https://github.com/laravel/telescope/pull/1349

driesvints commented 1 year ago

Telescope that is.

cccdz commented 1 year ago

@driesvints https://github.com/laravel/horizon/issues/1185 This happens when the consumer finishes the job before the producer-side event has been handled. Horizon tracks jobs through events, and the job is first pushed onto the queue and only then is the event fired, so a worker can finish consuming the job before the pushed event is executed.
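
The ordering problem described here can be illustrated with a small in-memory sketch. This is not Horizon's code: the `FakeJobStore` class and its `pushed` and `completed` methods are hypothetical stand-ins for what the event listeners write, and plain arrays stand in for Redis.

```php
<?php

// Hypothetical in-memory stand-in for Horizon's job storage (not the real API).
final class FakeJobStore
{
    /** @var array<string, array<string, mixed>> job hashes keyed by job ID */
    public array $jobs = [];

    /** @var array<string, true> IDs currently on the pending list */
    public array $pending = [];

    // Runs when the "job pushed" event is handled on the producer side.
    public function pushed(string $id): void
    {
        $this->jobs[$id]['status'] = 'pending';
        $this->pending[$id] = true;
    }

    // Runs when the worker finishes the job.
    public function completed(string $id): void
    {
        $this->jobs[$id]['status'] = 'completed';
        $this->jobs[$id]['completed_at'] = microtime(true);
        unset($this->pending[$id]);
    }
}

// Expected order: pushed, then completed.
$ok = new FakeJobStore();
$ok->pushed('job-1');
$ok->completed('job-1');

// Racy order: the worker completes the job before the pushed event is handled.
$racy = new FakeJobStore();
$racy->completed('job-2');
$racy->pushed('job-2');

var_dump($ok->jobs['job-1']['status']);                // string(9) "completed"
var_dump($racy->jobs['job-2']['status']);              // string(7) "pending"
var_dump(isset($racy->jobs['job-2']['completed_at'])); // bool(true)
```

In the racy order the job ends up with both a completed_at value and a pending status, which matches the symptom reported in this issue.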

cccdz commented 1 year ago

[image]

github-actions[bot] commented 1 year ago

Thank you for reporting this issue!

As Laravel is an open source project, we rely on the community to help us diagnose and fix issues as it is not possible to research and fix every issue reported to us via GitHub.

If possible, please make a pull request fixing the issue you have described, along with corresponding tests. All pull requests are promptly reviewed by the Laravel team.

Thank you!

driesvints commented 1 year ago

Thank you. We'd appreciate any help through a PR for this.

cccdz commented 1 year ago

[image] [image]

I have a suggestion: these two listener methods could be implemented as Lua scripts. In the pushed method, the Lua script would check whether the key already exists. If it does, the completed event has already been handled, so the job is not written to the pending sorted set and only the hash key is updated. I do not know whether this is feasible.
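
The proposed check can be pictured with a plain PHP sketch. It is neither a Lua script nor Horizon code: the function name `guardedPushed` and the array layout are assumptions made for illustration. In a real implementation the existence check and the write would have to happen atomically, which is presumably why a Lua script was suggested.

```php
<?php

/**
 * Hypothetical guarded version of the "pushed" handler.
 *
 * @param array<string, array<string, mixed>> $jobs    job hashes keyed by job ID
 * @param array<string, true>                 $pending IDs on the pending list
 */
function guardedPushed(array &$jobs, array &$pending, string $id, string $payload): void
{
    if (isset($jobs[$id])) {
        // The completed handler already ran: update the hash only
        // and do not add the job to the pending list.
        $jobs[$id]['payload'] = $payload;

        return;
    }

    $jobs[$id] = ['status' => 'pending', 'payload' => $payload];
    $pending[$id] = true;
}

$jobs = [];
$pending = [];

// Normal case: the pushed handler runs first.
guardedPushed($jobs, $pending, 'job-1', '{}');

// Racy case: completion was recorded before the pushed handler ran.
$jobs['job-2'] = ['status' => 'completed'];
guardedPushed($jobs, $pending, 'job-2', '{}');

var_dump(array_keys($pending));     // only "job-1" is listed
var_dump($jobs['job-2']['status']); // string(9) "completed"
```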

joelvh commented 1 year ago

@driesvints is there a way to release these reserved jobs so they can be processed again?

driesvints commented 1 year ago

I don't know, sorry.

joelvh commented 1 year ago

@themsaid do you maybe know if it's possible to release these jobs that get stuck as reserved to be processed again? Thanks!

graemlourens commented 11 months ago

I'd like to add here as well that we're experiencing the same issue with jobs being stuck in 'pending' even if completed_at is present in horizon dashboard, and the jobs actually completed successfully.

We have not been able to determine the root cause. We dispatch millions of jobs a month, and it affects only approximately 20 per day. Still, it's rather unsettling and we'd love to find a solution.

graemlourens commented 11 months ago

@pnlinh you are WAY out of date with laravel, horizon & php. Please update to most recent versions and test again. There is no sense in asking for help with such outdated versions.

pnlinh commented 11 months ago

Thanks for your suggestion, but my project cannot upgrade right now. I added a delay to the jobs, and that seems to work.

driesvints commented 11 months ago

@pnlinh please try to focus the discussion on supported Laravel/Horizon versions, thanks.

fwilliamconceicao commented 9 months ago

Even with updated versions I still have this issue.

ithuis commented 9 months ago

I set 'TELESCOPE_JOB_WATCHER' to false in the config, and the jobs all came flooding back into completed.

"Watchers\JobWatcher::class => env('TELESCOPE_JOB_WATCHER', false)"

As mentioned in https://github.com/laravel/telescope/pull/1349#issuecomment-1645425830
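
For context, a rough sketch of where that line sits, assuming a standard Telescope install where watchers are listed in the `watchers` array of config/telescope.php. Only the JobWatcher entry comes from this comment; the surrounding structure is abbreviated and illustrative.

```php
<?php

// config/telescope.php (fragment, abbreviated for illustration)

use Laravel\Telescope\Watchers;

return [
    // ...other Telescope options left unchanged...

    'watchers' => [
        // Job watcher disabled unless TELESCOPE_JOB_WATCHER is set to true.
        Watchers\JobWatcher::class => env('TELESCOPE_JOB_WATCHER', false),

        // ...other watchers left unchanged...
    ],
];
```

If the config entry already reads the TELESCOPE_JOB_WATCHER variable, setting `TELESCOPE_JOB_WATCHER=false` in the .env file should have the same effect without editing the config.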

Kladislav commented 9 months ago

same issue..

driesvints commented 9 months ago

Hey all. Extra messages that you're experiencing this issue aren't really helpful. Instead, please try posting extra findings around the issue or help out with a PR, thanks.

lucaspanik commented 8 months ago

Quoting cccdz: "I have a suggestion: these two listener methods could be implemented as Lua scripts. In the pushed method, the Lua script would check whether the key already exists. If it does, the completed event has already been handled, so the job is not written to the pending sorted set and only the hash key is updated. I do not know whether this is feasible."

I've been facing this problem for almost 3 years, and the internal workaround in place is sleep(3) inside all jobs. https://github.com/laravel/horizon/issues/1034

The answer above makes sense, since with sleep(3) the events have time to be handled in the normal order.

Would this be a possible point of investigation?

fwilliamconceicao commented 8 months ago

This worked for me for a couple of months, but once the application scaled up and the workload grew, it became a huge headache.

What I'm doing right now is migrating everything to serverless services, jobs, and isolated applications written in C#.

The only way to stop this behavior is to stop using Horizon for huge workloads.

cccdz commented 8 months ago
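
use Laravel\Horizon\Contracts\JobRepository;
use Laravel\Horizon\JobPayload;
use Laravel\Horizon\Repositories\RedisJobRepository as HorizonRedisJobRepository;

class RedisJobRepository extends HorizonRedisJobRepository
{
    /**
     * Mark the job as reserved, first waiting briefly for it to become pending.
     *
     * @param string $connection
     * @param string $queue
     * @param JobPayload $payload
     * @return void
     * @throws RedisException
     */
    public function reserved($connection, $queue, JobPayload $payload): void
    {
        // Total time spent waiting, in microseconds.
        $totalTime = 0;

        // Wait while the Horizon job has not yet reached the pending status.
        // Note: redis() is not a Laravel helper; it is presumably the poster's own.
        while ('pending' !== redis('horizon')->hget($payload->id(), 'status')) {
            // Give up once the wait reaches 1 second.
            if ($totalTime >= 1000000) {
                break;
            }

            // Sleep for 5 ms.
            usleep(5000);

            $totalTime += 5000;
        }

        parent::reserved($connection, $queue, $payload);
    }
}

// Registered in a service provider:
$this->app->singleton(JobRepository::class, RedisJobRepository::class);

The issue can be temporarily avoided this way.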

fwilliamconceicao commented 8 months ago

This is a good solution, though. But have you tested it with a huge workload? My workload is very big, and when I started adding a 2000 sleep, everything started to overlap. I didn't try 5k. It might be a good workaround, but it's not a good solution.

cccdz commented 8 months ago

This checks for up to 1 second whether the job has reached the pending state, polling every 5 ms. Once the status is pending, the pushed event has been handled and the next operation can proceed. I also added a safeguard: if the event still has not been handled after 1 second, the loop gives up and the job stays in the pending list, but that extreme case almost never happens.

driesvints commented 5 months ago

Hi all. This issue is now one year old. There has been no recent activity on it and nobody seems to have attempted a PR. Therefore, we'll be closing this one. If anyone still finds a solution, we'd be more than willing to accept a PR. Thanks

graemlourens commented 5 months ago

Just for the record: the issue is happening for us nearly daily, even with a very low system baseload of 1 million jobs a day, whereby approximately 3-10 jobs still remain in pending, even if completed.

Still happens with most recent laravel and horizon version currently available.