Closed: nckrtl closed this issue 5 months ago
Hi @nckrtl. Thanks for the very detailed issue! This indeed seems a bit worrisome... At this time I have no idea what would cause this. I'll mention this internally but for now will keep this open and labeled to get some community help if anyone is able to look into this one.
Thank you for reporting this issue!
As Laravel is an open source project, we rely on the community to help us diagnose and fix issues as it is not possible to research and fix every issue reported to us via GitHub.
If possible, please make a pull request fixing the issue you have described, along with corresponding tests. All pull requests are promptly reviewed by the Laravel team.
Thank you!
I have a feeling it has to do with downscaling the processes. When you wait for the downscaling to finish, which takes about 5 seconds after Horizon is started (with no jobs in the queue), I seem to be able to reproduce the issue at a higher rate.
When fully commenting out the body of the scaleDown function in Laravel\Horizon\ProcessPool.php I can't seem to reproduce the issue at all.
When setting maxProcesses to 1 in the horizon config I can't seem to reproduce the issue either. This could point to the scaleDown function as well, since it won't get triggered when maxProcesses is set to 1.
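For reference, that setting lives in config/horizon.php; here is a minimal sketch of the change (the environment and supervisor names are placeholders, and the other keys are just illustrative):

// config/horizon.php (sketch; names and values are placeholders)
'environments' => [
    'local' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue' => ['default'],
            'balance' => 'auto',
            'minProcesses' => 1,
            'maxProcesses' => 1, // with a single max process, scaleDown() never triggers
        ],
    ],
],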
I think the processes being spawned are not created entirely equal. In the scaleDown function, the processes that need to be terminated are taken from the front of the processes array:
// Here we will slice off the correct number of processes that we need to terminate
// and remove them from the active process array. We'll be adding them to the array
// of terminating processes where they'll run until they are fully terminated.
$terminatingProcesses = array_slice(
    $this->processes, 0, $difference
);
And in removeProcesses():
protected function removeProcesses($count)
{
    array_splice($this->processes, 0, $count);

    $this->processes = array_values($this->processes);
}
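As a quick standalone illustration of that behaviour (plain PHP, not Horizon code; the process names are made up):

<?php

// Offset 0 operates on the FRONT of the array, so the oldest
// (first-spawned) process is the one selected for termination.
$processes = ['process-0', 'process-1', 'process-2'];

$terminating = array_slice($processes, 0, 1);
print_r($terminating); // ['process-0'], the first process in the pool

array_splice($processes, 0, 1);
print_r(array_values($processes)); // ['process-1', 'process-2'] remain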
I have a hunch that the first process in the array is somewhat special. When I make it take the processes to scale down from the back of the processes array, I can no longer seem to reproduce the issue:
scaleDown():
...
$terminatingProcesses = array_slice(
    $this->processes, -$difference
);
...
removeProcesses():
protected function removeProcesses($count)
{
    array_splice($this->processes, -$count);

    $this->processes = array_values($this->processes);
}
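And the equivalent illustration for the back-of-the-array variant:

<?php

// A negative offset operates on the BACK of the array, so the newest
// (most recently spawned) process is selected for termination instead.
$processes = ['process-0', 'process-1', 'process-2'];

$terminating = array_slice($processes, -1);
print_r($terminating); // ['process-2'], the last process in the pool

array_splice($processes, -1);
print_r(array_values($processes)); // ['process-0', 'process-1'] remain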
My knowledge about processes and Horizon is pretty limited, so I would have to do some research to find out whether this assumption about the importance of the order is correct. I will also do some more testing later, but if anyone is looking into this, I think the above information could be helpful.
One side note: when there are at least 2 processes active, the order doesn't seem to matter, as I can't reproduce the issue in that case. So maybe it's not entirely order related, or it only matters when there is exactly one process active.
This issue may be related to the one described in #1433. I have faced a similar issue where processes marked for termination by the scaleDown() function can be killed upon calling horizon:terminate.
@nckrtl does the PR from @tarexme solve your issue?
@driesvints yes, that indeed fixes it.
As soon as a process is terminating, it isn't seen as a running process anymore, so $this->processPools->map->runningProcesses() no longer includes the process that is terminating, while at the same time $this->terminatingProcesses() does show the process that's still doing work. $this->terminatingProcesses() is also empty once the terminating job finishes. So the PR definitely makes sense.
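To spell out the idea behind the fix (a hypothetical sketch, not the actual PR code; only the two method names quoted above are taken from Horizon, and the collapse()/count() chaining is my assumption):

// Hypothetical sketch: treat the supervisor as busy while either the
// running OR the terminating collections still contain processes.
$busyProcesses = $this->processPools->map->runningProcesses()->collapse()->count()
    + $this->terminatingProcesses()->count();

if ($busyProcesses > 0) {
    // keep waiting instead of letting horizon:terminate kill the worker
}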
Closing this as it seems to be resolved with #1454.
Horizon Version: 5.24.4
Laravel Version: 11.8.0
PHP Version: 8.3.7
Redis Driver: PhpRedis
Redis Version: 5.3.7
Database Driver & Version: sqlite, 3043002
Description
When there is 1 job being processed by a Horizon worker while a horizon:terminate is initiated, the worker process often stops before the job finishes. The job will get stuck in the pending state and will finally get marked as failed after the retry_after threshold. The job won't get retried after you restart the Horizon worker, although the Horizon UI will show it's been retried multiple times once it's marked as failed.
This behaviour only seems to occur when there is 1 job being processed while a horizon:terminate is initiated. When there are multiple jobs being processed, or if there are multiple jobs in the queue, the early shutdowns never seem to occur.
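For reference, the retry_after threshold mentioned above is the per-connection queue setting; the snippet below is Laravel's stock Redis connection from config/queue.php, shown only for context:

// config/queue.php, default Redis connection (for context)
'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),
    'retry_after' => 90, // seconds before an in-flight job is considered stalled
    'block_for' => null,
],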
I also made a video demonstrating the behaviour: https://www.youtube.com/watch?v=GRgw2LWyLto
0:00 Run 1: job gets stuck
0:27 Run 2: job finishes
1:04 Run 3: job finishes
1:47 Run 4: job gets stuck
2:17 Run 5: job finishes
2:55 Run 6: job gets stuck; also shows it won't continue after turning Horizon back on
3:49 Run 7: job finishes
4:33 Run 8: job gets stuck
I tested whether it was related to the cache driver, but it seems to happen with any cache driver. The database driver also seems irrelevant. I think some race condition is causing this behaviour, as it seems completely random.
I'm aware that this is an edge case, as I don't think it will happen often that Horizon is processing exactly one job while shutting down. However, I think it's a problem worth solving, as fixing it would make Horizon more reliable.
I'll investigate this further myself and dive into the Horizon package's source to see if I can find any issues. I just wanted to share this bug report early so I can get feedback and find out whether others can reproduce the issue as well.
Steps To Reproduce
Clone https://github.com/nckrtl/horizon-stops-too-soon and follow the README to perform the runs as shown in the video.