Open oddnoc opened 5 years ago
Hi @oddnoc, thanks for raising this issue! Can you confirm whether this has been occurring for a while, in previous versions of SilverStripe CMS / Queued Jobs, or if it has just started after an update?
I'm not sure when this started, sorry. But I did manage to capture a snapshot of what's happening. I have a vagrant box with zfs, so I can roll the database back at will. Using this platform, I captured the output of ps as fast as possible whilst running the queue.
The command I ran, with its output:
./vendor/bin/sake dev/tasks/ProcessJobQueueTask queue=immediate vagrant@diodes
Running Task Symbiote\QueuedJobs\Tasks\ProcessJobQueueTask
[2019-10-11 10:31:57] Running Clean up old jobs from the database and others from queue 1.
[2019-10-11 10:32:00] No new jobs on queue 1
[2019-10-11 10:32:01] No new jobs on queue 1
[2019-10-11 10:32:02] No new jobs on queue 1
[2019-10-11 10:32:03] No new jobs on queue 1
kill: 20615: No such process
I was capturing ps
like so:
while true; do ps aux >> processes; done
grep 20615 processes
yielded many lines, ending with these:
vagrant 20615 12.0 1.7 119584 70808 0 R+ 17:32 0:00.79 /usr/local/bin/php /usr/home/vagrant/Project/docroot/vendor/silverstripe/framework/cli-script.php dev/tasks/ProcessJobQue
vagrant 21338 0.0 0.0 440 352 0 R+ 17:32 0:00.00 sh -c kill -9 '20615'
Additional version info:
In this environment, the message from kill happens only when a very specific set of QueuedJobDescriptor records exists in the database. Deleting any one of these records prevents the message from appearing. On this box with this set of records, the issue is 100% reproducible. I suspect this is a race condition, because it's happening intermittently in production with quite a different set of records.
+------+-----------------------------------------------------------------------------+---------+---------------------------------------------------------------+---------------------+
| ID | JobTitle | JobType | Implementation | StartAfter |
+------+-----------------------------------------------------------------------------+---------+---------------------------------------------------------------+---------------------+
| 5249 | Export a CSV of a Gridfield | 1 | CLIENT\Web\Report\CustomReportSupport\GenerateCSVImmediateJob | NULL |
| 5250 | Upload parts for /products/power-management/led-drivers/flash-led-drivers/. | 1 | CLIENT\Web\Queuedjob\ProductFolderUploadJob | NULL |
| 5251 | Clean up old jobs from the database | 1 | Symbiote\QueuedJobs\Jobs\CleanupJob | NULL |
| 5252 | Clean up old jobs from the database | 1 | Symbiote\QueuedJobs\Jobs\CleanupJob | 2019-08-20 13:07:55 |
+------+-----------------------------------------------------------------------------+---------+---------------------------------------------------------------+---------------------+
We've started occasionally receiving more verbose output from cron
. I hope this is helpful in tracking this issue down.
kill: 25326: No such process [2020-01-23 00:05:03] error-log.ERROR: Uncaught Exception Error: "Call to a member function hasMethod() on null" at /var/silverstripe/live/releases/A/docroot/vendor/symbiote/silverstripe-queuedjobs/src/Jobs/DoormanQueuedJobTask.php line 161 {"exception":"[object] (Error(code: 0): Call to a member function hasMethod() on null at /var/silverstripe/live/releases/A/docroot/vendor/symbiote/silverstripe-queuedjobs/src/Jobs/DoormanQueuedJobTask.php:161)"} []
As a first thing, I recommend upgrading to symbiote/silverstripe-queuedjobs
4.5
latest.
I couldn't reproduce the issue after applying these two fixes:
kill: 47765: No such process
errors@oddnoc Could you please try reproduce the issue after applying those two fixes?
@mfendeksilverstripe When I try to merge both fixes, I get conflicts. Could you resolve those into an integration branch?
That's because one goes to 4.5
and the other one goes to 4
. I suggest to cherry pick the commits instead of merging.
I had to do more than just cherry-pick. See the pu branch on my fork. I have deployed that branch and will know whether it works in a few days, to allow enough time for some jobs to be run from cron.
We got a result sooner than expected:
[2020-06-16 22:40:01] error-log.DEBUG: [2020-06-16 22:40:01] - Queued Jobs - Failed to acquire job lock Failed to read job lock 854 {"file":"/var/silverstripe/live/releases/6/vendor/symbiote/silverstripe-queuedjobs/src/Services/QueuedJobService.php","line":762} []
This isn't necessarily an issue - see comments near the line where the exception gets throw:
// note that error here may not be an issue as failing to acquire a job lock is a valid state
// which happens when other process claimed the job lock first
This is why this message is debug level. Apart from this, have you noticed any issues?
Since that one message, cron has been silent for days and queued jobs have been running normally.
The ProcessJobQueueTask (run through cron) is occasionally but persistently sending emails saying
kill: 47765: No such process
(different process ID each time). This seems to happen only for the immediate queue. Sorry for the lack of details; I don't see in the code where the task even tries to invokekill
.Version info