symbiote / silverstripe-queuedjobs

A module that provides interfaces for scheduling jobs for certain times.
BSD 3-Clause "New" or "Revised" License
57 stars 73 forks source link

kill messages from cron #265

Open oddnoc opened 4 years ago

oddnoc commented 4 years ago

The ProcessJobQueueTask (run through cron) is occasionally but persistently sending emails saying kill: 47765: No such process (different process ID each time). This seems to happen only for the immediate queue. Sorry for the lack of details; I don't see in the code where the task even tries to invoke kill.

Version info

Cheddam commented 4 years ago

Hi @oddnoc, thanks for raising this issue! Can you confirm whether this has been occurring for a while, in previous versions of SilverStripe CMS / Queued Jobs, or if it has just started after an update?

oddnoc commented 4 years ago

I'm not sure when this started, sorry. But I did manage to capture a snapshot of what's happening. I have a vagrant box with zfs, so I can roll the database back at will. Using this platform, I captured the output of ps as fast as possible whilst running the queue.

The command I ran, with its output:

./vendor/bin/sake dev/tasks/ProcessJobQueueTask queue=immediate                                    vagrant@diodes
Running Task Symbiote\QueuedJobs\Tasks\ProcessJobQueueTask

[2019-10-11 10:31:57] Running Clean up old jobs from the database and others from queue 1.
[2019-10-11 10:32:00] No new jobs on queue 1
[2019-10-11 10:32:01] No new jobs on queue 1
[2019-10-11 10:32:02] No new jobs on queue 1
[2019-10-11 10:32:03] No new jobs on queue 1
kill: 20615: No such process

I was capturing ps like so:

while true; do  ps aux >> processes;  done

grep 20615 processes yielded many lines, ending with these:

vagrant       20615 12.0  1.7  119584   70808  0  R+   17:32     0:00.79 /usr/local/bin/php /usr/home/vagrant/Project/docroot/vendor/silverstripe/framework/cli-script.php dev/tasks/ProcessJobQue                                                                                                     
vagrant       21338  0.0  0.0     440     352  0  R+   17:32     0:00.00 sh -c kill -9 '20615'

Additional version info:

In this environment, the message from kill happens only when a very specific set of QueuedJobDescriptor records exists in the database. Deleting any one of these records prevents the message from appearing. On this box with this set of records, the issue is 100% reproducible. I suspect this is a race condition, because it's happening intermittently in production with quite a different set of records.

+------+-----------------------------------------------------------------------------+---------+---------------------------------------------------------------+---------------------+
| ID   | JobTitle                                                                    | JobType | Implementation                                                | StartAfter          |
+------+-----------------------------------------------------------------------------+---------+---------------------------------------------------------------+---------------------+
| 5249 | Export a CSV of a Gridfield                                                 | 1       | CLIENT\Web\Report\CustomReportSupport\GenerateCSVImmediateJob | NULL                |
| 5250 | Upload parts for /products/power-management/led-drivers/flash-led-drivers/. | 1       | CLIENT\Web\Queuedjob\ProductFolderUploadJob                   | NULL                |
| 5251 | Clean up old jobs from the database                                         | 1       | Symbiote\QueuedJobs\Jobs\CleanupJob                           | NULL                |
| 5252 | Clean up old jobs from the database                                         | 1       | Symbiote\QueuedJobs\Jobs\CleanupJob                           | 2019-08-20 13:07:55 |
+------+-----------------------------------------------------------------------------+---------+---------------------------------------------------------------+---------------------+
oddnoc commented 4 years ago

We've started occasionally receiving more verbose output from cron. I hope this is helpful in tracking this issue down.

kill: 25326: No such process [2020-01-23 00:05:03] error-log.ERROR: Uncaught Exception Error: "Call to a member function hasMethod() on null" at /var/silverstripe/live/releases/A/docroot/vendor/symbiote/silverstripe-queuedjobs/src/Jobs/DoormanQueuedJobTask.php line 161 {"exception":"[object] (Error(code: 0): Call to a member function hasMethod() on null at /var/silverstripe/live/releases/A/docroot/vendor/symbiote/silverstripe-queuedjobs/src/Jobs/DoormanQueuedJobTask.php:161)"} []

mfendeksilverstripe commented 4 years ago

As a first thing, I recommend upgrading to symbiote/silverstripe-queuedjobs 4.5 latest.

I couldn't reproduce the issue after applying these two fixes:

@oddnoc Could you please try reproduce the issue after applying those two fixes?

oddnoc commented 4 years ago

@mfendeksilverstripe When I try to merge both fixes, I get conflicts. Could you resolve those into an integration branch?

mfendeksilverstripe commented 4 years ago

That's because one goes to 4.5 and the other one goes to 4. I suggest to cherry pick the commits instead of merging.

oddnoc commented 4 years ago

I had to do more than just cherry-pick. See the pu branch on my fork. I have deployed that branch and will know whether it works in a few days, to allow enough time for some jobs to be run from cron.

oddnoc commented 4 years ago

We got a result sooner than expected:

[2020-06-16 22:40:01] error-log.DEBUG: [2020-06-16 22:40:01] - Queued Jobs - Failed to acquire job lock Failed to read job lock 854 {"file":"/var/silverstripe/live/releases/6/vendor/symbiote/silverstripe-queuedjobs/src/Services/QueuedJobService.php","line":762} []

mfendeksilverstripe commented 4 years ago

This isn't necessarily an issue - see comments near the line where the exception gets throw:

// note that error here may not be an issue as failing to acquire a job lock is a valid state
// which happens when other process claimed the job lock first

This is why this message is debug level. Apart from this, have you noticed any issues?

oddnoc commented 4 years ago

Since that one message, cron has been silent for days and queued jobs have been running normally.