magemojo / m2-ce-cron

Magento 2 cron project to fix bugs and common cron issues.
https://magemojo.com
MIT License
165 stars 45 forks source link

Jobs executing multiple times during one cron execution #40

Closed JesKingDev closed 6 years ago

JesKingDev commented 6 years ago

What version of Magento are you using (the full version ex. 2.1.11)?

I currently have 2 environments exhibiting the behavior: Environment TR Magento Open Source 2.2.2 single node

Environment ICX Magento Open Source 2.2.0 multiple nodes crontab checks for 'leader' server and runs magento cron only if current server is the leader

What did you do? Installed crontab as per Example 1 at https://devdocs.magento.com/guides/v2.0/config-guide/cli/config-cli-subcommands-cron.html. Created a custom cron group to process our orders to a backend fulfillment system.

What did you expect to see? Group cron jobs are executed once on schedule (currently every 2 minutes for Environment TR and every 10 minutes for environment ICX).

What did you see instead? Job is sometimes executed 2 or 3 times with the same timestamp. I have observed from log files that at least 2 different jobs (different vendors/schedules) are experiencing sporadic instances of duplicate or triplicate executions.

screen_shot_2018-06-25_at_20_49_35
gnuzealot commented 6 years ago

Do you have a multinode environment and are executing the cron on all the nodes?

JesKingDev commented 6 years ago

We have a multinode environment, yes, but we are doing a "leader" test before running the cron, to protect against the crons executing on all nodes. I've confirmed in the logs that only one server in the cluster is executing the magento cron and the others are skipping execution. Further, the log file snippet above is from the leader server's log file and isn't a combined log from multiple nodes.

gnuzealot commented 6 years ago

The cron runs as a service so it executes once and keeps running, so if it's running on one node and the leader changes it's going to keep running on the node that was the leader and then start on the new leader as well. There would need to be something that kills that process on what was the leader or you'll get overruns.

JesKingDev commented 6 years ago

@gnuzealot Thanks for digging into this issue. Screenshot below is the log that's generated when the * cron runs. The one on the left is the leader. I'm testing every time it runs, so there shouldn't be any overlap. image

JesKingDev commented 6 years ago

@gnuzealot As luck would have it, I was able to capture the cron processes executing multiple times. In the screenshot below, you can see that the same schedule has been executed twice. The scheduleid is 3235534 and there are two PIDs - 8404 and 8405. They have been launched simultaneously, which results in each order being processed twice.

Your help with this is very much appreciated. I am a big fan of this extension (which is why I'm installing it on all my client environments) however this duplicate processing of orders is a big deal and can't be sustained.

image

Another capture shows that Magento core schedules are also being duplicated.

image
gnuzealot commented 6 years ago

So I can see in the one screenshot that there are two parent cron processes running as well. Pid 3045 and 6131. That shouldn't be occurring. There is a check on the execution of the process that looks for the existence of var/cron/cron.pid if that file exists and the pid in there is an active process id then the code will exit and just write the line that it healchecked the service to the cron.log. For some reason that check failed and it started up and ran along picking up all the same jobs. Somehow the cron.pid wasn't found. I'll think about ways to make that check more robust to catch that case, however it was caused.

JesKingDev commented 6 years ago

Yes, @gnuzealot I believe you're right. Yesterday, I killed all magento cron pids and this morning I see that there are two that are running again this morning. I've poured over the code and don't see how it's possible, but there's obviously some scenario that the code isn't catching. I am looking into this as well and if I come up with anything I'll submit a PR to you. Thanks for your help.

image

JesKingDev commented 6 years ago

@gnuzealot I've opened https://github.com/magemojo/m2-ce-cron/pull/42 which has been running successfully in 5 of my environments for a few days to prevent overruns and solve this issue.

gnuzealot commented 6 years ago

PR #42 has been merged