humanmade / Cavalcade

A better wp-cron. Horizontally scalable, works perfectly with multisite.
https://engineering.hmn.md/projects/cavalcade/
Other
519 stars 46 forks source link

Some CRON events are not executed (Large WP Multisite) #113

Closed germanoronoz closed 3 years ago

germanoronoz commented 3 years ago

Hello,

We have Cavalcade installed in a WordPress multisite environment and we have realized that many events never get to be executed.

The network has more than 300 sites, domain-mapped.

In the database, the cavalcade_jobs table has more than 10,400 lines. The cavalcade_logs table has more than 6 million.

The problem we have is that there are a lot of events (recurring and non-recurring) that never get to be executed and are left with the "waiting" status. Some of these events have taken place but one or two months later.

We also see many jobs in "completed" status, but as we understand those should be removed from that table and go to the logs table.

Cavalcade is well installed and working because there are many other (recurring) events that are being executed.

We have the native WordPress cron disabled in the wp-config —> define( 'DISABLE_WP_CRON', true );

The CPU sometimes gets highly-loaded, when we have traffic peaks, but we think that is not the reason.

Any help would be appreciated, even with premium support from HumanMade if available.

Thanks a lot in advance, Regards

germanoronoz commented 3 years ago

Notes:

Thanks again!

rmccue commented 3 years ago

The problem we have is that there are a lot of events (recurring and non-recurring) that never get to be executed and are left with the "waiting" status. Some of these events have taken place but one or two months later.

It's highly unusual that you'd have waiting jobs go past their specified run time, but this could be an indication that you are getting a queue backlog due to too many tasks.

The runner works by continually querying for jobs to run; try querying this manually (without the LIMIT 1) to see if it fetches your waiting jobs. If this query returns more than a job or two, you may have a large backlog, and your servers are just struggling to process it.

We recommend automatic scaling based on the size of this backlog; we generate a PendingJobs metric in CloudWatch, and then scale Cavalcade's ECS cluster up when >=10 jobs have been pending across 3 minutes.

We also see many jobs in "completed" status, but as we understand those should be removed from that table and go to the logs table.

Completed jobs aren't removed from the database, they're just marked as completed, and a log entry is added which corresponds to it.

I wouldn't expect your table size to be an issue here; we're running sites with millions of jobs (i.e. one I just checked has 8,804,387 entries) with no huge issues, although we do occasionally prune the tables to save on storage.

germanoronoz commented 3 years ago

Hello Ryan,

First of all, thanks a lot for your reply.

When running this query:

SELECT * FROM prefix_cavalcade_jobs WHERE nextrun < NOW() AND status = "waiting" ORDER BY nextrun ASC

We are getting 10900+ results!

Those are since june that we installed Cavalcade until today. This is a dedicated server with 12 cores, 128GB RAM, etc., so we should have enough CPU I think.

The weird thing is that there are newer events that get executed on time...

For instance, taking this event with ID 3309 as an example:

image

As you can see next run was on 2021-06-29 07:05:02, but should be fired every minute.

I has been executed, if I go to the logs table I can see there the row:

image

But if I go back to the jobs table and look for it, it's still there (which is correct) and the next run does change to one minute later, i.e. 2021-06-29 07:06:02, BUT this makes one-time events that get scheduled for today for example to never be executed...

And other problem I am finding is that if I run one time (non-recurring) events using WP CLI, e.g.:

wp cavalcade run 5249

The CLI does not complain, but nothing happens, that job remains waiting.

Regarding completed jobs, is there any filter we could use to make cavalcade not store those on the main table?

Thanks again!!

germanoronoz commented 3 years ago

The weird thing is that there are newer events that get executed on time...

This was not true, sorry.

rmccue commented 3 years ago

We are getting 10900+ results!

Sounds like your core issue is that jobs are getting created faster than they can get handled then :D

You'll need to run more Cavalcade tasks, either through more Cavalcade Runner instances, or through running more tasks (see below)

For 300 sites with that many jobs, something closer to 20-30 parallel tasks would make most sense, but you'll need to measure this and work out those numbers yourself.

This is a dedicated server with 12 cores, 128GB RAM, etc., so we should have enough CPU I think.

Note that Cavalcade Runner will only run 4 tasks in parallel by default: https://github.com/humanmade/Cavalcade-Runner/blob/master/inc/class-runner.php#L36 - there's no great way to override that number right now, so you'll have to edit the source. (Happy to patch that! It should really be a CLI parameter so you can reconfigure in your systemd config.)

Our systems use horizontal scaling rather than vertical scaling, so we run multiple instances of Cavalcade in parallel and don't hit problems with that, but in your case (single, large server) increasing this number is more likely to be effective.

Regarding completed jobs, is there any filter we could use to make cavalcade not store those on the main table?

Up to how you want to handle that, Cavalcade isn't opinionated.

One way that you can do it is by scheduling a recurring job every month to delete completed jobs older than a month; i.e. DELETE FROM wp_cavalcade_jobs WHERE nextrun < DATE_SUB( NOW(), INTERVAL 30 DAY ) AND status = 'completed' - Note that you'll probably want to do the same on your logs table, otherwise the job IDs in the logs will be invalid :)

germanoronoz commented 3 years ago

Thanks again, it was helpful.

there's no great way to override that number right now, so you'll have to edit the source. (Happy to patch that! It should really be a CLI parameter so you can reconfigure in your systemd config.)

Maybe a wp-config constant?

I run this query:

UPDATE loremipsum_cavalcade_jobs SET nextrun = '2021-01-01 00:00:00', start = '2021-01-01 00:00:00' WHERE schedule IS NULL AND nextrun < NOW() AND status = "waiting" ORDER BY nextrun ASC

With this I have set all the one-time events to a date before those that seem to be super delayed, to get them to run earlier, with higher priority.

All those have been already executed and cleared now (great!!) and when running this query:

SELECT * FROM pom_standard_cavalcade_jobs WHERE nextrun < NOW() AND status = "waiting" ORDER BY nextrun ASC

Now we can only see recurring events with a super-due "nextrun" date. I think this is related to when we installed Cavalcade, may be we had a problem that caused this collapse.

But the issue is that the nextrun date gets set regarding previous value plus interval.

I think it should be set regarding the time it was executed plus interval instead.

In this way we would ensure that they are always set to a future date, respecting those that are ahead. As it is now, if this happens (having many recurring events with a long overdue nextrun date), the date of nextrun never exceeds today's date, so this error is dragged on ad infinitum.

What do you think?

rmccue commented 3 years ago

I think it should be set regarding the time it was executed plus interval instead.

This is an intentional design decision of Cavalcade; for example, if you have daily reports generated, these need to run for each day, and skipping any can cause problems. You can override this in your own job by scheduling manually after each run rather than using the built-in recurrence.

When you're manually updating them, better to set it to the current time if you don't need it at any particular recurrence.

germanoronoz commented 3 years ago

OK, understood, thanks.

One last question: we updated this: https://github.com/humanmade/Cavalcade-Runner/blob/master/inc/class-runner.php#L36

To 30, but when filtering the jobs with "running" status we only see 6 at a time.

This was also 6 (and not 4) before changing anything.

Regards

rmccue commented 3 years ago

To 30, but when filtering the jobs with "running" status we only see 6 at a time.

Hmm, not sure what could be causing that; you'd have to debug and look deeper to find this one out. It's worth checking the log and seeing if you're seeing "Out of workers" messages.

It's controlled by this code: https://github.com/humanmade/Cavalcade-Runner/blob/0dfb42d505e9cd870a11366c49ee680d327c961a/inc/class-runner.php#L118-L124

germanoronoz commented 3 years ago

Hmm, not sure what could be causing that; you'd have to debug and look deeper to find this one out. It's worth checking the log and seeing if you're seeing "Out of workers" messages.

There were a few of those.

I killed the cavalcade runner process and started again, and "running" processes started to get higher.

I also discovered that when running this query:

SELECT * FROM pom_standard_cavalcade_jobs WHERE nextrun < NOW() AND status = "waiting" ORDER BY nextrun ASC

NOW() uses the server time, which is UTC+2, while the firing order is based on UTC, and that was the reason why there were so many delayed events, many of them were not really delayed.

I changed /etc/my.cnf with:

[mysqld_safe] timezone = UTC

Apart from that, I agree to what you said:

Sounds like your core issue is that jobs are getting created faster than they can get handled then :D

So we will keep looking into it.

Thanks so much for your help. Regards