Closed: tfrommen closed this issue 3 years ago.
We dropped the nextrun index on the table mentioned in the issue description and saw a dramatic improvement in query performance, and hence in database load and Cavalcade job execution throughput.
The RDS CPU usage dropped from ~80% to 4% immediately after dropping the index.
Cavalcade job execution throughput can be seen in this screenshot: it was previously executing ~130 jobs per hour, and after the change it executed ~1.5K jobs in ~30 minutes. The only change in between was the dropped index.
Thanks for the detailed ticket and graphs folks.
The new indexes were added as a follow-up to #75, as more complex queries were being made as the site loaded. The theory was that it would speed them up. Evidently this turned out to be incorrect for larger sites, or once the jobs table grew to include a number of failed and completed jobs.
As Cavalcade is designed for larger sites, I am happy if it's removed.
As a completely ignorant question, does the order of the WHERE clause have an effect, i.e. is this faster:
SELECT *
FROM wp_cavalcade_jobs
WHERE site = 19
AND hook = 'batch_export_pages_0'
AND args = '<SERIALIZED-ARRAY>'
AND status IN( 'waiting', 'running' )
AND nextrun BETWEEN '1970-01-01 00:00:00' AND '2020-11-27 08:48:16'
ORDER BY nextrun ASC
LIMIT 1
;
Also, I gather the above is run while scheduling a single task within the next ten minutes or a zero timestamp. The time clause when scheduling such tasks could be simplified too:
SELECT *
FROM wp_cavalcade_jobs
WHERE site = 19
AND hook = 'batch_export_pages_0'
AND args = '<SERIALIZED-ARRAY>'
AND status IN( 'waiting', 'running' )
AND nextrun <= '2020-11-27 08:48:16'
ORDER BY nextrun ASC
LIMIT 1
I haven't tested, but I don't think the order of the WHERE clause would make a difference, as the nextrun column is still being used both for querying (in the WHERE clause) and for sorting (in the ORDER BY clause).
In general, MySQL will pick the index that requires examining the smallest number of rows. Looking at the Cavalcade jobs table, this can differ from project to project, though.
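To check which index MySQL actually settles on for a given installation, the query can be prefixed with EXPLAIN; this is just a sketch, and the key reported will depend entirely on the data in the table:

EXPLAIN
SELECT *
FROM wp_cavalcade_jobs
WHERE site = 19
AND hook = 'batch_export_pages_0'
AND args = '<SERIALIZED-ARRAY>'
AND status IN( 'waiting', 'running' )
AND nextrun BETWEEN '1970-01-01 00:00:00' AND '2020-11-27 08:48:16'
ORDER BY nextrun ASC
LIMIT 1;

The key and rows columns of the EXPLAIN output show the chosen index and the estimated number of rows to be examined.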
As I mentioned before, it would be possible to use IGNORE INDEX (nextrun) to tell MySQL not to use the nextrun index. Similarly, adding USE INDEX (status) would tell MySQL to use status, and thus not nextrun.
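Applied to the query above, such an index hint goes right after the table name; this is only a sketch, and whether the status index is actually the better choice would need to be verified against the project's data:

SELECT *
FROM wp_cavalcade_jobs IGNORE INDEX (nextrun) -- or: USE INDEX (status)
WHERE site = 19
AND hook = 'batch_export_pages_0'
AND args = '<SERIALIZED-ARRAY>'
AND status IN( 'waiting', 'running' )
AND nextrun BETWEEN '1970-01-01 00:00:00' AND '2020-11-27 08:48:16'
ORDER BY nextrun ASC
LIMIT 1;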
But I am still not convinced the nextrun index brings any value at all. Using Cavalcade in a meaningful way, I don't think you would ever ask the database something like "Give me the next scheduled job after timestamp X".
Instead, you would query for waiting or running jobs, or completed ones. Or you would ask for the time that a particular job is scheduled next. Or something like that.
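For illustration, a query of that kind, which a standalone nextrun index would actually serve well, might look like the following; this is hypothetical, and nothing in Cavalcade issues such a query:

SELECT *
FROM wp_cavalcade_jobs
WHERE nextrun > '2020-11-27 08:48:16'
ORDER BY nextrun ASC
LIMIT 1;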
From that same project, these are the top 5 queries over the last five days (other than the above one):
SELECT * FROM wp_cavalcade_jobs WHERE site = 1 AND hook = '<SOME-HOOK>' AND args = '<SERIALIZED-ARRAY>' AND status IN( 'waiting', 'running' ) ORDER BY nextrun ASC LIMIT 1
SELECT COUNT(*) as count from wp_cavalcade_jobs WHERE nextrun < NOW() AND status = 'waiting'
SELECT * FROM wp_cavalcade_jobs WHERE status = 'running' ORDER BY args LIMIT 0, 1000
SELECT * FROM wp_cavalcade_jobs WHERE status != 'completed' ORDER BY id DESC LIMIT 0, 1000
SELECT * FROM wp_cavalcade_jobs LIMIT 0, 1000
None of them would make good use of the nextrun index, I don't think.
Would be good to also get some input from either @rmccue or @roborourke...
Thanks for this in depth analysis @tfrommen and @Nikschavan. This seems like an overwhelming argument to remove the index.
Let's make that happen!
Agreed, let's drop it.
Thanks @tfrommen - this is now available in v2.0.2
Thanks folks, sorry I broke your product.
@peterwilsoncc I'll send you the invoice.
So I upgraded from 2.0.1 to 2.0.2 (using composer), and I'm still seeing the nextrun index. Am I expected to do something to trigger the db migration that drops it?
@archon810 you would need to run wp cavalcade upgrade (or trigger the upgrade routine some other way).
@tfrommen Thank you, that did it. What other ways trigger the upgrade routine? Shouldn't it run automatically at some point after the update? This is probably something that should be in release notes, otherwise nobody's going to get the benefits of the db change unless they know to run the upgrade command (and things would be a lot more disastrous if the db structure actually contained breaking changes).
@archon810 It doesn't run automatically as the upgrade operation can affect the database in load-affecting ways, such as locking writes or just doing a lot of writes. Cavalcade is designed to continue operating with a non-upgraded database (similarly to WordPress).
We can probably do a better job of documenting this though. :)
In #95 (👋 @peterwilsoncc 🙂), several new indexes on the jobs table were introduced, one of them being nextrun, on the column with the same name. I would like to question the benefit of having said index, and, in fact, illustrate why I think it actually is more of a performance burden than anything else.
Context
Firstly, the nextrun column is of type datetime.
Given the nature of the data, that is, information about cron tasks, it is highly unlikely to have several rows with the exact same timestamp. Sure, for high-traffic sites with lots of scheduled tasks, and/or bugs (or just logical errors) in the code scheduling said tasks, you might end up with a few tasks being scheduled for the exact same time. In reality, however, this is not often the case. And even if it was, I assume we would be talking about less than 10, and definitely less than 100, tasks. Nothing that an index would make a meaningful difference in.
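To get a rough idea of how selective the column is on a given installation, one can compare the total number of rows with the number of distinct nextrun values; this is just a sketch for verification, not something Cavalcade itself runs:

SELECT COUNT(*) AS total_rows,
COUNT(DISTINCT nextrun) AS distinct_nextrun_values
FROM wp_cavalcade_jobs;

If the two counts are close, the index is effectively the one-to-one mapping described below.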
Since no other index covers the nextrun column, MySQL (or whatever your DB is) will try and make good use of the dedicated nextrun index whenever the nextrun column is used for querying and/or sorting/grouping etc.
Problem
In theory, the nextrun index is a one-to-many mapping. However, in practice, it will most probably only exist as a one-to-one (or maybe one-to-few) map, and thus one level of unnecessary indirection.
I have not checked all (core) functionality, or rather the respective queries including nextrun, but pretty much all usage of the Job::get_jobs_by_query method that results in MySQL queries including nextrun BETWEEN ... AND ... or nextrun < ... will mean a huge performance burden.
Here is a query taken from a live project:
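(The query itself is missing here; it is presumably the one quoted earlier in the thread, i.e. something along these lines:)

SELECT *
FROM wp_cavalcade_jobs
WHERE site = 19
AND hook = 'batch_export_pages_0'
AND args = '<SERIALIZED-ARRAY>'
AND status IN( 'waiting', 'running' )
AND nextrun BETWEEN '1970-01-01 00:00:00' AND '2020-11-27 08:48:16'
ORDER BY nextrun ASC
LIMIT 1;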
Given a table with 1.7 million jobs, executing the above query takes about 4 seconds!
The 1970-01-01 00:00:00 timestamp (the Beginning of Time 😉) is due to using 0 in case an event's timestamp is more than 10 minutes in the past. However, the date range does not really pose an issue, and using something like 2020-01-01 00:00:00 instead will perform just as badly.
Possible Solutions
I see, at least, two ways to go about this:
If we can agree that the index is, in fact, useless, given the nature of time, let's just delete it.
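For reference, deleting the index boils down to a single statement (a sketch, assuming the wp_ table prefix used above):

ALTER TABLE wp_cavalcade_jobs DROP INDEX nextrun;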
If, on the other hand, there is something to gain by having the index—I'm looking forward to learning what that would look like—I suggest to ignore the index for queries that pose problematic. This can be done by enhancing
Job::get_jobs_by_query
to addIGNORE INDEX (nextrun)
to the query for a non-integer timestamp, that is, the'past'
string, or a range of timestamps.Granted, there is a third option, which is to regularly truncate (or clean up) the jobs table. Oftentimes, however, this is not possible or desired.
Data
Here is some real-world data, underlining the above suggestion of ignoring/deleting the index. For all queries, I ran RESET QUERY CACHE; right before them, to not have differing results due to cached data.
Given a jobs table with almost 1 million rows (960,764), MySQL will execute the above query as follows:
As stated, this will search a total of 480,467 rows, which is about half the size of the whole table!
Execution, on the project's staging server, took 4.33114400 seconds.
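The measurement workflow was roughly the following; this is a sketch of standard MySQL profiling statements, and note that the query cache (and with it RESET QUERY CACHE) no longer exists as of MySQL 8.0:

RESET QUERY CACHE; -- avoid results being served from the query cache
SET profiling = 1; -- enable per-session query profiling
-- ... run the query under test here ...
SHOW PROFILES; -- list the profiled queries and their IDs
SHOW PROFILE FOR QUERY 1; -- detailed timing breakdown for query #1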
This is the detailed profile (using SET profiling = 1 and SHOW PROFILE FOR QUERY #):

The exact same query, now using IGNORE INDEX (nextrun), will be executed like so:

This is using filesort, and/but will only search a total of 448 rows! In this example, this is even less than 1% of the previous query using the index (to no avail).
The profile looks as follows:
Executing the query, overall, takes 0.00305075 seconds, or 3 milliseconds, which is less than 0.1% of the previous query execution time.
Thoughts?
I'm happy to hear anyone's thoughts on the current state, the problem that I explained, as well as any kind of possible solution! Especially interested to hear @peterwilsoncc's reasons for introducing the nextrun index. 🙂