Lister / InputfieldSelector "Does Not Have" selectors built incorrectly - very slow

adrianbj commented 2 years ago

Short description of the issue

When building a Does Not Have selector via Lister, it creates the selector like: actions!=15517 but in my example this results in a 21 second runtime for the query.

Expected behavior

The selector should be built to be like !actions=15517 which only takes 785ms.

Optional: Screenshots/Links that demonstrate the issue

Optional: Suggestion for a possible fix

Honestly I haven't really ever understood when it's appropriate to use !field=15517 vs field!=. I have noticed that sometimes the latter doesn't work so I end up using the former, but maybe it should always be used because in some cases it clearly generates a much more efficient SQL query?

Would it be possible (make sense) to automatically convert them, or is there an intentional difference?

Actually, I might have answered my last question - there is a difference (23984 - 23425 = 559) in the returned number of results - I'm not sure which is actually correct at the moment, but it explains why I have switched from one to the other before when I haven't got results I am expecting.

adrianbj commented 2 years ago

This is possibly completely unrelated, but I was looking at figuring out the difference between the pages returned for one version vs the other and I ran the same query 4 times and got 4 different results:

Not sure if it's specifically related to the sort=-created not working properly or if it's an inconsistency in the one of the two versions of the != selector.

BernhardBaumrock commented 2 years ago

Thx adrian for always having a closer look on such issues! I'm always too lazy and think it's more likely that I'm doing something wrong than that I've found a bug... So I simply refactor and that's it :)

adrianbj commented 2 years ago

Haha - I usually take your approach, which works when you're writing the selector by hand, but when you're building it via the InputfieldSelector, you're bound by what it generates, which is this case is broken.

The docs do kind of mention the difference between the two operators in the "Negative operators"section of https://processwire.com/docs/selectors/ but that example is with text fields - I don't think there are any examples of when to use which for page reference fields. I think likely it should always be !field=xxxx but I have a feeling I have code littered with field!=xxxx, possibly returning incorrect results and taking way longer than it should.

ryancramerdesign commented 2 years ago

@adrianbj Using the != is what you want. The purpose of the ! before the field is so that you can negate an operator which has no opposite. This is for specific cases and intended for text operators like *=, ~=, etc. as it's a feature built into PW's engine for performing fulltext index queries on text. As far as I know !field= is not implemented outside of cases like that, though maybe there are exceptions. But I am thinking that your !actions=15517 is probably returning something other than what you want, whereas actions!=15517 should be returning what you want. You also want email!="" rather than !email="" since you are not using an operator that lacks an opposite.

I don't know why it's slow in your case though. I'm trying to duplicate that here, and it is fast, though I'm matching a smaller set of pages rather than the 24k like you are, so scale is likely part of it. Though 21 seconds sounds like there's a bottleneck somewhere. Because you've got a "limit=" so that should still be fast. 21 seconds sounds like it's loading all those pages or something.

Here's what I'm testing with:

$timer = Debug::timer();
$items = $pages->find("template=sites-item, sites_categories!=1414, include=all, limit=25, sort=-created");
echo "Found " . $items->getTotal() . " items in " . Debug::timer($timer);

Result:

Found 989 items in 0.0248

After converting to the != operators, start removing stuff from your selector to see what might be making it slow. Leave the limit, but remove the sort, test, remove the email, test, remove the actions, test, etc. I'd be curious what you find. Thanks.

adrianbj commented 2 years ago

Thanks @ryancramerdesign.

Lister that is generating the selector, including the !email="" version, and adding the limit. I am not sure that it's loading them all. I think the slowdown is because the ineffciency of the SQL that is being created. Note that in this test today it was 48 seconds.

Additionally, in Lister this becomes even more of a problem if you add extra selectors while this is still processing - you end up with additional AJAX spinners and it can take literally minutes to complete the search.

Take a look at the SQL that is being created completely by Lister (InputfieldSelector)

Compare that SQL to what gets generated with the !action=xxxxx version:

ryancramerdesign commented 2 years ago

@adrianbj It doesn't matter much how fast the second one is because it's not performing the same match, so that query probably isn't a good comparison. Your first query (the slow one) has some strange duplication that I don't see in my tests. While that redundancy shouldn't matter to MySQL, it is strange so I'm still curious if you can identify the source of the bottleneck? This can be identified by removing stuff till the slowdown disappears. We can't really tell just by looking at the queries because how fast/slow something is has more to do with what indexes (or lack of them) come into play. My test query (which is fast) looks the same as yours, but without the redundant WHERE statements, so maybe there's something to that, but I'd be curious where they are coming from.

adrianbj commented 2 years ago

@ryancramerdesign - I do understand it's not the same match and by checking the diff between the two, it does appear that in this case, I do want actions!=, so I suppose the title of this issue is incorrect - apologies for that.

I am curious why Lister is generating !email= though. Is that intentional?

As for the speed of the actions!= query - I'll try to figure out what part of the query is causing the slowdown.

adrianbj commented 2 years ago

Ok, this is weird - it depends on which action I am trying to NOT HAVE.

With one of the selections it's as quick as 1.6s and returns 37628 results.

But some of the other actions it can take up to 46s and while returning 41299 results.

It doesn't seem to me that the difference between 37628 and 41299 should have any impact on the performance of this. I have even tested the generates SQL directly and just changing the page ID of the action in the SQL makes all the difference, so this excludes any possible weirdness in Lister / InputfieldSelector.

Any thoughts on what this could possibly be?

netcarver commented 2 years ago

@adrianbj If you run your queries again but prepended with "explain " do you see any difference in the number of rows scanned in the output or if your slow query is not using any indexes or something?

adrianbj commented 2 years ago

I haven't narrowed down the reason for the difference for the different actions, but I think there are still some big performance gains to be had. Compare these two queries. Note how much quicker the second one is (6s vs 19s) - all I did was append _1 to field_actions in the JOIN and the subquery.

I'll take a look at EXPLAIN in a minute - thanks @netcarver

adrianbj commented 2 years ago

Very interesting results from EXPLAIN actually - definitely some index differences - the one on the right is the "faster" action.

adrianbj commented 2 years ago

This explains that "ref" is faster than "range" https://stackoverflow.com/questions/33916043/mysql-sql-plan-what-type-range-meaning-in-this-sql

but why has it changed? Could it be that one of the page IDs is longer than the other? eg: 4039 (faster) vs 11438 (slower)?

PS: MySQL Server: 5.5.5-10.3.31-MariaDB-0+deb10u1-log MySQL Client: mysqlnd 5.0.12-dev - 20150407 - $Id:

netcarver commented 2 years ago

And if you prepend with "explain analyze ..." I think it should show you the actual run timings of each part of your queries.

netcarver commented 2 years ago

Hmm, in the initial query you posted, there seem to be repeated clauses in the SQL as well - though I'd hope the query optimizer would dedupe these... Selection_003

adrianbj commented 2 years ago

Unfortunately the ANALYZE option isn't available on that version of MySQL.

There are a few articles mentioning this issue of using range instead of ref

https://www.eversql.com/mysql-is-not-using-my-index-join-with-a-range-condition/

https://dba.stackexchange.com/questions/244095/mysql-5-6-users-index-instead-of-range-type-select-on-a-big-table-and-simple-que

Interestingly, for the version above where I appended the _1 to field_action it ends up using index, rather than ref or range but then the possible_keys ends up as NULL.

Regarding that duplication - I don't seem to be getting that now at least - not sure why that appeared then.

adrianbj commented 2 years ago

@ryancramerdesign and @netcarver - I've made some significant progress.

I added USE INDEX(data) to each of the field_actions SELECT statements and now the query is less than a second.

Can you see any reason why PW shouldn't be using this?

Note that FORCE INDEX is deprecated, so USE INDEX seems to be the currently recommended method.

netcarver commented 2 years ago

@adrianbj Nice speed-up. I PM'd you an idea about trying a covering index as well, did you manage to try that out by any chance?

adrianbj commented 2 years ago

@netcarver - I just went to test you covering index idea and while cloning the site and DB to test, I found that things were plenty fast enough with the cloned version, so I looked at the indexes for the field_actions table and found these differences. Ignore the difference from 36998 to 37004 - this is a very active site and user interactions resulted in that change while I was playing around with things.

The key thing is obviously those cardinality differences. Seems like there were some issues with the indexes, but I have no idea what would cause that.

Original

Cloned

Anyway, I ran a table OPTIMIZE and it fixed those indexes and things are speedy again and adding USE INDEX(data) no longer makes any difference.

That said, I am still seeing a fairly significant improvement (1100ms to 800ms) with the _1 suffix that I mentioned above, so I think that would be a worthwhile addition. @ryancramerdesign - do you think this would be worth adding?

adrianbj commented 2 years ago

Sorry, and back to the covering index - I added it and it went from the 800ms (after adding the _1) down to 560ms, so between those two changes, that has halved the time. Is that something else worth adding Ryan?

At @netcarver's suggestion I ran this:

ALTER TABLE `pages` ADD INDEX `idx_pages_cover` (`id`, `parent_id`, `templates_id`, `created`);

netcarver commented 2 years ago

@adrianbj Unfortunately that covering index was made specifically for your posted SQL query, it wouldn't help other queries unless they also happen to use (potentially only) those columns. If many generated PW queries do happen to use them, it might be worth adding the index - but each index brings in additional overhead too.

That being said, it might be possible to make something for PW that looks for index issues or suggests covering indexes from generated queries.

adrianbj commented 2 years ago

Hi @netcarver - note that I only add the covering index for pages. I didn't add the other one you suggested, because field_actions is a page reference field, and therefore an integer. It looks to me like that index is generic enough to be useful for many queries, or am I missing something?

ryancramerdesign commented 2 years ago

That said, I am still seeing a fairly significant improvement (1100ms to 800ms) with the _1 suffix that I mentioned above, so I think that would be a worthwhile addition.

Maybe so, but I'd like to understand why this is providing that performance improvement, I'm not totally clear about that yet... if it's something related to your particular situation or something or some underlying issue with the query. Maybe it's having some conflict with the field_actions in the subquery? Do you find the same performance improvement if you instead rename the field_actions to field_actions_1 in the subquery(ies) rather than the left join?

Is that something else worth adding Ryan?

I'm not sure in this case, I don't usually think of creating indexes that start with the primary key, but I'm not an expert on DB indexes. @netcarver Can you tell me more about this particular index? Might it be just as effective with just (parent_id, templates_id, created) or is the id needed first because of the group by pages.id?

netcarver commented 2 years ago

@ryancramerdesign I don't know enough to answer this with any authority - but: a covering index holds all the fields referenced in the query as nodes in a single B-Tree index, if all the stored values are simple types, this allows all the values for the entire query to be answered just from the index without needing to access the underlying row storage. If the index is cached in memory, you'd not need to make system calls to page in from slower storage as well. Having just the one index means multiple indexes don't need to be cross-referenced. Further, as the index is sorted, and if the index sort order matches the sort order of the query, you can get another win as the results don't need to be further ordered.

However, there will be some implementation diffs between MyISAM and InnoDB (last time I looked at the way they keep their indexes.) In InnoDB will already include a reference to the primary key in the secondary index - so you are right, if the query is using InnoDB, you could probably totally drop the id column from it because the engine is going to include it anyway.

That's pushing the limits of my knowledge on MySQL and it's based on old information and aging grey matter.

ryancramerdesign commented 2 years ago

@netcarver Thanks. Do you think it makes sense for ProcessWire to have such a covering index on this table, or better to leave it for people to add their own for the cases where they would benefit from it?

netcarver commented 2 years ago

@ryancramerdesign Sorry for the delayed reply - probably best to benchmark something and use that to decide.

processwire / processwire-issues