"SELECT DISTINCT" on all queries is a performance sucker

tractorcow commented 8 years ago

I'd rather address and resolve all instances where using distinct is absolutely necessary, and resolve those. Using distinct on all queries to solve poorly formed queries is lazy.

Possible implementation (from Sam)

My recommendation for this, as a minor release change:

Set DISTINCT to false by default
Add a helper method enableGroupedDistinct() that adds a clause along the lines of GROUP BY 1,2,3,4,5.... This will be preferred over the DISTINCT clause
Whenever a non-linear relation is added via applyRelation(), ensure that enableGroupDistinct() is called.

The specific queries called aren't part if our public API, only the data returned by them. So I don't think this will be a breaking change. We'd need to test that, of course.

chillu commented 7 years ago

@tractorcow Is this an API change? It's currently assigned against 4.0.0 stable

tractorcow commented 7 years ago

It will result in API breakages as many parts of the ORM rely on distinct to enforce unique rows. e.g. left join on one_to_many relations.

chillu commented 7 years ago

@tractorcow OK moving this out of the 4.0 milestone, doesn't seem realistic at this point.

tractorcow commented 6 years ago

In 4.0 we have made a lot of effort in reducing the need for distinct. For instance,

public function applyRelation($relation, $linearOnly = false)

$linearOnly = true will ensure that distinct isn't necessary.

In 5.x I suggest switching distinct to false by default and seeing if we can live with that as the new status quo.

DataQuery::initialiseQuery() calls setDistinct(true), which should be off instead.

stojg commented 5 years ago

As a SRE/DevOps I am in very much in favour of removing as many DISTINCTs as possible since they often (always?) causes temporary tables in at least MySQL:

Example from a DESCRIBE of a random query with distinct:

+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+
| id | select_type | table         | type   | possible_keys       | key         | key_len | ref                        | rows  | Extra                                               |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+
|  1 | SIMPLE      | BlogPost_Live | range  | PRIMARY,PublishDate | PublishDate | 6       | NULL                       | 28702 | Using index condition; Using where; Using temporary |
|  1 | SIMPLE      | SiteTree_Live | eq_ref | PRIMARY,ClassName   | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | Using where                                         |
|  1 | SIMPLE      | Page_Live     | eq_ref | PRIMARY             | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | NULL                                                |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+

Without distinct:

+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+
| id | select_type | table         | type   | possible_keys       | key         | key_len | ref                        | rows  | Extra                              |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+
|  1 | SIMPLE      | BlogPost_Live | range  | PRIMARY,PublishDate | PublishDate | 6       | NULL                       | 28702 | Using index condition; Using where |
|  1 | SIMPLE      | SiteTree_Live | eq_ref | PRIMARY,ClassName   | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | Using where                        |
|  1 | SIMPLE      | Page_Live     | eq_ref | PRIMARY             | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | NULL                               |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+

Notice how the top row loses the Using temporary in the Extra column

From the MySQL docs:

Using temporary

To resolve the query, MySQL needs to create a temporary table to hold the result. This typically > happens if the query contains GROUP BY and ORDER BY clauses that list columns differently. Source: http://ftp.nchu.edu.tw/MySQL/doc/refman/5.0/en/using-explain.html

(sorry about the close/repoen of the issue, it was an tab-enter mistake)

sminnee commented 5 years ago

DISTINCT could be removed from queries that don't add 1-to-many joins, notably those that don't use the dot-syntax in filters/sorts that reference has_many or many_many relations.

A lot of queries would be covered by this restriction, so it would be a useful optimisation.

Additionally, we could replace DISTINCT by GROUP BY [all, the fields] if we wanted, but would that actually help?

Finally, the filters/sorts on related data could potentially be refactored into subqueries, but again, I don't know if a subquery would have any benefit there.

If @stojg or others have any views on those 2nd questions, please post :-)

stojg commented 5 years ago

I don't think a group by would help.. the following is sort of how it works according to MySQL

Evaluation of statements that contain an ORDER BY clause and a different GROUP BY clause, or for which the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.

Evaluation of DISTINCT combined with ORDER BY may require a temporary table.

https://dev.mysql.com/doc/refman/5.6/en/internal-temporary-tables.html

Subqueries could possibly work, but as per above documentation.. so it is a rather tricky thing

Tables created for subquery or semi-join materialization (see Section 8.2.2, “Optimizing Subqueries, Derived Tables, and Views”).

tractorcow commented 5 years ago

@sminnee you can see in DataQuery::applyRelation() I have added a flag $linearOnly which causes an error in case a join is added that would require a distinct.

You could possibly refactor this logic to conditionally add DISTINCT to the query as needed, so it would silently self-optimise if you only ever added linear joins, but would add distinct when joining on many_many tables.

At the moment you can ONLY sort on linear relations. It's just a matter of applying the same logic to conditions as well. ;)

All we really need is the free performance when we know it's safe.

sminnee commented 5 years ago

At the moment you can ONLY sort on linear relations.

That may mean that this wouldn't work:

Group::get()->sort('Member.Count()')

Moving from DISTINCT to GROUP BY 1,2,3,4,5... might help with that.

sminnee commented 5 years ago

My recommendation for this, as a minor release change:

Set DISTINCT to false by default
Add a helper method enableGroupedDistinct() that adds a clause along the lines of GROUP BY 1,2,3,4,5.... This will be preferred over the DISTINCT clause
Whenever a non-linear relation is added via applyRelation(), ensure that enableGroupDistinct() is called.

The specific queries called aren't part if our public API, only the data returned by them. So I don't think this will be a breaking change. We'd need to test that, of course.

tractorcow commented 5 years ago

I feel like a method that adds a join and a group-by in a single atomic action would be a great idea. An aggregate API (such as Member.Count()) could piggy-back off this grouped join. Something like ->joinAggregation()? Join on table, add an aggregate column, group by other selected columns. :P

nfauchelle commented 3 years ago

Ran into this issue while I was trying to work out why some queries were taking a while.

If I hack the core to turn off DISTINCT by default it makes a big different, queries I tested (which can be simple Page::get()->filter(on single item)->limit(x)) types) are 7-8 times faster without the Distinct.

I can't seem to work out a nice way to turn this off per query - looks like an all or nothing approach via an extension.

Is this still being pursued?

silverstripe / silverstripe-framework

"SELECT DISTINCT" on all queries is a performance sucker #4923

Possible implementation (from Sam)