Open tractorcow opened 8 years ago
@tractorcow Is this an API change? It's currently assigned against 4.0.0 stable
It will result in API breakages as many parts of the ORM rely on distinct to enforce unique rows. e.g. left join on one_to_many relations.
@tractorcow OK moving this out of the 4.0 milestone, doesn't seem realistic at this point.
In 4.0 we have made a lot of effort in reducing the need for distinct. For instance,
public function applyRelation($relation, $linearOnly = false)
$linearOnly = true will ensure that distinct isn't necessary.
In 5.x I suggest switching distinct to false by default and seeing if we can live with that as the new status quo.
DataQuery::initialiseQuery() calls setDistinct(true), which should be off instead.
As a SRE/DevOps I am in very much in favour of removing as many DISTINCT
s as possible since they often (always?) causes temporary tables in at least MySQL:
Example from a DESCRIBE of a random query with distinct:
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+
| 1 | SIMPLE | BlogPost_Live | range | PRIMARY,PublishDate | PublishDate | 6 | NULL | 28702 | Using index condition; Using where; Using temporary |
| 1 | SIMPLE | SiteTree_Live | eq_ref | PRIMARY,ClassName | PRIMARY | 4 | SS_mysite.BlogPost_Live.ID | 1 | Using where |
| 1 | SIMPLE | Page_Live | eq_ref | PRIMARY | PRIMARY | 4 | SS_mysite.BlogPost_Live.ID | 1 | NULL |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+
Without distinct:
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+
| 1 | SIMPLE | BlogPost_Live | range | PRIMARY,PublishDate | PublishDate | 6 | NULL | 28702 | Using index condition; Using where |
| 1 | SIMPLE | SiteTree_Live | eq_ref | PRIMARY,ClassName | PRIMARY | 4 | SS_mysite.BlogPost_Live.ID | 1 | Using where |
| 1 | SIMPLE | Page_Live | eq_ref | PRIMARY | PRIMARY | 4 | SS_mysite.BlogPost_Live.ID | 1 | NULL |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+
Notice how the top row loses the Using temporary
in the Extra
column
From the MySQL docs:
Using temporary
To resolve the query, MySQL needs to create a temporary table to hold the result. This typically > happens if the query contains GROUP BY and ORDER BY clauses that list columns differently. Source: http://ftp.nchu.edu.tw/MySQL/doc/refman/5.0/en/using-explain.html
(sorry about the close/repoen of the issue, it was an tab-enter mistake)
DISTINCT could be removed from queries that don't add 1-to-many joins, notably those that don't use the dot-syntax in filters/sorts that reference has_many or many_many relations.
A lot of queries would be covered by this restriction, so it would be a useful optimisation.
Additionally, we could replace DISTINCT by GROUP BY [all, the fields] if we wanted, but would that actually help?
Finally, the filters/sorts on related data could potentially be refactored into subqueries, but again, I don't know if a subquery would have any benefit there.
If @stojg or others have any views on those 2nd questions, please post :-)
I don't think a group by would help.. the following is sort of how it works according to MySQL
Evaluation of statements that contain an ORDER BY clause and a different GROUP BY clause, or for which the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.
Evaluation of DISTINCT combined with ORDER BY may require a temporary table.
https://dev.mysql.com/doc/refman/5.6/en/internal-temporary-tables.html
Subqueries could possibly work, but as per above documentation.. so it is a rather tricky thing
Tables created for subquery or semi-join materialization (see Section 8.2.2, “Optimizing Subqueries, Derived Tables, and Views”).
@sminnee you can see in DataQuery::applyRelation() I have added a flag $linearOnly which causes an error in case a join is added that would require a distinct.
You could possibly refactor this logic to conditionally add DISTINCT
to the query as needed, so it would silently self-optimise if you only ever added linear joins, but would add distinct when joining on many_many tables.
At the moment you can ONLY sort on linear relations. It's just a matter of applying the same logic to conditions as well. ;)
All we really need is the free performance when we know it's safe.
At the moment you can ONLY sort on linear relations.
That may mean that this wouldn't work:
Group::get()->sort('Member.Count()')
Moving from DISTINCT
to GROUP BY 1,2,3,4,5...
might help with that.
My recommendation for this, as a minor release change:
enableGroupedDistinct()
that adds a clause along the lines of GROUP BY 1,2,3,4,5...
. This will be preferred over the DISTINCT clauseapplyRelation()
, ensure that enableGroupDistinct()
is called.The specific queries called aren't part if our public API, only the data returned by them. So I don't think this will be a breaking change. We'd need to test that, of course.
I feel like a method that adds a join and a group-by in a single atomic action would be a great idea. An aggregate API (such as Member.Count()
) could piggy-back off this grouped join. Something like ->joinAggregation()
? Join on table, add an aggregate column, group by other selected columns. :P
Ran into this issue while I was trying to work out why some queries were taking a while.
If I hack the core to turn off DISTINCT by default it makes a big different, queries I tested (which can be simple Page::get()->filter(on single item)->limit(x)
) types) are 7-8 times faster without the Distinct.
I can't seem to work out a nice way to turn this off per query - looks like an all or nothing approach via an extension.
Is this still being pursued?
I'd rather address and resolve all instances where using distinct is absolutely necessary, and resolve those. Using distinct on all queries to solve poorly formed queries is lazy.
Possible implementation (from Sam)
My recommendation for this, as a minor release change:
enableGroupedDistinct()
that adds a clause along the lines ofGROUP BY 1,2,3,4,5...
. This will be preferred over the DISTINCT clauseapplyRelation()
, ensure thatenableGroupDistinct()
is called.The specific queries called aren't part if our public API, only the data returned by them. So I don't think this will be a breaking change. We'd need to test that, of course.