tomjaguarpaw / haskell-opaleye

Other
599 stars 115 forks source link

Reliance on undefined behavior of ORDER BY #538

Open bitc opened 2 years ago

bitc commented 2 years ago

This is a copy of the issue that I opened against the rel8 repo here: https://github.com/circuithub/rel8/issues/151


Hello, this library is really cool. The best feature is that it can return trees of data instead of just tables.

But I noticed that the SQL it generates relies on undefined (or unspecified?) behavior of PostgreSQL, with regards to ORDER BY and LIMIT and ARRAY_AGG.

Here is an example query with ORDER BY and LIMIT:

SELECT name
FROM author
ORDER BY name
LIMIT 5

If this is changed to:

SELECT name
FROM
    (SELECT *
    FROM author
    ORDER BY name
    ) T1
LIMIT 5

then this is no longer the same query. Even though the inner SELECT has an ORDER BY clause, once it is selected from by the outer select, it is treated as an unordered relation, and the rows may result in any order. So the LIMIT clause will choose an arbitrary 5 rows. In practice, it seems that PostgreSQL will always returned the "expected" results for this query, but it is not guaranteed. The general rule of thumb is that LIMIT clause only makes sense when attached directly to an ORDER BY clause

And now for ARRAY_AGG there is a similar issue.

Here is an example query:

SELECT ARRAY_AGG(name)
FROM
    (SELECT *
    FROM author
    ORDER BY name
    ) T1;

This query doesn't do what we want. PostgreSQL aggregate functions aren't influenced by the ORDER BY clause that we have here, and so the result can be that the elements will be in any arbitrary order. The correct version of this query is this:

SELECT ARRAY_AGG(name ORDER BY name)
FROM
    (SELECT *
    FROM author
    ) T1;

If we want a LIMIT clause then we actually need two ORDER BY clauses like this:

SELECT ARRAY_AGG(name ORDER BY name)
FROM
    (SELECT *
    FROM author
    ORDER BY name
    LIMIT 5
    ) T1;

Also in these examples, even with the "incorrect" ARRAY_AGG query, it seems that in practice PostgreSQL will always return the "expected" result, but again this is not guaranteed.

I am not sure when these types of "incorrect" queries are likely to give incorrect results in practice (it may be never), but a good guess is when the queries become very complicated (go beyond the from_collapse_limit) or involve parallel scans.

tomjaguarpaw commented 2 years ago

Interesting, thank you for reporting this. So is it true that attaching ORDER BY to a query is redundant unless LIMIT or OFFSET is also provided? In which case I wonder why the standard allows it!

bitc commented 2 years ago

My understanding is that ORDER BY without LIMIT or OFFSET makes sense only:

But there may be other scenarios I'm not aware of. In any case, I would be happy to hear more about all this from a real SQL/PostgreSQL expert.

bitc commented 2 years ago

[...] In which case I wonder why the standard allows it!

After a bit of research, I've discovered that at least MS SQL Server does not allow it, and gives the following error:

The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries,
and common table expressions, unless TOP, OFFSET or FOR XML is also specified.

(TOP is the SQL Server equivalent of LIMIT)

Here is the sqlfiddle I used: http://sqlfiddle.com/#!18/a7540/47886


Finding official documentation about the issues I have raised is tricky, but I've found some stack overflow answers :) like this one:

https://dba.stackexchange.com/questions/82930/database-implementations-of-order-by-in-a-subquery/83500#83500

And in Oracle and PostGres, just because the syntax is supported, does not mean it is obeyed. And just because you observe it as being obeyed in some scenario, does not mean that it will continue to be obeyed as new versions come out or with subtle changes to your data, statistics, the query itself, or the environment.

I can assure you that, without a doubt, if you want a guarantee about order, you need to put the ORDER BY on the outermost query. This should be a doctrine you hold close no matter what platform you're using.

And some more discussion here: https://dba.stackexchange.com/questions/184149/is-it-really-possible-that-the-order-will-not-be-guaranteed-for-this-particular

And also wikipedia, from https://en.wikipedia.org/wiki/Order_by (which sadly does not supply any official references for this claim):

Although some database systems allow the specification of an ORDER BY clause in subselects or view definitions, the presence there has no effect. A view is a logical relational table, and the relational model mandates that a table is a set of rows, implying no sort order whatsoever.

tomjaguarpaw commented 2 years ago

That's very interesting. Thank you for digging further. If someone can come up with a reproducible test case then I will fix this. On the other hand, without a reproducible test case I fear that any such "bug fix" is liable to get undone by accident.