tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
478 stars 173 forks source link

'arrange' verb before a 'group_by' + 'summarize' produces invalid SQL #622

Open OssiLehtinen opened 3 years ago

OssiLehtinen commented 3 years ago

If an arrange verb is used prior to a group_by + summarize, the 'ORDER BY' in the generated SQL is placed after the GROUP BY and an error is produced in the database.

The following example assume we have a tbl-pointer "tt" to a database table named "table" with columns "a" and "b"

tt %>% 
  arrange(a) %>% 
  group_by(b) %>% 
  summarize(max_a = max(a))

This produces an error (in this case in AWS Athena):

Error: SYNTAX_ERROR: line 4:10: '"a"' must be an aggregate expression or appear in GROUP BY clause

If I check the generated SQL, I get

SELECT "b", MAX("a") AS "max_a"
FROM "table"
GROUP BY "b"
ORDER BY "a"

which obviously does not work.

In my opinion, a better SQL-output for the above dplyr chain would be:

SELECT "b", MAX("a") AS "max_a"
FROM (SELECT a, b FROM "table" ORDER BY "a")
GROUP BY "b"

I want to emphasize, that the opreation itself is somewhat nonsensical, as ordering the rows prior to the aggregation doesn't change the result. However, in the example a formally correct (?) dplyr code produces broken SQL, and this could be an issue in some more useful situation also.

hadley commented 3 years ago

Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you!

You can see some tips for making reprexes for dbplyr at https://dbplyr.tidyverse.org/articles/reprex.html.

OssiLehtinen commented 3 years ago

Here is an example generated with reprex:

library(dplyr, warn.conflicts = F)
library(dbplyr, warn.conflicts = F)

mf <- memdb_frame(x = 1:5, y = 5:1)
mf %>% 
  arrange(x) %>% 
  group_by(y) %>% 
  summarize(max_x = max(x, na.rm=T)) %>% 
  remote_query()
#> <SQL> SELECT `y`, MAX(`x`) AS `max_x`
#> FROM `dbplyr_001`
#> GROUP BY `y`
#> ORDER BY `x`

Created on 2021-04-02 by the reprex package (v1.0.0)

The curious thing here is, that the memdb-backend doesn't mind the 'outside' 'order by' and the example query runs just fine. However, AWS Athena for example produces the error below. Making an Athena based reprex is difficult, however.

Error: SYNTAX_ERROR: line 4:10: '"x"' must be an aggregate expression or appear in GROUP BY clause

Interestingly, on another dated system with older dplyr and dbplyr the SQL looks what I would expect it to look like, and also what, for example, AWS Athena would be happy with:

library(dplyr, warn.conflicts = F)
library(dbplyr, warn.conflicts = F)

mf <- memdb_frame(x = 1:5, y = 5:1)
mf %>% 
  arrange(x) %>% 
  group_by(y) %>% 
  summarize(max_y = max(y, na.rm=T)) %>% 
  remote_query()
#> <SQL> SELECT `y`, MAX(`y`) AS `max_y`
#> FROM (SELECT *
#> FROM `dbplyr_001`
#> ORDER BY `x`)
#> GROUP BY `y`
hadley commented 3 years ago

(I edited your comment to delete the session info which doesn't provide useful information here, and otherwise makes it harder to see heart of the issue.)

Given that it works in SQLite and not in Athena it seems like this is going to be a database specific issue, and will require some consultation with the SQL spec. My copy of "SQL-99 complete, really" suggests that you can only sort on columns that appear in the result table (which makes sense) so it seems like the previous behaviour is more correct. I suspect this happened during our work to avoid ORDER BY in subqueries, possibly in https://github.com/tidyverse/dbplyr/commit/269cdda12c3d3a51b9799545f9819764f0c387c1#diff-c3647c3e0d307d506b1769d3ebc7ae302060e8582ee5e34477bae96213e38dcf.