Closed ericgtaylor closed 7 years ago
I see the same issue, and think your code is correct.
The following is just a work-around, but here is a new user defined verb. This new user defined verb groups and summarizes data and then joins the result back into the original data frame. I like it a bit more than the group_by %>% mutate
pattern as I think it better documents intent to new users. I've tested it on in-memory data frames and on a SQLite data frame.
suppressPackageStartupMessages(library("dplyr"))
#' Simulate the group_by/mutate pattern with an explicit summarize and join.
#'
#' Group a data frame by the groupingVars and compute user summaries on
#' this data frame (user summaries specified in ...), then join these new
#' columns back into the original data and return to the user.
#' This works around https://github.com/tidyverse/dplyr/issues/2960 .
#' And it is a demonstration of a higher-order dplyr verb.
#' Author: John Mount, Win-Vector LLC.
#'
#' @param d data.frame
#' @param groupingVars character vector of column names to group by.
#' @param ... list of dplyr::mutate() expressions.
#' @value d with grouped summaries added as extra columns
#'
#' @examples
#'
#' add_group_summaries(mtcars,
#' c("cyl", "gear"),
#' group_mean_mpg = mean(mpg),
#' group_mean_disp = mean(disp)) %>%
#' head()
#'
#' @export
#'
add_group_summaries <- function(d, groupingVars, ...) {
# convert char vector into quosure vector
# worked out after reading http://dplyr.tidyverse.org/articles/programming.html
# no idea if this will be stable across rlang/tidyeval versions
groupingQuos <- lapply(groupingVars,
function(si) { quo(!!as.name(si)) })
# print(groupingQuos)
dg <- group_by(d, !!!groupingQuos)
ds <- summarize(dg, ...)
# Work around: https://github.com/tidyverse/dplyr/issues/2963
ds <- ungroup(ds)
left_join(d, ds, by= groupingVars)
}
mtcars %>%
add_group_summaries(c("cyl", "gear"),
group_mean_mpg = mean(mpg),
group_mean_disp = mean(disp)) %>%
head()
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2 %>%
add_group_summaries(c("cyl", "gear"),
group_mean_mpg = mean(mpg),
group_mean_disp = mean(disp)) %>%
head()
The error message seems pretty clear to me - your database doesn't support window functions. I don't think there's much dbplyr can do about this so you can either switch to a more powerful database or use John's work around.
So the db actually does support the window function (see below), but the group_by() %>% mutate() workflow requires a subquery and a join, which seemed easy enough to program into dplyr
. I was curious to know if that omission was intentional. Of course it's also quite easy to write a UDF to take care of that apparently... fair enough. :)
> copy_to(con, mtcars)
> mtcars2 <- tbl(con, "mtcars")
> mtcars2 %>%
+ select(mpg, cyl) %>%
+ group_by(cyl) %>%
+ summarise(mpg_min = min(mpg))
# Source: lazy query [?? x 2]
# Database: sqlite 3.19.3 [:memory:]
cyl mpg_min
<dbl> <dbl>
1 4 21.4
2 6 17.8
3 8 10.4
@ericgtaylor , I figured from your first note you already knew how to write the join-style queries (and still figure you do). And I hope I did not come of as presumptuous writing a wrapper function (the novel bit of that being what code is required to re-encode the column names for both dplyr
calls). Obviously if you are writing an actual analysis you don't need my function as it is easy to type the column controls directly into the group_by
and the join
steps when you write your pipeline.
I don't think "switch to a more powerful database" is what anyone wants to hear in this context. And the error message I think would have resolved this quicker would be more like "Error: OVER (PARTITION BY ...) syntax is not supported by this database."
But some of the details become clear if we use dplyr::show_query()
to see the SQL
dplyr
would emit to the PostgreSQL
database. The mutate()
example uses the "OVER (PARTITION BY ...)" syntax, which is what I presume SQLite
does not support (I tried pasting it into a DBI::dbGetQuery()
and it reports a syntax error).
And your point is in spirit (though not technically) correct: SQLite
does have an aggregation function named MIN()
, and we can see it being used in the summarize
example. But as we have seen this isn't the SQL
code dplyr
uses in the mutate
example (where it uses the window or partitioned version of "min
").
suppressPackageStartupMessages(library("dplyr")) library('RPostgreSQL') #> Loading required package: DBI # Local Postgres.app database; no password by default # Of course, you fill in your own database information here. con = DBI::dbConnect(DBI::dbDriver("PostgreSQL"), host = 'localhost', port = 5432, user = 'postgres', password = 'pg') copy_to(con, mtcars %>% mutate(id = row_number()), 'mtcars') mtcars2 <- tbl(con, "mtcars") mtcars2 %>% select(id, mpg, cyl) %>% group_by(cyl) %>% mutate(mpg_min = min(mpg)) %>% show_query() #>#> SELECT "id", "mpg", "cyl", min("mpg") OVER (PARTITION BY "cyl") AS "mpg_min" #> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl" #> FROM "mtcars") "kkztfjtpuw" mtcars2 %>% select(id, mpg, cyl) %>% group_by(cyl) %>% summarize(mpg_min = min(mpg)) %>% show_query() #> #> SELECT "cyl", MIN("mpg") AS "mpg_min" #> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl" #> FROM "mtcars") "pdtcvcoqju" #> GROUP BY "cyl"
This leaves open the question: is group_by() %>% mutate()
a standard dplyr
idiom? I thought that it was. But it seems it is not with RSQLite
.
@ericgtaylor that's an aggregate function, not a window function.
In general, it's not easy to translate such statements to joins because they can include both group and row values:
mtcars %>%
group_by(cyl) %>%
transmute(mpg_z = (mpg - mean(mpg)) / sd(mpg))
Got it. Hadley's last response makes sense. Thanks.
I also like John's suggestion for the alternate OVER (PARTITION BY ...) error. Wasn't familiar with the window function terminology, and the first google search result for "SQL window function" refers to aggregates as a sub-type of window functions. The second result does not. Did you guys know the internet has conflicting information?! haha...
Sticking my neck out here so please be kind if this is naive or inappropriate for this forum. :)
In
dbplyr
I noticed that group_by() %>% mutate() statements do not allow basic aggregating window functions, such as min/max/mean/median. An example is below. I know that I can use these window functions in summarize() and then join to achieve the same effect, but TBH group_by() %>% mutate() is one of my favorite use cases fordplyr
and would be anxious to have this feature in a future version. That's it! Thanks for listening.