tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.75k stars 2.12k forks source link

dbplyr groupwise mutate does not accept min/max/mean/etc... window functions #2960

Closed ericgtaylor closed 7 years ago

ericgtaylor commented 7 years ago

Sticking my neck out here so please be kind if this is naive or inappropriate for this forum. :)

In dbplyr I noticed that group_by() %>% mutate() statements do not allow basic aggregating window functions, such as min/max/mean/median. An example is below. I know that I can use these window functions in summarize() and then join to achieve the same effect, but TBH group_by() %>% mutate() is one of my favorite use cases for dplyr and would be anxious to have this feature in a future version. That's it! Thanks for listening.


> con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
> copy_to(con, mtcars)
> 
> mtcars2 <- tbl(con, "mtcars")
> mtcars2 %>%
+     select(mpg, cyl) %>%
+     group_by(cyl) %>%
+     mutate(mpg_min = min(mpg))
Error: Window function `min()` is not supported by this database
JohnMount commented 7 years ago

I see the same issue, and think your code is correct.

The following is just a work-around, but here is a new user defined verb. This new user defined verb groups and summarizes data and then joins the result back into the original data frame. I like it a bit more than the group_by %>% mutate pattern as I think it better documents intent to new users. I've tested it on in-memory data frames and on a SQLite data frame.

suppressPackageStartupMessages(library("dplyr"))

#' Simulate the group_by/mutate pattern with an explicit summarize and join.
#' 
#' Group a data frame by the groupingVars and compute user summaries on 
#' this data frame (user summaries specified in ...), then join these new
#' columns back into the original data and return to the user.
#' This works around https://github.com/tidyverse/dplyr/issues/2960 .
#' And it is a demonstration of a higher-order dplyr verb.
#' Author: John Mount, Win-Vector LLC.
#' 
#' @param d data.frame
#' @param groupingVars character vector of column names to group by.
#' @param ... list of dplyr::mutate() expressions.
#' @value d with grouped summaries added as extra columns
#' 
#' @examples
#' 
#' add_group_summaries(mtcars, 
#'                     c("cyl", "gear"), 
#'                     group_mean_mpg = mean(mpg), 
#'                     group_mean_disp = mean(disp)) %>%
#'   head()
#' 
#' @export
#' 
add_group_summaries <- function(d, groupingVars, ...) {
  # convert char vector into quosure vector
  # worked out after reading http://dplyr.tidyverse.org/articles/programming.html
  # no idea if this will be stable across rlang/tidyeval versions
  groupingQuos <- lapply(groupingVars, 
                         function(si) { quo(!!as.name(si)) })
  # print(groupingQuos)
  dg <- group_by(d, !!!groupingQuos)
  ds <- summarize(dg, ...)
  # Work around: https://github.com/tidyverse/dplyr/issues/2963
  ds <- ungroup(ds)
  left_join(d, ds, by= groupingVars)
}

mtcars %>% 
  add_group_summaries(c("cyl", "gear"), 
                      group_mean_mpg = mean(mpg), 
                      group_mean_disp = mean(disp)) %>%
  head()

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

mtcars2 %>% 
  add_group_summaries(c("cyl", "gear"), 
                      group_mean_mpg = mean(mpg), 
                      group_mean_disp = mean(disp)) %>%
  head()
hadley commented 7 years ago

The error message seems pretty clear to me - your database doesn't support window functions. I don't think there's much dbplyr can do about this so you can either switch to a more powerful database or use John's work around.

ericgtaylor commented 7 years ago

So the db actually does support the window function (see below), but the group_by() %>% mutate() workflow requires a subquery and a join, which seemed easy enough to program into dplyr. I was curious to know if that omission was intentional. Of course it's also quite easy to write a UDF to take care of that apparently... fair enough. :)

> copy_to(con, mtcars)
> mtcars2 <- tbl(con, "mtcars")
> mtcars2 %>%
+     select(mpg, cyl) %>%
+     group_by(cyl) %>%
+     summarise(mpg_min = min(mpg))
# Source:   lazy query [?? x 2]
# Database: sqlite 3.19.3 [:memory:]
    cyl mpg_min
  <dbl>   <dbl>
1     4    21.4
2     6    17.8
3     8    10.4
JohnMount commented 7 years ago

@ericgtaylor , I figured from your first note you already knew how to write the join-style queries (and still figure you do). And I hope I did not come of as presumptuous writing a wrapper function (the novel bit of that being what code is required to re-encode the column names for both dplyr calls). Obviously if you are writing an actual analysis you don't need my function as it is easy to type the column controls directly into the group_by and the join steps when you write your pipeline.

I don't think "switch to a more powerful database" is what anyone wants to hear in this context. And the error message I think would have resolved this quicker would be more like "Error: OVER (PARTITION BY ...) syntax is not supported by this database."

But some of the details become clear if we use dplyr::show_query() to see the SQL dplyr would emit to the PostgreSQL database. The mutate() example uses the "OVER (PARTITION BY ...)" syntax, which is what I presume SQLite does not support (I tried pasting it into a DBI::dbGetQuery() and it reports a syntax error).

And your point is in spirit (though not technically) correct: SQLite does have an aggregation function named MIN(), and we can see it being used in the summarize example. But as we have seen this isn't the SQL code dplyr uses in the mutate example (where it uses the window or partitioned version of "min").

suppressPackageStartupMessages(library("dplyr"))
library('RPostgreSQL')
#> Loading required package: DBI

# Local Postgres.app database; no password by default
# Of course, you fill in your own database information here.
con = DBI::dbConnect(DBI::dbDriver("PostgreSQL"),
                     host = 'localhost',
                     port = 5432,
                     user = 'postgres',
                     password = 'pg')

copy_to(con, 
        mtcars %>% mutate(id = row_number()), 
        'mtcars')
mtcars2 <- tbl(con, "mtcars")

mtcars2 %>%
  select(id, mpg, cyl) %>%
  group_by(cyl) %>%
  mutate(mpg_min = min(mpg)) %>% 
  show_query()
#> 
#> SELECT "id", "mpg", "cyl", min("mpg") OVER (PARTITION BY "cyl") AS "mpg_min"
#> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl"
#> FROM "mtcars") "kkztfjtpuw"

mtcars2 %>%
  select(id, mpg, cyl) %>%
  group_by(cyl) %>%
  summarize(mpg_min = min(mpg)) %>% 
  show_query()
#> 
#> SELECT "cyl", MIN("mpg") AS "mpg_min"
#> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl"
#> FROM "mtcars") "pdtcvcoqju"
#> GROUP BY "cyl"

This leaves open the question: is group_by() %>% mutate() a standard dplyr idiom? I thought that it was. But it seems it is not with RSQLite.

hadley commented 7 years ago

@ericgtaylor that's an aggregate function, not a window function.

In general, it's not easy to translate such statements to joins because they can include both group and row values:

mtcars %>%
  group_by(cyl) %>%
  transmute(mpg_z = (mpg - mean(mpg)) / sd(mpg))
ericgtaylor commented 7 years ago

Got it. Hadley's last response makes sense. Thanks.

I also like John's suggestion for the alternate OVER (PARTITION BY ...) error. Wasn't familiar with the window function terminology, and the first google search result for "SQL window function" refers to aggregates as a sub-type of window functions. The second result does not. Did you guys know the internet has conflicting information?! haha...