hannes / duckdb-rfuns

R functions for duckdb
MIT License
8 stars 0 forks source link

Need non-aggregating versions of `sum()`, `min()`, and `max()` #90

Closed krlmlr closed 1 week ago

krlmlr commented 3 weeks ago

With duckdb v0.10.2:

duckdb <- asNamespace("duckdb")
drv <- duckdb::duckdb()
con <- DBI::dbConnect(drv)
experimental <- FALSE
invisible(duckdb$rapi_load_rfuns(drv@database_ref))
df1 <- data.frame(a = 1)

rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
  rel1,
  list(
    {
      tmp_expr <- duckdb$expr_reference("a")
      duckdb$expr_set_alias(tmp_expr, "a")
      tmp_expr
    },
    {
      tmp_expr <- duckdb$expr_function("r_base::max", list(duckdb$expr_reference("a")))
      duckdb$expr_set_alias(tmp_expr, "b")
      tmp_expr
    }
  )
)
#> Error: {"exception_type":"Binder","exception_message":"Aggregates cannot be present in a Project relation!"}

Created on 2024-05-05 with reprex v2.1.0

romainfrancois commented 3 weeks ago

I don't understand what they should be doing then ?

krlmlr commented 3 weeks ago

The same as in dplyr, as usual 🙃

data.frame(a = 1:3) |>
  dplyr::mutate(b = sum(a))
#>   a b
#> 1 1 6
#> 2 2 6
#> 3 3 6

Created on 2024-05-11 with reprex v2.1.0

krlmlr commented 3 weeks ago

Since this error happens at the binding stage, it's not critical for correctness, but might be for performance.

romainfrancois commented 2 weeks ago

I see, so do the aggregate and recycle the value. I'll look into it.

krlmlr commented 2 weeks ago

Thanks, Romain. #89 and the associated PR look more important at this stage.

romainfrancois commented 1 week ago

Having looked at how duckplyr handles e.g. the regular sum, this feels more like a translation problem and we need to squeeze in a duckdb$expr_window so that the duckdb expression gains a OVER ()

library(duckdbrfuns)

con <- duckdbrfuns:::duckdb_con()
duckdb <- asNamespace("duckdb")

experimental <- FALSE
df1 <- data.frame(a = 1:10)

rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
  rel1,
  list(
    {
      tmp_expr <- duckdb$expr_reference("a")
      duckdb$expr_set_alias(tmp_expr, "a")
      tmp_expr
    },
    {
      tmp_expr <- duckdb$expr_window(
        duckdb$expr_function("r_base::sum", list(duckdb$expr_reference("a")))
      )
      duckdb$expr_set_alias(tmp_expr, "b")
      tmp_expr
    }
  )
)
print(rel2)
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [a as a, r_base::sum(a) OVER () as b]
#>   r_dataframe_scan(0x10de98358)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)
#> - b (INTEGER)
duckdb$rel_to_altrep(rel2)
#>     a  b
#> 1   1 55
#> 2   2 55
#> 3   3 55
#> 4   4 55
#> 5   5 55
#> 6   6 55
#> 7   7 55
#> 8   8 55
#> 9   9 55
#> 10 10 55

Created on 2024-05-21 with reprex v2.1.0

IIUC duckplyr bases the need for wrapping in a expr_window based on an allow list:

https://github.com/romainfrancois/duckplyr/blob/7bc1ccb925ef742cf7805c9391fec19b1233123c/R/relational.R#L242-L252

        known_window <- c(
          # Window functions
          "rank", "rank_dense", "dense_rank", "percent_rank",
          "row_number", "first_value", "last_value", "nth_value",
          "cume_dist", "lead", "lag", "ntile",

          # Aggregates
          "sum", "mean", "stddev", "min", "max", "median",
          #
          NULL
        )

Perhaps instead of (in addition to?) that the r_base:: functions could mark that they are aggregates ? i.e. use r_base::aggregate::sum instead of r_base::sum so that then duckplyr can take advantage of that naming convention to squeeze in a expr_window.

krlmlr commented 1 week ago

Do we even need the "project" versions of sum() then? Does your reprex work with "r_base::aggregate::sum" instead of "r_base::sum" ?

Less code in the extension means less complexity. Sorry about the extra work.

romainfrancois commented 1 week ago

Yeah exactly, we only need the function that aggregate in the extension, as in https://github.com/hannes/duckdb-rfuns/pull/95

This needs then a little hand holding in duckplyr: https://github.com/duckdblabs/duckplyr/pull/171

romainfrancois commented 1 week ago

Perhaps we don't even need the naming convention, and we can somehow interrogate the extension to find out if a function is a ScalarFunctionSet or a AggregateFunctionset but I'm not sure it's worth it.

romainfrancois commented 1 week ago

Indeed the reprex needs r_base::aggregate::sum. Here's a simplified as we don't need the alias:

library(duckdbrfuns)

con <- duckdbrfuns:::duckdb_con()
duckdb <- asNamespace("duckdb")

experimental <- FALSE
df1 <- data.frame(a = 1:10)

rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
  rel1,
  list(
    duckdb$expr_reference("a"),
    duckdb$expr_window(
      duckdb$expr_function("r_base::aggregate::sum", list(duckdb$expr_reference("a")))
    )
  )
)
duckdb$rel_to_altrep(rel2)
#>     a r_base::aggregate::sum(a) OVER ()
#> 1   1                                55
#> 2   2                                55
#> 3   3                                55
#> 4   4                                55
#> 5   5                                55
#> 6   6                                55
#> 7   7                                55
#> 8   8                                55
#> 9   9                                55
#> 10 10                                55

Created on 2024-05-21 with reprex v2.1.0

krlmlr commented 1 week ago

Thanks, confirmed with most recent duckdb, without having to change the names:

library(duckdb)
#> Loading required package: DBI

drv <- duckdb()
con <- dbConnect(drv)
duckdb <- asNamespace("duckdb")

duckdb$rapi_load_rfuns(drv@database_ref)

experimental <- FALSE
df1 <- data.frame(a = 1:3)

rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
  rel1,
  list(
    duckdb$expr_reference("a"),
    duckdb$expr_window(
      duckdb$expr_function("r_base::sum", list(duckdb$expr_reference("a")))
    )
  )
)
duckdb$rel_to_altrep(rel2)
#>   a r_base::sum(a) OVER ()
#> 1 1                      6
#> 2 2                      6
#> 3 3                      6

Created on 2024-05-22 with reprex v2.1.0

Looks like this issue is moot. I need to trace it back to the original duckplyr problem.