Closed krlmlr closed 1 week ago
I don't understand what they should be doing then ?
The same as in dplyr, as usual 🙃
data.frame(a = 1:3) |>
dplyr::mutate(b = sum(a))
#> a b
#> 1 1 6
#> 2 2 6
#> 3 3 6
Created on 2024-05-11 with reprex v2.1.0
Since this error happens at the binding stage, it's not critical for correctness, but might be for performance.
I see, so do the aggregate and recycle the value. I'll look into it.
Thanks, Romain. #89 and the associated PR look more important at this stage.
Having looked at how duckplyr
handles e.g. the regular sum
, this feels more like a translation problem and we need to squeeze in a duckdb$expr_window
so that the duckdb expression gains a OVER ()
library(duckdbrfuns)
con <- duckdbrfuns:::duckdb_con()
duckdb <- asNamespace("duckdb")
experimental <- FALSE
df1 <- data.frame(a = 1:10)
rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
rel1,
list(
{
tmp_expr <- duckdb$expr_reference("a")
duckdb$expr_set_alias(tmp_expr, "a")
tmp_expr
},
{
tmp_expr <- duckdb$expr_window(
duckdb$expr_function("r_base::sum", list(duckdb$expr_reference("a")))
)
duckdb$expr_set_alias(tmp_expr, "b")
tmp_expr
}
)
)
print(rel2)
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [a as a, r_base::sum(a) OVER () as b]
#> r_dataframe_scan(0x10de98358)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (INTEGER)
#> - b (INTEGER)
duckdb$rel_to_altrep(rel2)
#> a b
#> 1 1 55
#> 2 2 55
#> 3 3 55
#> 4 4 55
#> 5 5 55
#> 6 6 55
#> 7 7 55
#> 8 8 55
#> 9 9 55
#> 10 10 55
Created on 2024-05-21 with reprex v2.1.0
IIUC duckplyr
bases the need for wrapping in a expr_window
based on an allow list:
known_window <- c(
# Window functions
"rank", "rank_dense", "dense_rank", "percent_rank",
"row_number", "first_value", "last_value", "nth_value",
"cume_dist", "lead", "lag", "ntile",
# Aggregates
"sum", "mean", "stddev", "min", "max", "median",
#
NULL
)
Perhaps instead of (in addition to?) that the r_base::
functions could mark that they are aggregates ? i.e. use r_base::aggregate::sum
instead of r_base::sum
so that then duckplyr
can take advantage of that naming convention to squeeze in a expr_window
.
Do we even need the "project" versions of sum()
then? Does your reprex work with "r_base::aggregate::sum"
instead of "r_base::sum"
?
Less code in the extension means less complexity. Sorry about the extra work.
Yeah exactly, we only need the function that aggregate in the extension, as in https://github.com/hannes/duckdb-rfuns/pull/95
This needs then a little hand holding in duckplyr
: https://github.com/duckdblabs/duckplyr/pull/171
Perhaps we don't even need the naming convention, and we can somehow interrogate the extension to find out if a function is a ScalarFunctionSet
or a AggregateFunctionset
but I'm not sure it's worth it.
Indeed the reprex needs r_base::aggregate::sum
. Here's a simplified as we don't need the alias:
library(duckdbrfuns)
con <- duckdbrfuns:::duckdb_con()
duckdb <- asNamespace("duckdb")
experimental <- FALSE
df1 <- data.frame(a = 1:10)
rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
rel1,
list(
duckdb$expr_reference("a"),
duckdb$expr_window(
duckdb$expr_function("r_base::aggregate::sum", list(duckdb$expr_reference("a")))
)
)
)
duckdb$rel_to_altrep(rel2)
#> a r_base::aggregate::sum(a) OVER ()
#> 1 1 55
#> 2 2 55
#> 3 3 55
#> 4 4 55
#> 5 5 55
#> 6 6 55
#> 7 7 55
#> 8 8 55
#> 9 9 55
#> 10 10 55
Created on 2024-05-21 with reprex v2.1.0
Thanks, confirmed with most recent duckdb, without having to change the names:
library(duckdb)
#> Loading required package: DBI
drv <- duckdb()
con <- dbConnect(drv)
duckdb <- asNamespace("duckdb")
duckdb$rapi_load_rfuns(drv@database_ref)
experimental <- FALSE
df1 <- data.frame(a = 1:3)
rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
rel1,
list(
duckdb$expr_reference("a"),
duckdb$expr_window(
duckdb$expr_function("r_base::sum", list(duckdb$expr_reference("a")))
)
)
)
duckdb$rel_to_altrep(rel2)
#> a r_base::sum(a) OVER ()
#> 1 1 6
#> 2 2 6
#> 3 3 6
Created on 2024-05-22 with reprex v2.1.0
Looks like this issue is moot. I need to trace it back to the original duckplyr problem.
With duckdb v0.10.2:
Created on 2024-05-05 with reprex v2.1.0