tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
477 stars 173 forks source link

[question] expected output of `x %>% dbplyr::compute() %>% dbplyr::remote_name()` #639

Closed yitao-li closed 2 years ago

yitao-li commented 3 years ago

For example, consider

df1 <- memdb_frame(x = 1)
df2 <- df1 %>% mutate(y = 0)
dfc <- df2 %>% compute()
print(dfc %>% remote_name())

With dbplyr version 2.1.1 the output is NULL, but I seem to remember some earlier version of dbplyr would produce a non-NULL output, and also based on how I understood the dbplyr documentation, a non-NULL output is expected, right?

BTW the aforementioned NULL / non-NULL outputs also apply to Spark dataframes in sparklyr or other DB backends, not just memdb_frames.

yitao-li commented 3 years ago

In particular, now if I run the following with sparklyr:

library(sparklyr)

sc <- spark_connect(master = "local")
sdf <- sdf_len(sc, 5)
print(dbplyr::remote_name(sdf))
sdf <- sdf %>% dplyr::compute()
print(dbplyr::remote_name(sdf))

I see the table had a remote name to begin with, but then after dplyr::compute() the remote name is not there any more, which seems counter-intuitive, and I don't think this happened with an earlier version of dbplyr.

yitao-li commented 3 years ago

And same with the memdb_frame example too (i.e., if I simplify the example I mentioned in the issue to something matching the sparklyr example from the above):

library(dplyr)
library(dbplyr)

df <- memdb_frame(x = 1)
print(df %>% remote_name()) # has remote name
df <- df %>% compute()
print(df %>% remote_name()) # remote name becomes NULL now :/
yitao-li commented 3 years ago

^^ but the examples above do return a non-NULL remote name after compute() with dbplyr 2.0, and I believe there was some change in dbplyr 2.1.x that caused the NULL output

yitao-li commented 3 years ago

Proposed change: https://github.com/tidyverse/dbplyr/pull/649 (i.e., I think remote_name() should ignore the "trailing" bunch of op_ungroup operations -- let me know if it makes sense)

mgirlich commented 3 years ago

I think it would make more sense to make remote_name() a generic. And then also op_group_by should be ignored. For example:

remote_name <- function(x) {
  UseMethod("remote_name")
}

#' @export
remote_name.tbl_lazy <- function(x) {
  remote_name(x$ops)
}

#' @export
remote_name.op_base <- function(x) {
  x$x
}

#' @export
remote_name.op_group_by <- function(x) {
  remote_name(x$x)
}

#' @export
remote_name.op_ungroup <- function(x) {
  remote_name(x$x)
}

#' @export
remote_name.default <- function(x) {
  return()
}
yitao-li commented 3 years ago

@mgirlich That's a great idea! I have revised my PR accordingly.

krlmlr commented 3 years ago

FWIW it looks like this was introduced in 43559165cb051b3a2f1a9a1f4abfc05c32f8d020 (#616).