tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org

Supporting persisted tables for Spark SQL backend #1502

Closed: zacdav-db closed this issue 3 months ago

zacdav-db commented 4 months ago

Spark SQL (in this case, against Databricks) should be able to support non-temporary writes, but currently this errors like so:

> results <- tbl(con, I("samples.nyctaxi.trips")) %>%
+   group_by(pickup_zip) %>%
+   summarise(avg_trip_dist = mean(trip_distance))

> compute(results, I("zacdav.default.avg_trip_dist"), temporary = FALSE)
Error in `db_compute()`:
! Spark SQL only support temporary tables
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours. 

> rlang::last_trace(drop = FALSE)
<error/rlang_error>
Error in `db_compute()`:
! Spark SQL only support temporary tables
---
Backtrace:
    ▆
 1. ├─dplyr::compute(results, I("zacdav.default.avg_trip_dist"), temporary = FALSE)
 2. └─dbplyr:::compute.tbl_sql(...)
 3.   ├─dbplyr::db_compute(...)
 4.   └─dbplyr:::`db_compute.Spark SQL`(...)
 5.     └─cli::cli_abort("Spark SQL only support temporary tables")
 6.       └─rlang::abort(...)
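(As an aside, the aggregation warning in the output above is unrelated to the error; per the warning's own suggestion, it can be silenced by passing na.rm = TRUE to mean():

results <- tbl(con, I("samples.nyctaxi.trips")) %>%
  group_by(pickup_zip) %>%
  summarise(avg_trip_dist = mean(trip_distance, na.rm = TRUE)))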

It looks like the backend methods behind this check, such as `db_compute.Spark SQL` visible in the backtrace above, likely need to be adjusted.
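For illustration only, here is a minimal sketch of what a persistence-aware method could look like. This is not the actual change merged for this issue; it assumes the db_compute() S3 generic seen in the backtrace, that `table` arrives as an already-escaped identifier string, and Spark SQL's CREATE TABLE ... AS SELECT syntax:

`db_compute.Spark SQL` <- function(con, table, sql, ...,
                                   overwrite = FALSE,
                                   temporary = TRUE) {
  if (temporary) {
    # Session-scoped view, dropped when the connection ends.
    ddl <- paste0("CREATE TEMPORARY VIEW ", table, " AS\n", sql)
  } else {
    # Persisted table written to the catalog (e.g. Unity Catalog on
    # Databricks). CREATE OR REPLACE TABLE handles overwrite = TRUE.
    or_replace <- if (overwrite) "OR REPLACE " else ""
    ddl <- paste0("CREATE ", or_replace, "TABLE ", table, " AS\n", sql)
  }
  DBI::dbExecute(con, ddl)
  table
}

The key design point is that the temporary branch keeps the existing CREATE TEMPORARY VIEW behaviour, while the non-temporary branch issues catalog-level DDL instead of aborting.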

zacdav-db commented 3 months ago

This is now resolved via #1514
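With that change, a persisted write along the lines of the original report should succeed rather than abort, e.g.:

compute(results, I("zacdav.default.avg_trip_dist"), temporary = FALSE)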