mlverse / pysparklyr

Extension to {sparklyr} that allows you to interact with Spark & Databricks Connect
https://spark.posit.co/deployment/databricks-connect.html

Support dbWriteTable() #94

Open blairj09 opened 11 months ago

blairj09 commented 11 months ago

Currently, when trying to write local data to Databricks with `DBI::dbWriteTable()`, the following error is observed:

```r
> library(sparklyr)
> sc <- spark_connect(method = "databricks_connect", cluster_id = "*************")
! Changing host URL to: ****************
  Set `host_sanitize = FALSE` in `spark_connect()` to avoid changing it
✔ Retrieving info for cluster: '*************' [313ms]
✔ Using the 'r-sparklyr-databricks-14.0' Python environment
  Path: /home/james/.virtualenvs/r-sparklyr-databricks-14.0/bin/python
✔ Connecting to 'Test Cluster' (DBR '14.0') [470ms]
> DBI::dbWriteTable(sc, "demos.testing.foo", mtcars, overwrite = TRUE)
Error in UseMethod("invoke") :
  no applicable method for 'invoke' applied to an object of class "list"
```
```
> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] DBI_1.1.3             pysparklyr_0.1.2.9000 sparklyr_1.8.4

loaded via a namespace (and not attached):
 [1] Matrix_1.6-1.1    jsonlite_1.8.7    dplyr_1.1.3       compiler_4.3.1    tidyselect_1.2.0
 [6] Rcpp_1.0.11       parallel_4.3.1    tidyr_1.3.0       png_0.1-8         uuid_1.1-1
[11] yaml_2.3.7        reticulate_1.34.0 lattice_0.21-9    R6_2.5.1          generics_0.1.3
[16] curl_5.1.0        httr2_0.2.3       knitr_1.44        tibble_3.2.1      openssl_2.1.1
[21] pillar_1.9.0      rlang_1.1.1       utf8_1.2.3        xfun_0.40         config_0.3.2
[26] fs_1.6.3          cli_3.6.1         magrittr_2.0.3    ps_1.7.5          grid_4.3.1
[31] processx_3.8.2    rstudioapi_0.15.0 dbplyr_2.3.4      rappdirs_0.3.3    askpass_1.2.0
[36] lifecycle_1.0.3   vctrs_0.6.3       glue_1.6.2        fansi_1.0.4       purrr_1.0.2
[41] httr_1.4.7        tools_4.3.1       pkgconfig_2.0.3
```
edgararuiz commented 11 months ago

This will require an entirely new DBI back-end for pysparklyr objects, which is not something I'd like to start this close to release time.

tnederlof commented 1 week ago

This request is coming up from users at a customer. Since so many of them are used to `dbWriteTable()`, supporting it would really help them onboard onto clusters from Workbench.

In the meantime, I suggested they do something like:

```r
# Stage the local data frame in Spark, then write it out as a catalog table.
# Note: rep(1, 5) rather than rep(1, 5, 1) -- the third positional argument
# is length.out, which would truncate the vector to a single element.
random_df <- tibble::tibble("A" = rep(1, 5), "B" = rep(1, 5))

spark_tbl_random_df <- copy_to(sc, random_df, "spark_random_df")

spark_tbl_random_df %>%
  spark_write_table(
    name = I("demo.default.random_df"),
    mode = "overwrite"
  )
```
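
Until a proper DBI back-end exists, the workaround above could be wrapped in a small helper that mimics the `dbWriteTable()` calling convention. This is only a sketch: `db_write_table()` and its temporary-name scheme are hypothetical, not part of sparklyr or pysparklyr.

```r
library(sparklyr)

# Hypothetical helper: emulate DBI::dbWriteTable() for Databricks Connect
# connections by round-tripping through copy_to() + spark_write_table().
db_write_table <- function(sc, name, value, overwrite = FALSE) {
  # Stage the local data frame in Spark under a throwaway temp-view name
  tmp_name <- paste0("tmp_dbwt_", as.integer(Sys.time()))
  spark_tbl <- copy_to(sc, value, tmp_name, overwrite = TRUE)

  # Write to the target table; I() keeps the qualified
  # catalog.schema.table name from being treated as a single identifier
  spark_write_table(
    spark_tbl,
    name = I(name),
    mode = if (overwrite) "overwrite" else "error"
  )
}

# Usage (assumes an open Databricks Connect connection `sc`):
# db_write_table(sc, "demos.testing.foo", mtcars, overwrite = TRUE)
```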