Closed krlmlr closed 3 years ago
Simpler building block from @krlmlr:
memdb_row <- function(...) {
db_row(..., .src = src_memdb())
}
db_row <- function(..., .src) {
data <- tibble(...)
stopifnot(nrow(data) == 1)
values <- unlist(map(data, DBI::dbQuoteLiteral, conn = .src$con))
from <- sql(paste0("SELECT ", paste0(
values, " AS ", DBI::dbQuoteIdentifier(.src$con, names(values)),
collapse = ", "
)))
dbplyr:::tbl.src_dbi(.src, from, vars = names(values))
}
memdb_row(a = 1, b = "x")
#> # Source: SQL [?? x 2]
#> # Database: sqlite 3.35.5 [:memory:]
#> a b
#> <int> <chr>
#> 1 1 x
memdb_row(a = 1, b = "x") %>%
sql_render()
#> <SQL> SELECT 1 AS `a`, 'x' AS `b`
New version "works" (reprex below), but this function is surely pretty awful (it's a de minimis revision of the earlier version, which iterated over columns not rows).
df_to_pg <- function(df, conn) {
temp_starter_sql <- list()
for (i in 1:nrow(df)) {
temp_starter_sql[[i]] = db_row(df[i, ], .src = conn)
}
Reduce(dplyr::union_all, temp_starter_sql)
}
library(farr)
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
cars_pg <- df_to_pg(cars, pg)
cars_pg
#> # Source: lazy query [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#> speed dist
#> <dbl> <dbl>
#> 1 4 2
#> 2 4 10
#> 3 7 4
#> 4 7 22
#> 5 8 16
#> 6 9 10
#> 7 10 18
#> 8 10 26
#> 9 10 34
#> 10 11 17
#> # … with more rows
Created on 2021-05-14 by the reprex package (v2.0.0)
@krlmlr i got the function down to a single line that calls your function. Thanks.
See here.
@krlmlr Unfortunately, Reduce
gives an error like Error: C stack usage 7955336 is too close to the limit
when I try to feed a fairly large data frame (about 10,000 rows). This is probably more than this function should be used for, but rather than editing the part of my "book" where I'm using it I just made a temporary fix of reverting back to the earlier version.
Below is the code prior to reversion. It's your db_row
function, plus as one-line df_to_pg
function that calls it.
df_to_pg <- function(df, conn) {
Reduce(dplyr::union_all, purrr::pmap(df, db_row, .src=conn))
}
db_row <- function(..., .src) {
data <- dplyr::tibble(...)
stopifnot(nrow(data) == 1)
values <- unlist(purrr::map(data, DBI::dbQuoteLiteral, conn = .src))
from <- dbplyr::sql(paste0("SELECT ", paste0(
values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
collapse = ", "
)))
dplyr::tbl(.src, from, vars = names(values))
}
Note that if I use purrr::pmap(df, db_row, .src=conn)
alone (i.e., no Reduce
), then it all works fine.
If I replace the last line above with collect(dplyr::tbl(.src, from, vars = names(values)))
, it all works fine.
So it's something about the interaction of Reduce
and a tbl_sql
that doesn't work. Note that I get the same error with purrr::reduce
.
Is there an alternative way to union_all
a list of tables? (I think there might be merit in having a bind_rows
function that works with database tables … though it seems that Reduce(x, union_all)
isn't the best way to do this.)
I think we only need paste0()
with and without collapse = "..."
, and do.call()
and the rlang equivalent. Plus a list()
matrix and t()
. We'd remove stopifnot()
in db_row()
and make it work for more than one row. The unlist()
has to go.
Let me know if this makes sense.
Not sure about the data structure, maybe the main work can be done on a list of character vectors, which can then be transposed with t()
.
@krlmlr I confess I'm a bit confused. Are these (conceptually!) the steps you have in mind?
# Some toy data
df <- tibble::tibble(some_date = seq(as.Date("2020-01-04"), by=1, length.out = 3),
some_num = 2021:2023L,
some_text = c("Hi!", "Huh?", "Bye"))
df
#> # A tibble: 3 x 3
#> some_date some_num some_text
#> <date> <int> <chr>
#> 1 2020-01-04 2021 Hi!
#> 2 2020-01-05 2022 Huh?
#> 3 2020-01-06 2023 Bye
# Step 1.
# Use some function to turn a data frame into some kind
# of matrix. Presumably this would use DBI::dbQuoteLiteral and
# DBI::dbQuoteIdentifier.
# For now, I just hand-coded this step.
df_new <-
tibble::tibble(some_date = paste0("'", df$some_date, "'::date AS some_date"),
some_num = paste0("'", df$some_num, "'::integer AS some_num"),
some_text = paste0("'", df$some_text, "'::text AS some_text"))
df_new <- as.matrix(df_new)
# Step 2.
# Transpose the matrix ...
# I used as_tibble so as to get a list that I could use lapply on (below).
# I have no doubt that there's a better way.
df_transposed <- tibble::as_tibble(t(df_new))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
df_transposed
#> # A tibble: 3 x 3
#> V1 V2 V3
#> <chr> <chr> <chr>
#> 1 '2020-01-04'::date AS som… '2020-01-05'::date AS so… '2020-01-06'::date AS so…
#> 2 '2021'::integer AS some_n… '2022'::integer AS some_… '2023'::integer AS some_…
#> 3 'Hi!'::text AS some_text 'Huh?'::text AS some_text 'Bye'::text AS some_text
# Step 3.
# Make SQL that could be rendered by the database
make_db_row <- function(x) {
paste0("(SELECT ", paste0(x, collapse = ", "), ")")
}
df_row_list <- lapply(df_transposed, make_db_row)
df_sql <- paste0(df_row_list, collapse = "\nUNION ALL\n")
cat(df_sql)
#> (SELECT '2020-01-04'::date AS some_date, '2021'::integer AS some_num, 'Hi!'::text AS some_text)
#> UNION ALL
#> (SELECT '2020-01-05'::date AS some_date, '2022'::integer AS some_num, 'Huh?'::text AS some_text)
#> UNION ALL
#> (SELECT '2020-01-06'::date AS some_date, '2023'::integer AS some_num, 'Bye'::text AS some_text)
# Step 4.
# Render the SQL in the database and return a table
Created on 2021-05-19 by the reprex package (v2.0.0)
Yeah, step 1 is the most tricky one, and also needs a ", "
after all but the last columns. This can be achieved I think with a sequence of map(x, ~ paste(...))
and map2(x, y, ~ paste(...))
calls.
After step 1 we then can do.call(x, paste0)
to efficiently concatenate all the strings for all the rows. Step 2 seems no longer needed. I don't remember why I thought we'd need t()
.
Then, we paste0()
the entire resulting vector and paste(x, collapse = " UNION ALL ")
to get the entire expression. All with vectorized operations, the map()
calls, in the beginning, being the exception but still fast because we're using one paste()
per column only (and not one per row).
df_to_pg <- function(df, conn) {
df <- rowwise(df)
temp <- purrr::pmap(df, db_row, .src=conn)
the_sql <- paste(temp, collapse = "\nUNION ALL\n")
dplyr::tbl(conn, sql(the_sql))
}
db_row <- function(..., .src) {
data <- dplyr::tibble(...)
values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)
from <- paste0("SELECT ", paste0(
values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
collapse = ", "))
}
The above works, but it seems that I have just moved the stack overflow issue to PostgreSQL:
Error: Failed to prepare query: ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
The original version I made works fine, albeit only with PostgreSQL and it's a bit hacky.
How many rows/cells does the new solution support?
About 6,000 rows from one data set (seems the same against two versions of PostgreSQL: one on a Mac mini M1 and one on Ubuntu 21.04). Mind you, I guess that copy_to
is pretty difficult to beat on PostgreSQL, as PostgreSQL has pretty fast approaches for this.
There is certainly no obvious performance gain from the approach above (though the function seems grossly inefficient), though it's pretty competitive with small tables.
library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(farr)
df_to_pg <- function(df, conn) {
df <- rowwise(df)
temp <- purrr::pmap(df, db_row, .src=conn)
the_sql <- paste(temp, collapse = "\nUNION ALL\n")
dplyr::tbl(conn, sql(the_sql))
}
db_row <- function(..., .src) {
data <- dplyr::tibble(...)
values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)
from <- paste0("SELECT ", paste0(
values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
collapse = ", "))
}
N <- 6000L
df <- tibble(id = 1:N,
date = seq(as.Date("2020-01-01"), length.out = N, by = 1))
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
# Using my "original approach" (github: iangow/farr)
system.time(df_db <- farr::df_to_pg(df, pg))
#> user system elapsed
#> 0.078 0.004 0.090
df_db
#> # Source: SQL [?? x 2]
#> # Database: postgres [iangow@localhost:5432/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
#> # … with more rows
# Using function above
system.time(df_db <- df_to_pg(df, pg))
#> user system elapsed
#> 3.059 0.026 10.134
df_db
#> # Source: SQL [?? x 2]
#> # Database: postgres [iangow@localhost:5432/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
#> # … with more rows
# Using copy_to
system.time(df_db <- copy_to(pg, df, overwrite = TRUE))
#> user system elapsed
#> 0.014 0.001 0.031
df_db
#> # Source: table<df> [?? x 2]
#> # Database: postgres [iangow@localhost:5432/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
#> # … with more rows
Created on 2021-06-11 by the reprex package (v2.0.0)
@krlmlr
For a small data set (n = 10
), (my crude implementation of) your approach is fastest. But it's fast any of the ways used below.
library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(farr)
df_to_pg <- function(df, conn) {
df <- rowwise(df)
temp <- purrr::pmap(df, db_row, .src=conn)
the_sql <- paste(temp, collapse = "\nUNION ALL\n")
dplyr::tbl(conn, sql(the_sql))
}
db_row <- function(..., .src) {
data <- dplyr::tibble(...)
values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)
from <- paste0("SELECT ", paste0(
values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
collapse = ", "))
}
N <- 10L
df <- tibble(id = 1:N,
date = seq(as.Date("2020-01-01"), length.out = N, by = 1))
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
# Using my the approach from farr (github: iangow/farr)
system.time(df_db <- farr::df_to_pg(df, pg))
#> user system elapsed
#> 0.060 0.003 0.066
df_db
#> # Source: SQL [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
# Using function above
system.time(df_db <- df_to_pg(df, pg))
#> user system elapsed
#> 0.013 0.000 0.013
df_db
#> # Source: SQL [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
# Using copy_to
system.time(df_db <- copy_to(pg, df, overwrite = TRUE))
#> user system elapsed
#> 0.007 0.001 0.029
df_db
#> # Source: table<df> [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
Created on 2021-06-12 by the reprex package (v2.0.0)
Is this a local or remote database? Can you do a test against a remote database where the temporary table is used in another operation (e.g. we create two temporary tables and join)?
There are two observations driving this:
copy_to()
adds three roundtrips to the database, as opposed to only one if it's inline.I suspect that the combination of latency increase (if relevant) and roundtrips will show much more substantial differences.
Local and almost local (2m away). But if I do the same from a computer in Australia (16,913,000m away), which has a fairly consistent 250ms of latency, your hypothesis is confirmed (perhaps even more latency than would be explained by three round trips).
library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(farr)
df_to_pg <- function(df, conn) {
temp <- purrr::pmap(df, db_row, .src=conn)
the_sql <- paste(temp, collapse = "\nUNION ALL\n")
dplyr::tbl(conn, sql(the_sql))
}
db_row <- function(..., .src) {
data <- dplyr::tibble(...)
values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)
from <- paste0("SELECT ", paste0(
values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
collapse = ", "))
}
N <- 10L
df <- tibble(id = 1:N,
date = seq(as.Date("2020-01-01"), length.out = N, by = 1))
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
# Using my the approach from farr (github: iangow/farr)
system.time(df_db <- farr::df_to_pg(df, pg))
#> user system elapsed
#> 0.161 0.012 0.864
df_db
#> # Source: SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
# Using function above
system.time(df_db <- df_to_pg(df, pg))
#> user system elapsed
#> 0.031 0.000 0.722
df_db
#> # Source: SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
# Using copy_to
system.time(df_db <- copy_to(pg, df, overwrite = TRUE))
#> user system elapsed
#> 0.040 0.004 5.060
df_db
#> # Source: table<df> [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
Created on 2021-06-13 by the reprex package (v2.0.0)
Again with a larger data frame:
library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(farr)
df_to_pg <- function(df, conn) {
temp <- purrr::pmap(df, db_row, .src=conn)
the_sql <- paste(temp, collapse = "\nUNION ALL\n")
dplyr::tbl(conn, sql(the_sql))
}
db_row <- function(..., .src) {
data <- dplyr::tibble(...)
values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)
from <- paste0("SELECT ", paste0(
values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
collapse = ", "))
}
N <- 5000L
df <- tibble(id = 1:N,
date = seq(as.Date("2020-01-01"), length.out = N, by = 1))
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
# Using my the approach from farr (github: iangow/farr)
system.time(df_db <- farr::df_to_pg(df, pg))
#> user system elapsed
#> 0.187 0.008 1.629
df_db
#> # Source: SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
#> # … with more rows
# Using function above
system.time(df_db <- df_to_pg(df, pg))
#> user system elapsed
#> 8.843 0.000 21.331
df_db
#> # Source: SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
#> # … with more rows
# Using copy_to
system.time(df_db <- copy_to(pg, df, overwrite = TRUE))
#> user system elapsed
#> 0.062 0.000 5.503
df_db
#> # Source: table<df> [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
#> # … with more rows
# Using @krlmlr version
df_to_pg_2 <- function(df, conn) {
values <- purrr::map(df, DBI::dbQuoteLiteral, conn = conn)
names <- DBI::dbQuoteIdentifier(conn, names(df))
as <- t(as.data.frame(purrr::map2(values, names, ~ paste0(.x, " AS ", .y))))
collapsed <- purrr::map_chr(as.data.frame(as), paste, collapse = ", ")
select <- paste0("SELECT ", collapsed)
dplyr::tbl(conn, dplyr::sql(paste(select, collapse = "\nUNION ALL\n")))
}
system.time(df_db <- df_to_pg_2(df, pg))
#> user system elapsed
#> 0.122 0.004 11.851
df_db
#> # Source: SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#> id date
#> <int> <date>
#> 1 1 2020-01-01
#> 2 2 2020-01-02
#> 3 3 2020-01-03
#> 4 4 2020-01-04
#> 5 5 2020-01-05
#> 6 6 2020-01-06
#> 7 7 2020-01-07
#> 8 8 2020-01-08
#> 9 9 2020-01-09
#> 10 10 2020-01-10
#> # … with more rows
Created on 2021-06-13 by the reprex package (v2.0.0)
Very interesting! I think df_to_pg()
can be made faster by avoiding rowwise()
, does this affect your timings?
In the end, I suspect that Postgres's way of constructing an inline table still is faster but UNION ALL
might just work almost everywhere (except Oracle where we'll need DUAL
).
library(DBI)
library(tidyverse)
df_to_pg <- function(df, conn) {
df <- rowwise(df)
temp <- purrr::pmap(df, db_row, .src=conn)
the_sql <- paste(temp, collapse = "\nUNION ALL\n")
the_sql
}
db_row <- function(..., .src) {
data <- dplyr::tibble(...)
values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)
from <- paste0("SELECT ", paste0(
values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
collapse = ", "))
}
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
N <- 5000
df <- tibble(id = 1:N,
date = seq(as.Date("2020-01-01"), length.out = N, by = 1))
system.time(sql_1 <- df_to_pg(df, con))
#> user system elapsed
#> 8.149 0.020 8.181
df_to_pg_2 <- function(df, conn) {
values <- purrr::map(df, DBI::dbQuoteLiteral, conn = conn)
names <- DBI::dbQuoteIdentifier(conn, names(df))
as <- t(as.data.frame(map2(values, names, ~ paste0(.x, " AS ", .y))))
collapsed <- map_chr(as.data.frame(as), paste, collapse = ", ")
select <- paste0("SELECT ", collapsed)
paste(select, collapse = "\nUNION ALL\n")
}
system.time(sql_2 <- df_to_pg_2(df, con))
#> user system elapsed
#> 0.064 0.000 0.064
identical(sql_1, sql_2)
#> [1] TRUE
Created on 2021-06-13 by the reprex package (v2.0.0)
@krlmlr Updated comments above with code without rowwise()
and also with your df_to_pg_2
applied over a 16,913km distance.
So it's really worth to specialize here for Postgres, and for other databases it's unlikely to be useful for tables of more than 100 or so cells. Should we discuss these findings upstream?
@krlmlr Another approach is to use the VALUES
keyword, which seems to exist in PostgreSQL (tested below), MySQL, and Oracle, but not quite in SQLite.
No stack overflow issues and performance is pretty good.
Data below are for a 16,913km connection.
library(DBI)
library(dplyr, warn.conflicts = FALSE)
df_to_pg <- function(df, conn) {
collapse <- function(x) paste0("(", paste(x, collapse = ", "), ")")
names <- paste(DBI::dbQuoteIdentifier(conn, names(df)), collapse = ", ")
the_sql <-
df %>%
lapply(dbQuoteLiteral, conn = con) %>%
purrr::transpose() %>%
lapply(collapse) %>%
paste(collapse = ",\n") %>%
paste("SELECT * FROM (VALUES", ., ") AS t (", names, ")")
dplyr::tbl(conn, dplyr::sql(the_sql))
}
con <- dbConnect(RPostgres::Postgres(), bigint = "integer")
N <- 52000
df <- tibble(id = 1:N,
date = seq(as.Date("2020-01-01"), length.out = N, by = 1),
letter = rep(letters, N/26))
system.time(sql_1 <- df_to_pg(df, con))
#> user system elapsed
#> 0.918 0.008 3.531
system.time(dbWriteTable(con, "junk", df,
overwrite = TRUE, temporary = TRUE))
#> user system elapsed
#> 0.156 0.000 2.069
Created on 2021-06-15 by the reprex package (v2.0.0)
Updated code in last comment to use purrr::transpose
in place of as.data.frame() %>% t() %>% as.data.frame()
. I added a column to test data set, as documentation for purrr::transpose
suggested it might only work for 2-column data frames. But purrr::transpose
seems fine with the new 3-column data frame.
@krlmlr
So it's really worth to specialize here for Postgres, and for other databases it's unlikely to be useful for tables of more than 100 or so cells. Should we discuss these findings upstream?
What would be the natural upstream place to take this? If the most recent VALUES
-based approach works for several implementations, then perhaps it might belong at the DBI
level. If not, perhaps at RPostgres
. I'd guess that the relevant use case (i.e., read-only database connections) simply isn't relevant to SQLite, so an approach that covers PostgreSQL, MySQL, and Oracle and a few others may be general enough.
I don't know the details of the relation between dplyr::copy_to()
and DBI::dbWriteTable
(e.g., does the former call the latter?) So , focusing on DBI::dbWriteTable
, one approach would be to have an argument (e.g.,read_only
) that calls a version of the function above if TRUE
. (A more user-friendly approach might automatically detect that a connection is read-only and take care of this for the user, but there doesn't seem to be existing functionality to do this.)
Thanks. Confirming that this also works for the SQL Server.
I'd say this should live in dbplyr, happy to host it in dm for now.
We could add a method for SQLite, it could just call copy_to()
. The default copy_to()
calls dbWriteTable()
, we could also implement a DBI method that does this (but would duplicate the logic for SQLite). I'm not sure.
Caveat to following: I am just a user, not a developer.
Alternatively, perhaps this could be implemented (for now) in RPostgres
as a consequence of an extra argument (say, read_only
) to dbWriteTable
being TRUE
. This is a bit like copy = TRUE
, which seems to be in RPostgres:: dbWriteTable
, but not in DBI:: dbWriteTable
. From a user perspective dbWriteTable(..., read_only = TRUE)
would give a result not materially different from dbWriteTable(..., temporary = TRUE)
.
I figure there would be some testing that would need to be done (I assume that there are some tests for the existing RPostgres::dbWriteTable
). For example, do date-times get handled properly? Perhaps this is easiest done (for now) against a specific implementation. I'm happy to adapt the tests (i.e., check that dbWriteTable(..., read_only = TRUE)
gives the same results as dbWriteTable
as currently implemented).
Thanks. We might consider an sqlInlineTable()
in DBI indeed. I'm not sure it's a good fit for dbWriteTable()
.
For Postgres this should give very similar translations, this also makes it usable for other databases.
Thanks for open-sourcing this and posting in dbplyr, I went to the dbplyr issue tracker to suggest this feature. The main advantage over
copy_to()
is that this remains a lazy query until executed.Also, this perhaps needs
UNION ALL
to be more compatible across SQL dialects.