iangow / farr

Data and Code for Financial Accounting Research
https://cran.r-project.org/web/packages/farr/index.html

Use DBI::dbQuoteLiteral() in df_to_pg() #1

Closed krlmlr closed 3 years ago

krlmlr commented 3 years ago

For Postgres this should give very similar translations, and it also makes the function usable with other databases.

Thanks for open-sourcing this and posting in dbplyr; I went to the dbplyr issue tracker to suggest this feature. The main advantage over copy_to() is that the result remains a lazy query until executed.

Also, this perhaps needs UNION ALL to be more compatible across SQL dialects.
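
A minimal sketch (my illustration, assuming a DBI connection con; not code from this issue) of the idea: a data frame can be inlined as a series of single-row SELECT statements glued together with UNION ALL, which most SQL dialects accept.

df_sql <- paste(
  "SELECT 1 AS id, 'x' AS val",
  "UNION ALL",
  "SELECT 2 AS id, 'y' AS val"
)
# dplyr::tbl(con, dbplyr::sql(df_sql)) would then give a lazy table over this query.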

iangow commented 3 years ago

Simpler building block from @krlmlr:

memdb_row <- function(...) {
  db_row(..., .src = src_memdb())
}

db_row <- function(..., .src) {
  data <- tibble(...)
  stopifnot(nrow(data) == 1)
  values <- unlist(map(data, DBI::dbQuoteLiteral, conn = .src$con))

  from <- sql(paste0("SELECT ", paste0(
    values, " AS ", DBI::dbQuoteIdentifier(.src$con, names(values)),
    collapse = ", "
  )))
  dbplyr:::tbl.src_dbi(.src, from, vars = names(values))
}

memdb_row(a = 1, b = "x")
#> # Source:   SQL [?? x 2]
#> # Database: sqlite 3.35.5 [:memory:]
#>       a b    
#>   <int> <chr>
#> 1     1 x
memdb_row(a = 1, b = "x") %>%
  sql_render()
#> <SQL> SELECT 1 AS `a`, 'x' AS `b`
iangow commented 3 years ago

New version "works" (reprex below), but this function is surely pretty awful (it's a de minimis revision of the earlier version, which iterated over columns not rows).

df_to_pg <- function(df, conn) {

    temp_starter_sql <- list()
    for (i in seq_len(nrow(df))) {
        temp_starter_sql[[i]] <- db_row(df[i, ], .src = conn)
    }

    Reduce(dplyr::union_all, temp_starter_sql)
}

library(farr)
library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

cars_pg <- df_to_pg(cars, pg)
cars_pg
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#>    speed  dist
#>    <dbl> <dbl>
#>  1     4     2
#>  2     4    10
#>  3     7     4
#>  4     7    22
#>  5     8    16
#>  6     9    10
#>  7    10    18
#>  8    10    26
#>  9    10    34
#> 10    11    17
#> # … with more rows

Created on 2021-05-14 by the reprex package (v2.0.0)

iangow commented 3 years ago

@krlmlr I got the function down to a single line that calls your function. Thanks.

See here.

iangow commented 3 years ago

@krlmlr Unfortunately, Reduce gives an error like Error: C stack usage 7955336 is too close to the limit when I feed it a fairly large data frame (about 10,000 rows). That is probably more than this function should be used for, but rather than editing the part of my "book" where I use it, I just made a temporary fix by reverting to the earlier version.

Below is the code prior to reversion. It's your db_row function, plus a one-line df_to_pg function that calls it.

df_to_pg <- function(df, conn) {
    Reduce(dplyr::union_all, purrr::pmap(df, db_row, .src=conn))
}

db_row <- function(..., .src) {
    data <- dplyr::tibble(...)
    stopifnot(nrow(data) == 1)
    values <- unlist(purrr::map(data, DBI::dbQuoteLiteral, conn = .src))

    from <- dbplyr::sql(paste0("SELECT ", paste0(
        values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
        collapse = ", "
    )))
    dplyr::tbl(.src, from, vars = names(values))
}

Note that if I use purrr::pmap(df, db_row, .src = conn) alone (i.e., no Reduce), it all works fine. If I replace the last line above with collect(dplyr::tbl(.src, from, vars = names(values))), it also works fine. So it's something about the interaction of Reduce and a tbl_sql that doesn't work. Note that I get the same error with purrr::reduce.

Is there an alternative way to union_all a list of tables? (I think there might be merit in a bind_rows function that works with database tables, though it seems that Reduce(union_all, x) isn't the best way to do this.)
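
One possibility (my sketch, not code from this thread) is to combine the tables pairwise rather than with a left fold, so the lazy-query tree nests to a depth of roughly log2(n) instead of n:

union_all_balanced <- function(tbls) {
  # repeatedly union adjacent pairs until one table remains
  while (length(tbls) > 1) {
    idx <- seq(1, length(tbls) - 1, by = 2)
    paired <- lapply(idx, function(i) dplyr::union_all(tbls[[i]], tbls[[i + 1]]))
    if (length(tbls) %% 2 == 1) paired <- c(paired, tbls[length(tbls)])
    tbls <- paired
  }
  tbls[[1]]
}

This doesn't reduce the total number of union_all() calls, only how deeply they nest, so it may only postpone the stack problem.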

krlmlr commented 3 years ago

I think we only need paste0() with and without collapse = "...", plus do.call() and the rlang equivalent, plus a list() matrix and t(). We'd remove stopifnot() from db_row() and make it work for more than one row. The unlist() has to go.

Let me know if this makes sense.

krlmlr commented 3 years ago

Not sure about the data structure; maybe the main work can be done on a list of character vectors, which can then be transposed with t().

iangow commented 3 years ago

@krlmlr I confess I'm a bit confused. Are these (conceptually!) the steps you have in mind?

# Some toy data
df <- tibble::tibble(some_date = seq(as.Date("2020-01-04"), by=1, length.out = 3),
             some_num = 2021:2023L,
             some_text = c("Hi!", "Huh?", "Bye"))
df
#> # A tibble: 3 x 3
#>   some_date  some_num some_text
#>   <date>        <int> <chr>    
#> 1 2020-01-04     2021 Hi!      
#> 2 2020-01-05     2022 Huh?     
#> 3 2020-01-06     2023 Bye

# Step 1.
# Use some function to turn a data frame into some kind 
# of matrix. Presumably this would use DBI::dbQuoteLiteral and 
# DBI::dbQuoteIdentifier. 
# For now, I just hand-coded this step.
df_new <- 
    tibble::tibble(some_date = paste0("'", df$some_date, "'::date AS some_date"),
                   some_num = paste0("'", df$some_num, "'::integer AS some_num"),
                   some_text = paste0("'", df$some_text, "'::text AS some_text"))
df_new <- as.matrix(df_new)

# Step 2.
# Transpose the matrix ...
# I used as_tibble so as to get a list that I could use lapply on (below).
# I have no doubt that there's a better way.
df_transposed <- tibble::as_tibble(t(df_new))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
df_transposed
#> # A tibble: 3 x 3
#>   V1                         V2                        V3                       
#>   <chr>                      <chr>                     <chr>                    
#> 1 '2020-01-04'::date AS som… '2020-01-05'::date AS so… '2020-01-06'::date AS so…
#> 2 '2021'::integer AS some_n… '2022'::integer AS some_… '2023'::integer AS some_…
#> 3 'Hi!'::text AS some_text   'Huh?'::text AS some_text 'Bye'::text AS some_text

# Step 3.
# Make SQL that could be rendered by the database
make_db_row <- function(x) {
    paste0("(SELECT ", paste0(x, collapse = ", "), ")")
}

df_row_list <- lapply(df_transposed, make_db_row)
df_sql <- paste0(df_row_list, collapse = "\nUNION ALL\n")
cat(df_sql)
#> (SELECT '2020-01-04'::date AS some_date, '2021'::integer AS some_num, 'Hi!'::text AS some_text)
#> UNION ALL
#> (SELECT '2020-01-05'::date AS some_date, '2022'::integer AS some_num, 'Huh?'::text AS some_text)
#> UNION ALL
#> (SELECT '2020-01-06'::date AS some_date, '2023'::integer AS some_num, 'Bye'::text AS some_text)

# Step 4.
# Render the SQL in the database and return a table

Created on 2021-05-19 by the reprex package (v2.0.0)

krlmlr commented 3 years ago

Yeah, step 1 is the trickiest one, and it also needs a ", " after all but the last column. This can be achieved, I think, with a sequence of map(x, ~ paste(...)) and map2(x, y, ~ paste(...)) calls.

After step 1 we can then do.call(paste0, x) to efficiently concatenate all the strings for all the rows. Step 2 seems no longer needed; I don't remember why I thought we'd need t().

Then, we paste0() the entire resulting vector and paste(x, collapse = " UNION ALL ") to get the entire expression. All of this uses vectorized operations; the map() calls at the beginning are the exception, but they are still fast because we use one paste() per column (and not one per row).
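
As a rough illustration of these steps (my sketch, assuming a DBI connection conn and a data frame df; the df_to_pg_2() further down implements essentially the same idea, and the helper names here are made up):

library(purrr)

quote_columns <- function(df, conn) {
  values <- map(df, DBI::dbQuoteLiteral, conn = conn)   # one quoting call per column
  names <- DBI::dbQuoteIdentifier(conn, names(df))
  # step 1: "value AS name" for every cell, with ", " after all but the last column
  cols <- map2(values, names, ~ paste0(.x, " AS ", .y))
  cols[-length(cols)] <- map(cols[-length(cols)], paste0, ", ")
  cols
}

union_all_sql <- function(df, conn) {
  cols <- quote_columns(df, conn)
  rows <- do.call(paste0, unname(cols))                 # one string per row
  paste0("SELECT ", rows, collapse = "\nUNION ALL\n")   # the whole expression
}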

iangow commented 3 years ago
df_to_pg <- function(df, conn) {
    df <- rowwise(df)
    temp <- purrr::pmap(df, db_row, .src=conn)
    the_sql <- paste(temp, collapse = "\nUNION ALL\n")
    dplyr::tbl(conn, sql(the_sql))
}

db_row <- function(..., .src) {
    data <- dplyr::tibble(...)
    values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)

    from <- paste0("SELECT ", paste0(
        values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
        collapse = ", "))
}

The above works, but it seems that I have just moved the stack overflow issue to PostgreSQL:

Error: Failed to prepare query: ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.

The original version I made works fine, albeit only with PostgreSQL, and it's a bit hacky.

krlmlr commented 3 years ago

How many rows/cells does the new solution support?

iangow commented 3 years ago

About 6,000 rows from one data set (it seems to be the same against two versions of PostgreSQL: one on a Mac mini M1 and one on Ubuntu 21.04). Mind you, I guess that copy_to is pretty difficult to beat on PostgreSQL, as PostgreSQL has pretty fast approaches for this.

There is certainly no obvious performance gain from the approach above (the function seems grossly inefficient), though it's pretty competitive for small tables.

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(farr)

df_to_pg <- function(df, conn) {
    df <- rowwise(df)
    temp <- purrr::pmap(df, db_row, .src=conn)
    the_sql <- paste(temp, collapse = "\nUNION ALL\n")
    dplyr::tbl(conn, sql(the_sql))
}

db_row <- function(..., .src) {
    data <- dplyr::tibble(...)
    values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)

    from <- paste0("SELECT ", paste0(
        values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
        collapse = ", "))
}

N <- 6000L

df <- tibble(id = 1:N, 
             date = seq(as.Date("2020-01-01"), length.out = N, by = 1))

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

# Using my "original approach" (github: iangow/farr)
system.time(df_db <- farr::df_to_pg(df, pg))
#>    user  system elapsed 
#>   0.078   0.004   0.090
df_db
#> # Source:   SQL [?? x 2]
#> # Database: postgres [iangow@localhost:5432/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10
#> # … with more rows

# Using function above
system.time(df_db <- df_to_pg(df, pg))
#>    user  system elapsed 
#>   3.059   0.026  10.134
df_db
#> # Source:   SQL [?? x 2]
#> # Database: postgres [iangow@localhost:5432/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10
#> # … with more rows

# Using copy_to
system.time(df_db <- copy_to(pg, df, overwrite = TRUE))
#>    user  system elapsed 
#>   0.014   0.001   0.031
df_db
#> # Source:   table<df> [?? x 2]
#> # Database: postgres [iangow@localhost:5432/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10
#> # … with more rows

Created on 2021-06-11 by the reprex package (v2.0.0)

iangow commented 3 years ago

@krlmlr

For a small data set (n = 10), (my crude implementation of) your approach is fastest. But it's fast with any of the approaches used below.

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(farr)

df_to_pg <- function(df, conn) {
    df <- rowwise(df)
    temp <- purrr::pmap(df, db_row, .src=conn)
    the_sql <- paste(temp, collapse = "\nUNION ALL\n")
    dplyr::tbl(conn, sql(the_sql))
}

db_row <- function(..., .src) {
    data <- dplyr::tibble(...)
    values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)

    from <- paste0("SELECT ", paste0(
        values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
        collapse = ", "))
}

N <- 10L

df <- tibble(id = 1:N, 
             date = seq(as.Date("2020-01-01"), length.out = N, by = 1))

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

# Using the approach from farr (github: iangow/farr)
system.time(df_db <- farr::df_to_pg(df, pg))
#>    user  system elapsed 
#>   0.060   0.003   0.066
df_db
#> # Source:   SQL [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10

# Using function above
system.time(df_db <- df_to_pg(df, pg))
#>    user  system elapsed 
#>   0.013   0.000   0.013
df_db
#> # Source:   SQL [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10

# Using copy_to
system.time(df_db <- copy_to(pg, df, overwrite = TRUE))
#>    user  system elapsed 
#>   0.007   0.001   0.029
df_db
#> # Source:   table<df> [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10

Created on 2021-06-12 by the reprex package (v2.0.0)

krlmlr commented 3 years ago

Is this a local or a remote database? Can you do a test against a remote database where the temporary table is used in another operation (e.g., we create two temporary tables and join them)?

krlmlr commented 3 years ago

There are two observations driving this:

I suspect that the combination of latency increase (if relevant) and roundtrips will show much more substantial differences.

iangow commented 3 years ago

Local and almost local (2m away). But if I do the same from a computer in Australia (16,913,000m away), which has a fairly consistent 250ms of latency, your hypothesis is confirmed (perhaps even more latency than would be explained by three round trips).

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(farr)

df_to_pg <- function(df, conn) {
    temp <- purrr::pmap(df, db_row, .src=conn)
    the_sql <- paste(temp, collapse = "\nUNION ALL\n")
    dplyr::tbl(conn, sql(the_sql))
}

db_row <- function(..., .src) {
    data <- dplyr::tibble(...)
    values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)

    from <- paste0("SELECT ", paste0(
        values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
        collapse = ", "))
}

N <- 10L

df <- tibble(id = 1:N,
             date = seq(as.Date("2020-01-01"), length.out = N, by = 1))

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

# Using the approach from farr (github: iangow/farr)
system.time(df_db <- farr::df_to_pg(df, pg))
#>    user  system elapsed 
#>   0.161   0.012   0.864
df_db
#> # Source:   SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10

# Using function above
system.time(df_db <- df_to_pg(df, pg))
#>    user  system elapsed 
#>   0.031   0.000   0.722
df_db
#> # Source:   SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10

# Using copy_to
system.time(df_db <- copy_to(pg, df, overwrite = TRUE))
#>    user  system elapsed 
#>   0.040   0.004   5.060
df_db
#> # Source:   table<df> [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10

Created on 2021-06-13 by the reprex package (v2.0.0)

iangow commented 3 years ago

Again with a larger data frame:

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(farr)

df_to_pg <- function(df, conn) {
    temp <- purrr::pmap(df, db_row, .src=conn)
    the_sql <- paste(temp, collapse = "\nUNION ALL\n")
    dplyr::tbl(conn, sql(the_sql))
}

db_row <- function(..., .src) {
    data <- dplyr::tibble(...)
    values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)

    from <- paste0("SELECT ", paste0(
        values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
        collapse = ", "))
}

N <- 5000L

df <- tibble(id = 1:N,
             date = seq(as.Date("2020-01-01"), length.out = N, by = 1))

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

# Using the approach from farr (github: iangow/farr)
system.time(df_db <- farr::df_to_pg(df, pg))
#>    user  system elapsed 
#>   0.187   0.008   1.629
df_db
#> # Source:   SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10
#> # … with more rows

# Using function above
system.time(df_db <- df_to_pg(df, pg))
#>    user  system elapsed 
#>   8.843   0.000  21.331
df_db
#> # Source:   SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10
#> # … with more rows

# Using copy_to
system.time(df_db <- copy_to(pg, df, overwrite = TRUE))
#>    user  system elapsed 
#>   0.062   0.000   5.503
df_db
#> # Source:   table<df> [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10
#> # … with more rows

# Using @krlmlr's version
df_to_pg_2 <- function(df, conn) {
    values <- purrr::map(df, DBI::dbQuoteLiteral, conn = conn)
    names <- DBI::dbQuoteIdentifier(conn, names(df))

    as <- t(as.data.frame(purrr::map2(values, names, ~ paste0(.x, " AS ", .y))))
    collapsed <- purrr::map_chr(as.data.frame(as), paste, collapse = ", ")
    select <- paste0("SELECT ", collapsed)
    dplyr::tbl(conn, dplyr::sql(paste(select, collapse = "\nUNION ALL\n")))
}

system.time(df_db <- df_to_pg_2(df, pg))
#>    user  system elapsed 
#>   0.122   0.004  11.851
df_db
#> # Source:   SQL [?? x 2]
#> # Database: postgres [igow@iangow.me:5434/crsp]
#>       id date      
#>    <int> <date>    
#>  1     1 2020-01-01
#>  2     2 2020-01-02
#>  3     3 2020-01-03
#>  4     4 2020-01-04
#>  5     5 2020-01-05
#>  6     6 2020-01-06
#>  7     7 2020-01-07
#>  8     8 2020-01-08
#>  9     9 2020-01-09
#> 10    10 2020-01-10
#> # … with more rows

Created on 2021-06-13 by the reprex package (v2.0.0)

krlmlr commented 3 years ago

Very interesting! I think df_to_pg() can be made faster by avoiding rowwise(); does this affect your timings?

In the end, I suspect that Postgres's way of constructing an inline table is still faster, but UNION ALL might just work almost everywhere (except Oracle, where we'll need DUAL).
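
For reference (my illustration, not tested against an Oracle database), the generated SQL would then need a FROM DUAL on every branch, along the lines of:

SELECT 1 AS id, DATE '2020-01-01' AS dt FROM DUAL
UNION ALL
SELECT 2 AS id, DATE '2020-01-02' AS dt FROM DUAL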

library(DBI)
library(tidyverse)

df_to_pg <- function(df, conn) {
  df <- rowwise(df)
  temp <- purrr::pmap(df, db_row, .src=conn)
  the_sql <- paste(temp, collapse = "\nUNION ALL\n")
  the_sql
}

db_row <- function(..., .src) {
  data <- dplyr::tibble(...)
  values <- purrr::map(data, DBI::dbQuoteLiteral, conn = .src)

  from <- paste0("SELECT ", paste0(
    values, " AS ", DBI::dbQuoteIdentifier(.src, names(values)),
    collapse = ", "))
}

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

N <- 5000

df <- tibble(id = 1:N,
             date = seq(as.Date("2020-01-01"), length.out = N, by = 1))

system.time(sql_1 <- df_to_pg(df, con))
#>    user  system elapsed 
#>   8.149   0.020   8.181

df_to_pg_2 <- function(df, conn) {
  values <- purrr::map(df, DBI::dbQuoteLiteral, conn = conn)
  names <- DBI::dbQuoteIdentifier(conn, names(df))

  as <- t(as.data.frame(map2(values, names, ~ paste0(.x, " AS ", .y))))
  collapsed <- map_chr(as.data.frame(as), paste, collapse = ", ")
  select <- paste0("SELECT ", collapsed)
  paste(select, collapse = "\nUNION ALL\n")
}

system.time(sql_2 <- df_to_pg_2(df, con))
#>    user  system elapsed 
#>   0.064   0.000   0.064

identical(sql_1, sql_2)
#> [1] TRUE

Created on 2021-06-13 by the reprex package (v2.0.0)

iangow commented 3 years ago

@krlmlr Updated the comments above with code that omits rowwise() and also with your df_to_pg_2 applied over a 16,913 km distance.

krlmlr commented 3 years ago

So it's really worth specializing here for Postgres; for other databases it's unlikely to be useful for tables of more than 100 or so cells. Should we discuss these findings upstream?

iangow commented 3 years ago

@krlmlr Another approach is to use the VALUES keyword, which seems to exist in PostgreSQL (tested below), MySQL, and Oracle, but not quite in SQLite.

No stack overflow issues and performance is pretty good.

Data below are for a 16,913km connection.

library(DBI)
library(dplyr, warn.conflicts = FALSE)

df_to_pg <- function(df, conn) {

    collapse <- function(x) paste0("(", paste(x, collapse = ", "), ")")

    names <- paste(DBI::dbQuoteIdentifier(conn, names(df)), collapse = ", ")

    the_sql <-
        df %>%
        lapply(dbQuoteLiteral, conn = conn) %>%
        purrr::transpose() %>%
        lapply(collapse) %>%
        paste(collapse = ",\n") %>%
        paste("SELECT * FROM (VALUES", ., ") AS t (", names, ")")
    dplyr::tbl(conn, dplyr::sql(the_sql))
}

con <- dbConnect(RPostgres::Postgres(), bigint = "integer")
N <- 52000

df <- tibble(id = 1:N,
             date = seq(as.Date("2020-01-01"), length.out = N, by = 1),
             letter = rep(letters, N/26))

system.time(sql_1 <- df_to_pg(df, con))
#>    user  system elapsed 
#>   0.918   0.008   3.531

system.time(dbWriteTable(con, "junk", df, 
                         overwrite = TRUE, temporary = TRUE))
#>    user  system elapsed 
#>   0.156   0.000   2.069

Created on 2021-06-15 by the reprex package (v2.0.0)

iangow commented 3 years ago

Updated the code in the last comment to use purrr::transpose in place of as.data.frame() %>% t() %>% as.data.frame(). I added a column to the test data set, as the documentation for purrr::transpose suggested it might only work for two-column data frames. But purrr::transpose seems fine with the new three-column data frame.
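
For anyone skimming, a tiny illustration (toy values of my own, not taken from the code above) of what purrr::transpose() does here: it turns a list of quoted columns into a list of rows, each ready to be collapsed into a "(v1, v2, ...)" tuple.

purrr::transpose(list(id = c("1", "2"), letter = c("'a'", "'b'")))
#> Roughly: list(list(id = "1", letter = "'a'"),
#>               list(id = "2", letter = "'b'"))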

iangow commented 3 years ago

@krlmlr

So it's really worth specializing here for Postgres; for other databases it's unlikely to be useful for tables of more than 100 or so cells. Should we discuss these findings upstream?

What would be the natural upstream place to take this? If the most recent VALUES-based approach works for several implementations, then perhaps it belongs at the DBI level; if not, perhaps in RPostgres. I'd guess that the relevant use case (i.e., read-only database connections) simply isn't relevant to SQLite, so an approach that covers PostgreSQL, MySQL, Oracle, and a few others may be general enough.

I don't know the details of the relationship between dplyr::copy_to() and DBI::dbWriteTable (e.g., does the former call the latter?). So, focusing on DBI::dbWriteTable, one approach would be to have an argument (e.g., read_only) that calls a version of the function above when TRUE. (A more user-friendly approach might automatically detect that a connection is read-only and take care of this for the user, but there doesn't seem to be existing functionality to do this.)

krlmlr commented 3 years ago

Thanks. Confirming that this also works for SQL Server.

I'd say this should live in dbplyr; happy to host it in dm for now.

We could add a method for SQLite that just calls copy_to(). The default copy_to() calls dbWriteTable(); we could also implement a DBI method that does this (but that would duplicate the logic for SQLite). I'm not sure.
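
A rough sketch of that dispatch idea (the generic df_to_db and its methods are hypothetical, not an existing API; the default method assumes the VALUES-based df_to_pg() from the earlier comment):

df_to_db <- function(df, conn, ...) UseMethod("df_to_db", conn)

# Default: build an inline query so the connection never needs write access.
df_to_db.default <- function(df, conn, ...) {
  df_to_pg(df, conn)
}

# SQLite: an in-memory or file database is usually writable anyway,
# so just fall back to copy_to().
df_to_db.SQLiteConnection <- function(df, conn, ...) {
  dplyr::copy_to(conn, df, temporary = TRUE, ...)
}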

iangow commented 3 years ago

Caveat to the following: I am just a user, not a developer.

Alternatively, perhaps this could be implemented (for now) in RPostgres via an extra argument (say, read_only) to dbWriteTable being TRUE. This is a bit like copy = TRUE, which seems to be in RPostgres::dbWriteTable but not in DBI::dbWriteTable. From a user's perspective, dbWriteTable(..., read_only = TRUE) would give a result not materially different from dbWriteTable(..., temporary = TRUE).

I figure some testing would need to be done (I assume there are tests for the existing RPostgres::dbWriteTable). For example, do date-times get handled properly? Perhaps this is easiest done (for now) against a specific implementation. I'm happy to adapt the tests (i.e., check that dbWriteTable(..., read_only = TRUE) gives the same results as dbWriteTable as currently implemented).

krlmlr commented 3 years ago

Thanks. We might indeed consider an sqlInlineTable() in DBI. I'm not sure it's a good fit for dbWriteTable().
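
A purely hypothetical sketch of what such a helper might look like (sqlInlineTable() does not exist in DBI; the name is the proposal above, and the body just reuses the VALUES construction from this thread):

sqlInlineTable <- function(con, df) {
  names <- paste(DBI::dbQuoteIdentifier(con, names(df)), collapse = ", ")
  rows <- purrr::transpose(lapply(df, DBI::dbQuoteLiteral, conn = con))
  rows <- vapply(rows, function(x) paste0("(", paste(x, collapse = ", "), ")"), "")
  DBI::SQL(paste0("SELECT * FROM (VALUES ", paste(rows, collapse = ", "),
                  ") AS t (", names, ")"))
}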