r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/

sf::st_write_db performance w/ large datasets #285

Closed jmlondon closed 7 years ago

jmlondon commented 7 years ago

Even within the previous sp/rgdal framework, I've always found the time it takes to write large spatial objects to PostGIS frustratingly slow. 1-25k records is tolerable, but 100k+ requires more patience than I normally have. I had hoped sf's adoption of simple features would improve performance, but it appears there is still a non-linear degradation in write performance as the size of the dataset increases.

As an example:

library(sf)
library(RPostgreSQL)

# a 90 x 90 degree polygon to sample points from
x = st_sfc(st_polygon(list(rbind(
  c(0, 0), c(90, 0),
  c(90, 90), c(0, 90), c(0, 0)))),
  crs = st_crs(4326))

pts_1k <- sf::st_sample(x,1000)
d <- data.frame(oid=1:1000)
d$geom <- pts_1k
pts_1k <- sf::st_as_sf(d)

pts_25k <- sf::st_sample(x,25000)
d <- data.frame(oid=1:25000)
d$geom <- pts_25k
pts_25k <- sf::st_as_sf(d)

pts_50k <- sf::st_sample(x,50000)
d <- data.frame(oid=1:50000)
d$geom <- pts_50k
pts_50k <- sf::st_as_sf(d)

pg = dbDriver("PostgreSQL")
con = dbConnect(pg, host = "localhost")

dbExecute(con, 'DROP TABLE IF EXISTS pts_1k')

system.time(
  sf::st_write_db(con, pts_1k, table = "pts_1k")
)
> user  system elapsed 
> 1.214   0.089   1.841 
dbExecute(con, 'DROP TABLE IF EXISTS pts_25k')

system.time(
  sf::st_write_db(con, pts_25k, table = "pts_25k")
)
> user  system elapsed 
> 24.250   0.910 125.484
dbExecute(con, 'DROP TABLE IF EXISTS pts_50k')

system.time(
  sf::st_write_db(con, pts_50k, table = "pts_50k")
)
> user  system elapsed 
> 54.216   2.054 454.383 
dbDisconnect(con)

As near as I can tell, the performance hit comes from the rowwise UPDATE calls setting the geometry for each row in the table. There may be some very good reasons for this approach, but the following code demonstrates another approach that relies on the PostGIS ST_GeomFromEWKT function (ST_GeomFromText or ST_GeomFromWKB would also work) to achieve the same result in significantly less time.
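For context, the rowwise pattern I suspect is the bottleneck looks roughly like this (an illustrative sketch, not the actual sf internals; df and ewkt are placeholders):

# one round trip per feature, so O(n) statements:
for (i in seq_len(nrow(df))) {
  dbExecute(con, sprintf(
    "UPDATE pts SET geom = ST_GeomFromEWKT('%s') WHERE oid = %d",
    ewkt[i], df$oid[i]))
}
# versus a single set-based statement over a staged text column:
# dbExecute(con, "UPDATE pts SET geom = ST_GeomFromEWKT(geom_ewkt)")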

pg = dbDriver("PostgreSQL")
con = dbConnect(pg, host = "localhost")

.st_write_db <- function() {
  df <- pts_50k
  table <- "pts_50k"

  # serialize geometries to EWKT so they travel as an ordinary text column
  df$geom_ewkt <- sf::st_as_text(df$geom, EWKT = TRUE)

  # drop the sf geometry column and demote to a plain data.frame
  df[[attr(df, "sf_column")]] = NULL
  class(df) = "data.frame"

  dbExecute(con, 'DROP TABLE IF EXISTS pts_50k')
  dbWriteTable(con, table, df)
  dbExecute(con,
            "ALTER TABLE pts_50k ADD COLUMN geom geometry(Geometry);"
  )

  # one set-based UPDATE converts all rows, then the staging column is dropped
  dbExecute(con,
            "UPDATE pts_50k SET geom = ST_GeomFromEWKT(geom_ewkt);
             ALTER TABLE pts_50k DROP COLUMN IF EXISTS geom_ewkt"
  )
}

system.time(.st_write_db())
> user  system elapsed 
> 2.502   0.005   2.880 
dbDisconnect(con)
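Before disconnecting, a quick sanity check along these lines (a sketch; it assumes st_read_db is available in this sf version) confirms the rows arrived intact:

chk <- sf::st_read_db(con, table = "pts_50k")
stopifnot(nrow(chk) == nrow(pts_50k))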

I generally accept that folks who know more about this than I do have spent more time thinking about it. So, I won't be surprised to learn there are some problems with this approach that I haven't considered. But, given the increasing size of spatial datasets, it seems worth the effort to explore performance improvements like this. I'd be happy to flesh out a more generalized approach that could either replace st_write_db or serve as an alternative.

edzer commented 7 years ago

@etiennebr any suggestions here?

edzer commented 7 years ago

Just out of curiosity, did you try

sf::st_write_db(con, pts_50k, table = "pts_50k", binary = FALSE)

?

jmlondon commented 7 years ago

> system.time(
+   sf::st_write_db(con, pts_50k, table = "pts_50k", binary = FALSE)
+ )
>   user  system elapsed 
> 64.070   1.941 503.445 

etiennebr commented 7 years ago

I agree with you. Have a look at 23f00d7 (there's a PR, #279). It takes a similar approach to minimize db calls. I saw some good improvements, and we can certainly do even better. Also, RPostgres might provide an interface to create that table in a single call, but I think it's not ready yet.

I was thinking about setting up a benchmark to try multiple approaches. I believe current test coverage for databases doesn't cover all cases yet, but it still provides a good baseline, so approaches that pass should all be considered for performance. Also, we've discussed a database-specific interface, so that should open the door to database-specific solutions.
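A minimal version of such a benchmark could look like this (a sketch only; it assumes the microbenchmark package, a live connection con, and an arbitrary table name):

library(microbenchmark)
microbenchmark(
  text = {
    dbExecute(con, "DROP TABLE IF EXISTS bench_pts")
    sf::st_write_db(con, pts_25k, table = "bench_pts", binary = FALSE)
  },
  binary = {
    dbExecute(con, "DROP TABLE IF EXISTS bench_pts")
    sf::st_write_db(con, pts_25k, table = "bench_pts", binary = TRUE)
  },
  times = 5
)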

jmlondon commented 7 years ago

Glad to see there has been some progress on the performance issue. I'm regularly dealing with 500,000+ point datasets, and db write performance is a common bottleneck. I'm happy to contribute where I can and to help test and benchmark. I'll give your version of st_write_db a try this week.

jmlondon commented 7 years ago

I noticed that PR #279 had been merged into master and so took some time to test with my example.

system.time(
    sf::st_write_db(con, pts_50k, table = "pts_50k")
)
> user  system elapsed 
> 0.953   0.073 458.849 

Note, at least for this example, performance is still slow.

With that in mind, I developed a generalized version of my previous function/example, st_write_db_fast. I had to drop the binary and append options, but all the other capabilities remain. I think append could be added, but the binary option doesn't work because binary transfer isn't supported in RPostgreSQL. And sending all the data via RPostgreSQL-supported data types is where we gain the performance.

st_write_db_fast = function(conn = NULL, obj, table = substitute(obj), 
                            geom_name = "geom",
                            ..., overwrite = FALSE, 
                            debug = FALSE) {
  DEBUG = function(x) { if (debug) message(x); x }
  if (is.null(conn))
    stop("No connection provided")
  table <- sf:::schema_table(table)

  if (sf:::db_exists(conn, table)) {
    if (overwrite) {
      DBI::dbGetQuery(conn, DEBUG(paste("drop table", 
                                        paste(table, collapse = "."), ";")))
    } else {
      stop("Table ", paste(table, collapse = "."), 
           " exists already, use overwrite = TRUE", 
           call. = FALSE)
    }
  }

  # serialize the active geometry column to EWKT, then drop it so the
  # remaining columns transfer as ordinary RPostgreSQL-supported types
  df = obj
  df$geom_ewkt <- sf::st_as_text(df[[attr(df, "sf_column")]], EWKT = TRUE)

  df[[attr(df, "sf_column")]] = NULL
  class(df) = "data.frame"

  dbWriteTable(conn, table, sf:::clean_columns(df, factorsAsCharacter = TRUE), ...)

  query = DEBUG(paste0("ALTER TABLE ", paste(table, collapse = "."),
                       " ADD COLUMN ", geom_name, " geometry(Geometry)"))
  dbExecute(conn, query)

  # one set-based UPDATE converts every row, then the staging column goes away
  query = DEBUG(paste0("UPDATE ", paste(table, collapse = "."),
                 " SET ", geom_name, " = ST_GeomFromEWKT(geom_ewkt);
                 ALTER TABLE ", paste(table, collapse = "."), 
                 " DROP COLUMN IF EXISTS geom_ewkt"))
  dbExecute(conn, query)
}

Running the previous example with this function:

system.time(
    st_write_db_fast(con, pts_50k, table = "pts_50k")
)
> user  system elapsed 
> 2.912   0.026   3.363 

Given the performance improvements, I'd like to see this somehow incorporated into sf. I'm not sure if doing this as a separate st_write_db_fast function or to merge into st_write_db with a fast=TRUE option is the best approach. Either way, I'm happy to flesh out and test further and create a pull request.

edzer commented 7 years ago

Running the following script

library(sf)
demo(nc,ask=F,echo=F)
g = st_geometry(nc)

n = 100
fn1 = function() {
  for (i in 1:n)
    x = st_as_text(g)
}
fn2 = function() {
  for (i in 1:n) 
    x = st_as_binary(g, hex = TRUE) # hex = TRUE now works
}

system.time(fn1())
system.time(fn2())

I see that in sf, writing to binary is 10x as fast as writing to text. In addition, with WKT transfer it's quite likely you won't get clean round trips R->PG->R or the other way around, since coordinates pass through a decimal text representation. So although I'm happy about the speedups you see, I'm quite curious where the bottleneck in the binary transfer is.
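To illustrate the round-trip concern (a small sketch; the number of digits st_as_text emits depends on the session):

p    <- st_sfc(st_point(c(1/3, 2/3)))
txt  <- st_as_text(p)     # coordinates pass through decimal text
back <- st_as_sfc(txt)
# not guaranteed TRUE: decimal text need not preserve every bit of a double
identical(st_coordinates(back), st_coordinates(p))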

edzer commented 7 years ago

Think I got it; after updating sf, please try out the following on your dataset, and compare binary=FALSE with binary=TRUE:

st_write_db_fast = function(conn = NULL, obj, table = substitute(obj),
                            geom_name = "geometry",
                            ..., overwrite = FALSE,
                            debug = FALSE, binary = TRUE) {
  DEBUG = function(x) { if (debug) message(x); x }
  if (is.null(conn))
    stop("No connection provided")
  table <- sf:::schema_table(table)

  if (sf:::db_exists(conn, table)) {
    if (overwrite) {
      DBI::dbGetQuery(conn, DEBUG(paste("drop table",
                                        paste(table, collapse = "."), ";")))
    } else {
      stop("Table ", paste(table, collapse = "."),
           " exists already, use overwrite = TRUE",
           call. = FALSE)
    }
  }

  df = obj
  # serialize the active geometry column; EWKB hex keeps full double precision
  df$geom_ewk <- if (binary) {
        sf::st_as_binary(df[[attr(df, "sf_column")]], EWKB = TRUE, hex = TRUE)
    } else
        sf::st_as_text(df[[attr(df, "sf_column")]], EWKT = TRUE)

  df[[attr(df, "sf_column")]] = NULL
  class(df) = "data.frame"
  dbWriteTable(conn, table, sf:::clean_columns(df, factorsAsCharacter = TRUE), ...)

  query = DEBUG(paste0("ALTER TABLE ", paste(table, collapse = "."),
                       " ADD COLUMN ", geom_name," geometry(Geometry)"))
  dbExecute(conn, query)

  dbExecute(conn, DEBUG("set standard_conforming_strings = on;"))

  query = if (binary)
        DEBUG(paste0("UPDATE ", paste(table, collapse = "."),
                 " set ", geom_name," = ST_GeomFromEWKB(cast(geom_ewk as geometry));
                 ALTER TABLE ", paste(table, collapse = "."),
                 " DROP COLUMN IF EXISTS geom_ewk"))
    else
        DEBUG(paste0("UPDATE ", paste(table, collapse = "."),
                 " set ", geom_name," = ST_GeomFromEWKT(geom_ewk);
                 ALTER TABLE ", paste(table, collapse = "."),
                 " DROP COLUMN IF EXISTS geom_ewk"))
  dbExecute(conn, query)
}
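For example (table names here are arbitrary):

system.time(st_write_db_fast(con, pts_50k, table = "pts_50k_bin", binary = TRUE))
system.time(st_write_db_fast(con, pts_50k, table = "pts_50k_txt", binary = FALSE))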

edzer commented 7 years ago

I now added st_write_db_fast to sf, corrected a few things, and added some functionality from st_write_db (SRID, geometry type, and dimension handling). @etiennebr: a replacement candidate for st_write_db?
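The SRID and type handling presumably comes down to SQL along these lines (an assumed sketch, not the exact sf code):

# declare the concrete type and SRID instead of an untyped geometry(Geometry)
dbExecute(con, paste0(
  "ALTER TABLE pts_50k ALTER COLUMN geom TYPE geometry(Point, 4326)",
  " USING ST_SetSRID(geom, 4326)"))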

etiennebr commented 7 years ago

I think so! With a PR we could see if it passes the unit tests.

jmlondon commented 7 years ago

Test results from my machine/setup:

> system.time(
+   sf::st_write_db_fast(con, pts_50k, table = "pts_50k")
+ )
   user  system elapsed 
  0.520   0.010   1.659 
> dbDisconnect(con)

and just to make sure it scales well, with 500K points

> system.time(
+   sf::st_write_db_fast(con, pts_500k, table = "pts_500k")
+ )
   user  system elapsed 
  6.892   0.122  12.164 
> dbDisconnect(con)

I also agree this could be a replacement candidate for st_write_db, and I'm excited to see that we can incorporate some speed improvements. Thanks @edzer for sorting out the binary option.

edzer commented 7 years ago

Still, st_write (using GDAL) is faster than st_write_db in my experiment (2.3 instead of 3.2 seconds for 155,000 points).
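For reference, the GDAL route goes through the PostgreSQL driver with a connection-string dsn (the host and dbname below are placeholders):

sf::st_write(pts_50k, "PG:host=localhost dbname=postgres", layer = "pts_50k_gdal")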

edzer commented 7 years ago

Close this issue?