@etiennebr any suggestions here?
Just out of curiosity, did you try
sf::st_write_db(con, pts_50k, table = "pts_50k", binary = FALSE)
?
> system.time(
+   sf::st_write_db(con, pts_50k, table = "pts_50k", binary = FALSE)
+ )
   user  system elapsed
 64.070   1.941 503.445
I agree with you. Have a look at 23f00d7 (there's a PR, #279). It takes a similar approach to minimizing db calls. I saw some good improvements, and we can certainly do even better. Also, RPostgres might provide an interface to create the table in a single call, but I think it's not ready yet.
I was thinking about setting up a benchmark to try multiple approaches. I believe current test coverage for databases doesn't cover all cases yet, but it still provides a good test, so approaches that pass should all be considered for performance. Also, we've discussed a database-specific approach, which should open the door to database-specific solutions.
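As a sketch only of such a harness (benchmark_writes, bench_tbl, and the example call are placeholders, not existing code), each candidate write function could be timed on the same connection and dataset:

library(DBI)

benchmark_writes <- function(con, pts, approaches) {
  sapply(approaches, function(f) {
    if (dbExistsTable(con, "bench_tbl"))
      dbRemoveTable(con, "bench_tbl")                        # start from a clean table
    system.time(f(con, pts, table = "bench_tbl"))[["elapsed"]]
  })
}

# e.g. benchmark_writes(con, pts_50k,
#        list(current = sf::st_write_db, fast = st_write_db_fast))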
Glad to see there has been some progress on the performance issue. I regularly deal with 500,000+ point datasets, and db write performance is a common bottleneck. I'm happy to contribute where I can and help test/benchmark. I'll give your version of st_write_db a try this week.
I noticed that PR #279 had been merged into master, so I took some time to test with my example.
system.time(
  sf::st_write_db(con, pts_50k, table = "pts_50k")
)
   user  system elapsed
  0.953   0.073 458.849
Note that, at least for this example, performance is still slow.
With that in mind, I developed a generalized version of my previous function/example, st_write_db_fast. I had to drop the binary and append options, but all the other capabilities remain. I think append could be added back, but the binary option doesn't work because it isn't supported in RPostgreSQL. And sending all the data via RPostgreSQL-supported data types is where we gain the performance.
st_write_db_fast = function(conn = NULL, obj, table = substitute(obj),
                            geom_name = "geom",
                            ..., overwrite = FALSE,
                            debug = FALSE) {
  DEBUG = function(x) { if (debug) message(x); x }
  if (is.null(conn))
    stop("No connection provided")
  table <- sf:::schema_table(table)
  # drop an existing table only when overwrite = TRUE
  if (sf:::db_exists(conn, table)) {
    if (overwrite) {
      DBI::dbGetQuery(conn, DEBUG(paste("drop table",
        paste(table, collapse = "."), ";")))
    } else {
      stop("Table ", paste(table, collapse = "."),
           " exists already, use overwrite = TRUE",
           call. = FALSE)
    }
  }
  # serialize the geometry to EWKT and transfer it as an ordinary character column
  df = obj
  df$geom_ewkt <- sf::st_as_text(df[[attr(df, "sf_column")]], EWKT = TRUE)
  df[[attr(df, "sf_column")]] = NULL
  class(df) = "data.frame"
  DBI::dbWriteTable(conn, table, sf:::clean_columns(df, factorsAsCharacter = TRUE), ...)
  # add the geometry column, then convert the EWKT strings in a single UPDATE
  query = DEBUG(paste0("ALTER TABLE ", paste(table, collapse = "."),
    " ADD COLUMN ", geom_name, " geometry(Geometry)"))
  DBI::dbExecute(conn, query)
  query = DEBUG(paste0("UPDATE ", paste(table, collapse = "."),
    " set ", geom_name, " = ST_GeomFromEWKT(geom_ewkt);
    ALTER TABLE ", paste(table, collapse = "."),
    " DROP COLUMN IF EXISTS geom_ewkt"))
  DBI::dbExecute(conn, query)
}
Running the previous example with this function:
system.time(
  st_write_db_fast(con, pts_50k, table = "pts_50k")
)
   user  system elapsed
  2.912   0.026   3.363
Given the performance improvements, I'd like to see this somehow incorporated into sf. I'm not sure whether a separate st_write_db_fast function or merging into st_write_db with a fast = TRUE option is the better approach. Either way, I'm happy to flesh it out, test further, and create a pull request.
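For illustration only, a hypothetical sketch of how a fast = TRUE option could dispatch between the two code paths; the wrapper name st_write_db2 is made up, and the existing row-wise path is simply delegated to the current sf::st_write_db:

# Hypothetical sketch, not actual sf code: dispatch on a fast = TRUE option.
st_write_db2 = function(conn, obj, table = substitute(obj), ..., fast = FALSE) {
  if (fast)
    st_write_db_fast(conn, obj, table = table, ...)  # bulk dbWriteTable + single UPDATE
  else
    sf::st_write_db(conn, obj, table = table, ...)   # existing row-wise implementation
}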
Running the following script
library(sf)
demo(nc, ask = FALSE, echo = FALSE)
g = st_geometry(nc)
n = 100
fn1 = function() {
  for (i in 1:n)
    x = st_as_text(g)
}
fn2 = function() {
  for (i in 1:n)
    x = st_as_binary(g, hex = TRUE) # hex = TRUE now works
}
system.time(fn1())
system.time(fn2())
I see that in sf, writing to binary is 10× as fast as writing to text. In addition, with WKT transfer it's quite likely you won't get clean round trips R->PG->R or the other way around. So although I'm happy about the speedups you see, I'm quite curious where the bottleneck in the binary transfer is.
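As a hedged illustration (not part of the thread) of why text transfer may not round-trip exactly: WKT is a decimal rendering of the coordinates, while WKB carries the underlying doubles.

library(sf)
p = st_sfc(st_point(c(1/3, 2/3)))
via_wkt = st_as_sfc(st_as_text(p))    # text round trip
via_wkb = st_as_sfc(st_as_binary(p))  # binary round trip
st_coordinates(p) - st_coordinates(via_wkt)  # may be non-zero, depending on digits printed
st_coordinates(p) - st_coordinates(via_wkb)  # expected to be exactly zero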
Think I got it; after updating sf, please try out the following on your dataset, and compare binary = FALSE with binary = TRUE:
st_write_db_fast = function(conn = NULL, obj, table = substitute(obj),
                            geom_name = "geometry",
                            ..., overwrite = FALSE,
                            debug = FALSE, binary = TRUE) {
  DEBUG = function(x) { if (debug) message(x); x }
  if (is.null(conn))
    stop("No connection provided")
  table <- sf:::schema_table(table)
  if (sf:::db_exists(conn, table)) {
    if (overwrite) {
      DBI::dbGetQuery(conn, DEBUG(paste("drop table",
        paste(table, collapse = "."), ";")))
    } else {
      stop("Table ", paste(table, collapse = "."),
           " exists already, use overwrite = TRUE",
           call. = FALSE)
    }
  }
  # serialize the geometry as hex-encoded EWKB or as EWKT, in a plain character column
  df = obj
  df$geom_ewk <- if (binary) {
    sf::st_as_binary(df[[attr(df, "sf_column")]], EWKB = TRUE, hex = TRUE)
  } else
    sf::st_as_text(df[[attr(df, "sf_column")]], EWKT = TRUE)
  df[[attr(df, "sf_column")]] = NULL
  class(df) = "data.frame"
  DBI::dbWriteTable(conn, table, sf:::clean_columns(df, factorsAsCharacter = TRUE), ...)
  query = DEBUG(paste0("ALTER TABLE ", paste(table, collapse = "."),
    " ADD COLUMN ", geom_name, " geometry(Geometry)"))
  DBI::dbExecute(conn, query)
  # make sure backslashes in the transferred strings are not treated as escapes
  DBI::dbExecute(conn, DEBUG("set standard_conforming_strings = on;"))
  # convert the serialized column to geometry in a single UPDATE, then drop it
  query = if (binary)
    DEBUG(paste0("UPDATE ", paste(table, collapse = "."),
      " set ", geom_name, " = ST_GeomFromEWKB(cast(geom_ewk as geometry));
      ALTER TABLE ", paste(table, collapse = "."),
      " DROP COLUMN IF EXISTS geom_ewk"))
  else
    DEBUG(paste0("UPDATE ", paste(table, collapse = "."),
      " set ", geom_name, " = ST_GeomFromEWKT(geom_ewk);
      ALTER TABLE ", paste(table, collapse = "."),
      " DROP COLUMN IF EXISTS geom_ewk"))
  DBI::dbExecute(conn, query)
}
I now added st_write_db_fast to sf, corrected a few things, and added some stuff from st_write_db (srid and geom type and dimension handling). @etiennebr: a replacement candidate for st_write_db?
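For illustration only (this is not the sf source, and pts_50k, the column name geometry, and EPSG:4326 are placeholders), the kind of srid/type handling referred to can be expressed as a single PostGIS statement that constrains the new column once the SRID and geometry type of the data are known:

library(DBI)
dbExecute(con, paste0(
  "ALTER TABLE pts_50k ALTER COLUMN geometry TYPE geometry(Point, 4326) ",
  "USING ST_SetSRID(geometry, 4326)"))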
I think so! With a PR we could see if it passes the unit tests.
Test results from my machine/setup:
> system.time(
+   sf::st_write_db_fast(con, pts_50k, table = "pts_50k")
+ )
   user  system elapsed
  0.520   0.010   1.659
> dbDisconnect(con)
And just to make sure it scales well, with 500K points:
> system.time(
+   sf::st_write_db_fast(con, pts_500k, table = "pts_500k")
+ )
   user  system elapsed
  6.892   0.122  12.164
> dbDisconnect(con)
I also agree this could be a replacement candidate for st_write_db, and I'm excited to see that we can incorporate some speed improvements. Thanks @edzer for sorting out the binary option.
Still, st_write (using GDAL) is faster than st_write_db in my experiment (2.3 instead of 3.2 seconds for 155,000 points).
Close this issue?
Even within the previous sp/rgdal framework, I've always found the time it takes to write large spatial objects to PostGIS frustratingly slow. 1-25k records is tolerable, but 100k+ requires more patience than I normally have. I had hoped sf's adoption of simple features would improve performance, but it appears there is still a non-linear degradation in write performance as the size of the dataset increases. As an example:
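The original example isn't reproduced in this extract; the following sketch shows the kind of timing test being referred to, with placeholder connection details and 50,000 random points standing in for the actual dataset:

library(sf)
library(DBI)
library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname = "testdb")  # placeholder connection details
pts_50k <- st_as_sf(
  data.frame(id = 1:50000, x = runif(50000), y = runif(50000)),
  coords = c("x", "y"), crs = 4326)
system.time(
  sf::st_write_db(con, pts_50k, table = "pts_50k")
)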
Near as I can tell, the performance hit comes from the row-wise UPDATE calls setting the geometry for each row in the table. There may be some very good reasons for this approach, but the approach sketched below, which relies on the PostGIS ST_GeomFromEWKT function (ST_GeomFromText or ST_GeomFromWKB could also be used), achieves the same result in significantly less time.
I generally accept that folks who know more than me have spent more time thinking about these things than I have, so I won't be surprised to learn there are problems with this approach that I haven't considered. But given the increasing size of spatial datasets, it seems worth the effort to explore performance improvements like this. I'd be happy to flesh out a more generalized approach that could either replace st_write_db or serve as an alternative.
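A minimal sketch of that two-step idea, assuming a DBI connection con, an sf object pts, and an unqualified table name (the full, generalized version appears earlier in this thread as st_write_db_fast):

library(sf)
library(DBI)

write_via_ewkt <- function(con, pts, table) {
  df <- pts
  df$geom_ewkt <- st_as_text(st_geometry(pts), EWKT = TRUE)  # geometry as EWKT text
  st_geometry(df) <- NULL                                    # keep only plain columns
  dbWriteTable(con, table, df)                               # one bulk transfer
  dbExecute(con, paste0("ALTER TABLE ", table,
    " ADD COLUMN geometry geometry(Geometry)"))
  dbExecute(con, paste0("UPDATE ", table,                    # one server-side conversion
    " SET geometry = ST_GeomFromEWKT(geom_ewkt); ",
    "ALTER TABLE ", table, " DROP COLUMN geom_ewkt"))
}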