r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/

Any way to speedup "dissolve" ( group_by() %>% summarise() ) with large datasets #661

Closed monsieurxu closed 5 years ago

monsieurxu commented 6 years ago

Having read previous issue discussions, I'm interested in using group_by() %>% summarise() to "dissolve" geometries. But with some of the data I use (a .shp of approx. 950 MB), it is so slow that I'd like to find a way to speed up these operations. Here are the code and data (it is open data) I tested with. Preparation:

setwd("Downloads/")
library(sf)
library(tidyverse)
devtools::install_github("jimhester/archive")
library(archive)
url<-"https://www.data.gouv.fr/s/resources/monreseaumobile/Cartes-enrichies_voix_sms_2G_2018_01_01.7z"
dest<-"Cartes-enrichies_voix_sms_2G_2018_01_01.7z"
download.file(url,dest)
archive_extract(dest)
orange2G<-st_read(dsn="Orange/OF_couv_2G_2017-12-31.shp", crs=2154)

When I try to "dissolve/simplify" the geometry with

orange2G_simple <- orange2G %>%
  group_by(operateur, date, techno, usage, dept, niveau) %>%
  summarise()

it is very, very slow, and after some hours I killed the process. For comparison, I could do it with ArcGIS in approximately half an hour.

So I decided to test on a smaller region (in France, a département):

orange2G_51 <- orange2G %>% filter(dept == "51")
system.time(orange2G_51 %>% group_by(operateur, date, techno, usage, dept, niveau) %>% summarise())

After some minutes I could verify that this works:

utilisateur   système    écoulé 
    999.161    21.117  1020.202

(these are the French-locale labels for user / system / elapsed)

But, as there are approximately 95 départements, I find it slow overall in comparison. Is there a way to speed up these operations? (Parallel processing? Other approaches?)

edzer commented 6 years ago

sf uses a single thread; we did some multi-core experiments, but so far with unclear conclusions, and reproducible examples are welcome. Could you please edit your post so that the executable parts are correct R code?

As for the comparison with ArcGIS: please specify how many cores the computer has, and how many were being used by ArcGIS during this half hour; regrettably, that is about all that can be observed about what this black-box software does. A comparison with e.g. PostGIS might be more insightful.
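For reference, a rough sketch of what such a PostGIS comparison could look like from R — the connection details, database name, table name and geometry column name are assumptions, not something tested here:

library(sf)
library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "gis")   # assumed local PostGIS database

st_write(orange2G, dsn = con, layer = "orange2g")          # push the polygons to PostGIS
dissolved <- st_read(con, query = "
  SELECT dept, niveau, ST_Union(geometry) AS geometry
  FROM orange2g
  GROUP BY dept, niveau")                                  # dissolve done server-side

dbDisconnect(con)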

edzer commented 6 years ago

I am seeing

> download.file(url,dest)
trying URL 'https://www.data.gouv.fr/s/resources/monreseaumobile/Cartes-enrichies_voix_sms_2G_2018_01_01.7z'
Content type 'application/x-7z-compressed' length 1385849735 bytes (1321.6 MB)
=======================================
downloaded 1032.9 MB

Error in download.file(url, dest) : 
  download from 'https://www.data.gouv.fr/s/resources/monreseaumobile/Cartes-enrichies_voix_sms_2G_2018_01_01.7z' failed
In addition: Warning messages:
1: In download.file(url, dest) :
  downloaded length 1083115964 != reported length 1385849735
2: In download.file(url, dest) :
  URL 'https://www.data.gouv.fr/s/resources/monreseaumobile/Cartes-enrichies_voix_sms_2G_2018_01_01.7z': status was 'Transferred a partial file'

The first time I tried, the file size was slightly different, but very close. Any ideas? Is there a limit on the download size for this service?

monsieurxu commented 6 years ago

1. It does seem that ArcGIS is not using multiple cores (only 8% CPU load). I have some screen captures but have not uploaded them yet; over the next few days I should configure/access a PostGIS server so I can test via PostGIS.

2. As far as I know, there is no download size limit on this site, as it is a French government site meant to promote open data and its reuse. Trying to replicate your problem today, I get the same failure with both R and Safari, which was not the case a few days ago. I'm trying to get more information/confirmation from someone working on this site.

edzer commented 6 years ago

Another thought I had: R does everything in memory, so it is useful if you state the amount of memory you have on the machine you used. If memory is full but swap space is not, everything becomes very slow.
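(For instance, base R can report the in-memory size of the object itself — assuming it is already loaded as orange2G:)

format(object.size(orange2G), units = "auto")   # size of the sf object in RAM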

jannes-m commented 6 years ago

Just as a side note: if you have already used ArcGIS, you can also use it through R with the help of the RPyGeo package. Open-source GIS bridges to R (all providing dissolve geoalgorithms) are rgrass7, RSAGA and RQGIS.

rsbivand commented 6 years ago

group_by takes < 10 s; summarize is the problem, and both make life hard compared to alternatives. If the intermediate product wasn't a tbl or grouped_tbl, things might be easier. There is a NULL group too. My script so far (16 MB):

library(sf)
orange2G <- st_read(dsn = "frtele/Orange/OF_couv_2G_2017-12-31.shp")

# work on the attributes only: drop the geometry from a copy
orange2G1 <- orange2G
st_geometry(orange2G1) <- NULL

# group the row indices by the grouping variables
system.time(res <- tapply(1:nrow(orange2G1), list(orange2G1$operateur, orange2G1$date, orange2G1$techno, orange2G1$usage, orange2G1$dept, orange2G1$niveau), c))
#   user  system elapsed 
#  3.374   0.001   3.386 

# same grouping via aggregate(), keeping one representative row index per group
system.time(res1 <- aggregate(1:nrow(orange2G1), list(orange2G1$operateur, orange2G1$date, orange2G1$techno, orange2G1$usage, orange2G1$dept, orange2G1$niveau), head, n = 1))
#   user  system elapsed 
#  4.672   0.000   4.702 
str(res1)

# drop the empty (NULL) groups and check both groupings agree
res[which(sapply(res, is.null))] <- NULL
all.equal(res1$x, sapply(res, "[", 1))
# [1] TRUE

# union the geometries of each group separately
polys <- st_geometry(orange2G)
out_polys <- vector(mode = "list", length = nrow(res1))
system.time(for (i in seq(along = out_polys)) out_polys[[i]] <- st_union(polys[res[[i]]]))

The final line is running; I guess not doing the union (something like 10,000 polygons in each step) would be possible. I've no idea what the output of 288 groups signifies, and some of the grouping variables seem to be singular. I'll report back (possibly after the weekend) on the union step timings. Are the polygons base-station Dirichlet polygons?

rsbivand commented 6 years ago

Most of the group_by was unnecessary, as operateur, date, techno and usage each have only one value. dept has 96 and niveau has 3. So the unioned polygons are dept by BC, CL or TBC. Is there any meaning in this?
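If those four columns really are constant, the original dissolve should reduce to the two informative grouping variables (an untested simplification of the call above):

orange2G_simple <- orange2G %>%
  group_by(dept, niveau) %>%
  summarise()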

monsieurxu commented 6 years ago

Thank you for your help!

To be precise, this data is mobile coverage from operators, validated by the French regulator Arcep. The three classes CL, BC and TBC are quality classes for voice/SMS coverage. In French, CL ("Couverture Limitée") is "Poor coverage" (outdoor voice/SMS), BC ("Bonne Couverture") is "Good coverage" (in-car voice) and TBC ("Très Bonne Couverture") is "Very good coverage" (indoor voice). The 287/288 polygons (3 quality classes × 95 départements) are what I'd like to obtain, in order to cross this data between operators and with other data (population, Corine Land Cover and more).

By the way, there are other datasets for 3G/4G coverage here, but not yet with quality-class granularity.

monsieurxu commented 6 years ago

group_by takes <10s. summarize is the problem,

I found this too, when I wanted to get the area per quality class (niveau) by département:

system.time({
  orange2G$area <- st_area(orange2G)                   # area of every polygon
  orange2G <- st_set_geometry(orange2G, NULL)          # drop geometry: attribute-only summarise
  synth_by_dept <- orange2G %>%
    group_by(dept, niveau) %>%
    summarise(superficie = sum(area))
  result <- spread(synth_by_dept, niveau, superficie)  # one column per quality class
  View(result)
})

I was obliged to drop the geometry to get an acceptable processing time (approx. 14 seconds).

rsbivand commented 6 years ago

The st_union() step took 35 hours, the numbers of polygons being dissolved were:

> summary(sapply(res, length))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1    4168    7572    8423   11601   36430 

My guess would be that divide and conquer (dissolve contiguous subsets first, then dissolve those, etc.) would work faster. The heuristic would need a graph-splitting step, and the split graph could be dissolved using parallel::mclapply(). rgeos::gUnarySTRtreeQuery() took 0.6 s to find overlapping bounding boxes for the group with 36430 polygons, which would give the basis for building a graph (spdep::poly2nb() took 37 s to build the graph). How to get from there to an "igraph" object is covered in an spdep vignette. Anyone good at graphs (to partition them into contiguous clusters)? A rough sketch of the idea follows below.
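A rough, untested sketch of this strategy, using spdep::n.comp.nb() to split the contiguity graph into connected components rather than going through igraph; polys is a placeholder for one group's (projected) polygons and mc.cores = 4 is arbitrary:

library(sf)
library(spdep)
library(parallel)

nb    <- poly2nb(polys)                 # contiguity graph (the slow step)
comp  <- n.comp.nb(nb)$comp.id          # connected-component id for each polygon
parts <- split(seq_along(comp), comp)   # row indices per contiguous cluster

# union each contiguous cluster in parallel ...
unions <- mclapply(parts,
                   function(idx) st_union(st_geometry(polys)[idx]),
                   mc.cores = 4)

# ... then union the cluster results; these are disjoint, so this last step is cheap
dissolved <- st_union(do.call(c, unions))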

monsieurxu commented 6 years ago

@edzer Did you finally get the data, or do you need a link to a subset of it?

edzer commented 6 years ago

No, it worked out after a few failed attempts.

jeffreyhanson commented 6 years ago

Inspired by @rsbivand's suggestions, I tried putting together a multi-threaded version of the aggregate function that partitions a simple features object into spatially distinct subsets using graph theory so they can be aggregated in parallel. I've uploaded the code to RPubs along with a quick benchmark.

The benchmark shows that my implementation of the parallel aggregate is marginally faster than the aggregate function in sf for a moderately sized data set, but hopefully it's actually much faster for large data sets.

Also, it looks like I might be misunderstanding how rgeos::gUnarySTRtreeQuery works, because some neighbouring geometries don't seem to be dissolved together (e.g. look at Africa in the example). Does anyone have ideas on what I'm doing wrong?

monsieurxu commented 5 years ago

The notification that this issue was closed reminds me to post a URL where ESRI explains, without details, how their software handles large datasets: http://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=Tiled+processing+of+large+datasets

monsieurxu commented 5 years ago

By the way, I recently read two papers which seem interesting for parallel/GPU computing in polygon clipping:

edzer commented 5 years ago

Do they mention any open source implementations?

monsieurxu commented 5 years ago

I have not yet found any (I did not spend much time looking either...), but I think it's worth contacting the authors.

edzer commented 5 years ago

Feel free to report back your findings!

TroutCasey commented 2 years ago

I think this Stack Exchange question provides interesting context for this discussion.

I left a comment on the answer addressing how slow summarize has been for me (below).

Is the sf package using GEOS when you call the summarize() function? I don't understand what is happening under the hood when I do a dissolve using summarize().

COMMENT: I was using dplyr::summarize() on a large (3.5 GB) file with just shy of 1 million polygons to do a dissolve. In the past it took 16 hours. Based on this post, I thought I would try rgeos::gUnaryUnion(). When I downloaded the rgeos package it said it would be retired, so I tried the new geos package (installed from GitHub) and converted my sf object using geos::as_geos_geometry(), then did the dissolve using geos::geos_make_collection() %>% geos::geos_unary_union(), then converted back to sf using sf::st_as_sf(). The dissolve using the geos package took only 80 minutes.
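A minimal sketch of that geos-based workflow, with my_polygons standing in for the projected sf object (not the commenter's actual script):

library(sf)
library(geos)

g         <- as_geos_geometry(my_polygons)   # lightweight GEOS geometry vector
coll      <- geos_make_collection(g)         # one collection holding every polygon
dissolved <- geos_unary_union(coll)          # single unary union = the dissolve
result_sf <- st_as_sf(dissolved)             # back to an sf object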

edzer commented 2 years ago

Do you have a geodetic CRS (long/lat) or projected coordinates? What does sf_use_s2() say?

TroutCasey commented 2 years ago

I am using projected coordinates - ESRI:102003. sf_use_s2() prints TRUE.

Another thing I don't understand is that the sf object in the environment is 3205383328 bytes while the geos geometry produced from converting the sf object to geos using as_geos_geometry() is only 57168024 bytes.

I did look at both dissolved outline files in QGIS - the one produced using summarize() and the one produced using geos_make_collection() %>% geos_unary_union() - and they look identical so I think I am getting the same result.
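(One way to check that more rigorously in R, with res_summarise and res_geos as hypothetical names for the two dissolved results:)

library(sf)

# TRUE if the two dissolved geometries are exactly equal
st_equals(st_geometry(res_summarise), st_geometry(res_geos), sparse = FALSE)

# or: area of the symmetric difference; (near) zero means the outlines match
st_area(st_sym_difference(st_geometry(res_summarise), st_geometry(res_geos)))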

edzer commented 2 years ago

OK, since

> st_is_longlat(st_crs("ESRI:102003"))
[1] FALSE

you're using projected coordinates, so GEOS (not s2).