mbtyers / riverdist

River Network Distance Computation and Applications

Add parallel option to xy2segvert #15

Open · colincharles opened 6 years ago

colincharles commented 6 years ago

We've been accumulating data from our telemetry experiment since 2016, and the xy2segvert function has been taking a bit longer with each addition.

I was digging through the function and found that the mapply call was taking increasingly long with more data. I was able to parallelize it by rewriting the mapply call as a parLapply from the parallel package. Thought someone might find this useful.

For a general idea: right now we've got ~1.75M data points in our system (Red River - Manitoba, North Dakota, Minnesota), and on my machine the xy2segvert function takes about 6.5 minutes to run. Switching it to parallel brings that down to ~50 seconds with 8 cores and ~3.25 minutes with 2 cores.
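
The general rewrite pattern, shown here as a toy sketch unrelated to riverdist, is to turn the paired mapply call into an index-wise parLapply over a socket cluster:

library(parallel)

# Toy example of the rewrite: mapply(f, x, y) becomes an index-wise
# parLapply, with the inputs exported to the worker processes.
f <- function(a, b) sqrt(a^2 + b^2)
x <- runif(10); y <- runif(10)

serial <- mapply(f, x, y)

cl <- makeCluster(2)
clusterExport(cl, c("f", "x", "y"))
par_res <- unlist(parLapply(cl, seq_along(x), function(i) f(x[i], y[i])))
stopCluster(cl)

all.equal(serial, par_res)  # TRUE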

Here are the changes I made to the function.

xy2segvert_1 <- function (x, y, rivers, parallel = FALSE, no.cores = 2){
  if (!inherits(rivers, "rivernetwork"))
    stop("Argument 'rivers' must be of class 'rivernetwork'.  See help(line2network) for more information.")
  if (any(is.na(x)) | any(is.na(y)) | !is.numeric(x) | !is.numeric(y)) 
    stop("Missing or non-numeric coordinates.")

  # new code: check that the parallel package is available before using it
  if(parallel && !requireNamespace("parallel", quietly = TRUE))
    stop("parallel package not installed - install.packages('parallel')")

  lengthlength <- length(unlist(rivers$lines))/2
  whichseg <- whichvert <- allx <- ally <- rep(NA, lengthlength)
  istart <- 1
  for (i in 1:length(rivers$lines)) {
    linelength <- dim(rivers$lines[[i]])[1]
    whichseg[istart:(istart + linelength - 1)] <- i
    whichvert[istart:(istart + linelength - 1)] <- 1:linelength
    allx[istart:(istart + linelength - 1)] <- rivers$lines[[i]][, 1]
    ally[istart:(istart + linelength - 1)] <- rivers$lines[[i]][, 2]
    istart <- istart + linelength
  }
  pdist2 <- function(p1, p2x, p2y) {
    dist <- sqrt((p1[1] - p2x)^2 + (p1[2] - p2y)^2)
    return(dist)
  }

  # add a parallel option when you have a lot of data
  if(parallel){
    df1 <- data.frame(x = x, y = y)
    cl <- parallel::makeCluster(no.cores)
    # ship the point data and the distance helper to the worker processes
    parallel::clusterExport(cl = cl, varlist = c("df1", "pdist2", "allx", "ally"),
                            envir = environment())

    min.i <- parallel::parLapply(cl, seq_len(nrow(df1)),
                                 function(i) {
                                   # index of the river vertex nearest to point i
                                   which.min(pdist2(c(df1$x[i], df1$y[i]), allx, ally))[1]
                                 })
    parallel::stopCluster(cl)
    min.i <- unlist(min.i)
  } else { # the code used in the original function
    min.i <- mapply(function(x, y) which.min(pdist2(c(x, y),
                                                    allx, ally))[1], x, y)
  }

  seg <- whichseg[min.i]
  vert <- whichvert[min.i]
  snapdist <- sqrt((x - allx[min.i])^2 + (y - ally[min.i])^2)
  out <- data.frame(seg, vert, snapdist)
  return(out)
}
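
For reference, a call might look like this; riverdata and telemetry are hypothetical placeholder objects, not part of this patch, and the coordinates must be in the same projection as the river network:

# Hypothetical usage with assumed objects:
snapped <- xy2segvert_1(x = telemetry$x, y = telemetry$y,
                        rivers = riverdata, parallel = TRUE, no.cores = 8)
head(snapped)  # seg, vert, and snapping distance for each point
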
jsta commented 6 years ago

One performance-boosting alternative I considered looking into was converting some of these functions to use Rcpp. It is supposed to be very good in these tight-loop scenarios.
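
For illustration, here is a minimal sketch of what that could look like for the nearest-vertex search, assuming Rcpp is installed; nearest_vertex is a hypothetical name, not riverdist code:

library(Rcpp)

# Sketch only: a C++ replacement for the which.min(pdist2(...)) loop.
cppFunction('
IntegerVector nearest_vertex(NumericVector x, NumericVector y,
                             NumericVector allx, NumericVector ally) {
  int n = x.size(), m = allx.size();
  IntegerVector out(n);
  for (int i = 0; i < n; ++i) {
    double best = R_PosInf;
    int besti = 0;
    for (int j = 0; j < m; ++j) {
      double dx = x[i] - allx[j], dy = y[i] - ally[j];
      double d2 = dx*dx + dy*dy;  // squared distance suffices for the argmin
      if (d2 < best) { best = d2; besti = j; }
    }
    out[i] = besti + 1;           // 1-based, to index whichseg/whichvert
  }
  return out;
}')

# min.i <- nearest_vertex(x, y, allx, ally)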

mbtyers commented 1 year ago

Hello, and (many) apologies for missing these messages somehow!! These are great suggestions and I will work to incorporate them.