hypertidy / geodist

Ultra lightweight, ultra fast calculation of geo distances
https://hypertidy.github.io/geodist/

Start Rcpp branch #16

Open mpadge opened 6 years ago

mpadge commented 6 years ago

Just to compare compilation times for Dirk

njtierney commented 6 years ago

If this helps, here is the current time comparison of haversine distances for geodist and the distance functions in maxcovr: distance_matrix_cpp, which calls spherical_distance_cpp under the hood.

library(geodist)
library(maxcovr)
n <- 500
x <- cbind (-10 + 20 * runif (n), -10 + 20 * runif (n))
y <- cbind (-10 + 20 * runif (2 * n), -10 + 20 * runif (2 * n))
colnames (x) <- colnames (y) <- c ("x", "y")

mb1 <- microbenchmark::microbenchmark(
    "geodist" = geodist(x,y, measure = "haversine"),
    "maxcovr" = distance_matrix_cpp(y, x)
)

mb1
#> Unit: milliseconds
#>     expr      min       lq     mean  median       uq      max neval
#>  geodist 25.62657 28.69873 30.82551 30.6244 32.65784 46.65979   100
#>  maxcovr 31.89214 34.29572 36.62601 36.2090 38.45201 44.87097   100
microbenchmark::autoplot.microbenchmark(mb1)
#> Loading required namespace: ggplot2
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Created on 2018-07-20 by the reprex package (v0.2.0).

As a side note, it looks like maxcovr has a slightly different implementation of the haversine formula to geodist, so their results are not identical.

mkuehn10 commented 5 years ago

I just recently discovered this package as I sometimes need to calculate largish distance matrices for spatial data (to examine the closeness of "customers" to "sites"). I've written some hacky code using Rcpp and RcppParallel which seems to work fairly well and quickly, and I did some comparisons to the geodist implementation. I don't know if there's any interest in trying to clean that up and add it to this package or to some other package. I couldn't really find anything when I was trying to solve this problem several months ago, so I just hacked something together. I use it now in several Shiny apps to show interactive maps that calculate the distance matrices behind the scenes. And now I'm looking at the maxcovr package you made (@njtierney), which seems to be something I would have liked to have seen earlier. :)

This isn't really reproducible since you don't have access to my rcpp_parallel_distm_C function (I'm pretty good at naming things...). I'd be happy to eventually share that code, but it can certainly be cleaned up and improved, so I'm slightly embarrassed to provide it at the moment!

library(geodist)
library(maxcovr)

set.seed(10)
n <- 500
x <- cbind (-10 + 20 * runif (n), -10 + 20 * runif (n))
y <- cbind (-10 + 20 * runif (2 * n), -10 + 20 * runif (2 * n))
colnames (x) <- colnames (y) <- c ("x", "y")

mb1 <- 
  microbenchmark::microbenchmark(
  geodist = geodist::geodist(x, y, measure = "haversine"),
  maxcovr = maxcovr::distance_matrix_cpp(y, x),
  parallel = rcpp_parallel_distm_C(x, y) * 1609.34)

mb1
Unit: milliseconds
     expr     min        lq      mean    median        uq      max neval cld
  geodist 72.2865  75.36175  79.15762  77.79530  81.18445 108.2466   100  b 
  maxcovr 99.4450 103.09460 108.17821 106.27860 110.95350 136.5662   100   c
 parallel 17.3640  17.53460  20.31995  17.80955  18.29345 232.5047   100 a 
geodist_res <- geodist::geodist(x, y, measure = "haversine")
maxcovr_res <- maxcovr::distance_matrix_cpp(y, x)
parallel_res <- rcpp_parallel_distm_C(x, y) * 1609.34
all.equal(geodist_res, parallel_res)
[1] TRUE
all.equal(geodist_res, maxcovr_res)
[1] "Mean relative difference: 0.002942463"
all.equal(maxcovr_res, parallel_res)
[1] "Mean relative difference: 0.002945609"

mkuehn10 commented 5 years ago

Here is a proof of concept with the C++ code included: https://github.com/mkuehn10/pargeodist

njtierney commented 5 years ago

Thanks for sharing @mkuehn10 ! I'll point to another thread I wrote about here: https://github.com/njtierney/maxcovr/issues/20

mpadge commented 5 years ago

Thanks @mkuehn10 for the input here and there. My initial thought in designing this package was that distance calculation would generally be a small part of larger workflows which would easily be parallelized on the R side of things, so there was no need to parallelize the actual distance calculations within this package. An additional design aim of geodist was to be dependency free, which among other things means no C++ (or at least, no direct need for such here). The parallel issue was one I gave some thought to, because that's obviously where the greatest efficiency gains can be very cheaply harvested. It is effectively impossible to implement parallel options in C, so extending would require C++. That alone need not require Rcpp, but RcppParallel is by far the easiest way, so the package would then indeed require both and no longer be dependency free.

That said, the recent benchmarks of @symbolixAU suggest a simple Rcpp implementation might actually be faster, so I've got some investigating to do regardless. Those investigations are now in the gist linked above, and show that the C code is indeed faster, but that the R interface could be improved to enable direct access to those speed-ups. The cheapest and easiest way to ensure that this package maintains my claim of "ultra fast" is indeed to parallelize, and so I shall give it some further thought and report back.

mpadge commented 5 years ago

@mkuehn10 I've just put a comment in @symbolixAU's benchmarks that shows how geodist can fairly easily be given around a 28% speed boost (sorry, my initial claim of around 3 times was a mistake), which would reduce your above time of 79 down to around 61, compared with your parallel time of 20. Imagine what parallel on top of that might do!

Note also that the official WGS84 definitions bundled in geodist here are very precise, and that use of less precise values, as in your code, will also generate (artificial) speed-ups. Comparisons can only be fairly made at equivalent precision.
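
To illustrate that point, here is a minimal sketch (the hav() helper is illustrative only, not geodist's actual code; the radius constants are standard reference values): a haversine distance scales linearly with whatever earth radius is assumed, so two implementations that differ only in that constant will disagree by a fixed relative amount, independent of any speed differences.

hav <- function(lon1, lat1, lon2, lat2, r) {
    # plain haversine with an explicit earth radius r
    to_rad <- pi / 180
    d_lon <- (lon2 - lon1) * to_rad
    d_lat <- (lat2 - lat1) * to_rad
    a <- sin(d_lat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(d_lon / 2)^2
    2 * r * atan2(sqrt(a), sqrt(1 - a))
}
hav(0, 0, 1, 1, r = 6378137) / hav(0, 0, 1, 1, r = 6371000)
#> [1] 1.00112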

mpadge commented 5 years ago

This is now officially a non-trivial issue, because the benchmark clearly shows that the C code is indeed quite a bit faster than the C++ code, and so ...

The Question

Should C code be converted to (in this case) demonstrably less efficient C++ form in order to allow efficiency to be regained through parallelization? I'm tempted to answer "No", simply because the extra layer of complexity introduced is quite significant, but then again it would allow quite a boost for large calculations, so I'm open to other possibilities ...

njtierney commented 5 years ago

I think that the extra gains are worthwhile since they scale well, but perhaps it should be tested in either (1) another branch, or (2) a separate package. The upshot of a separate package is that you can then keep geodist as a dependency-free package.

njtierney commented 5 years ago

I guess the only other thing I can think of re performance is whether there might be any gain in writing it in Fortran, but from what I understand the interface to Fortran isn't exactly amazing?

mpadge commented 5 years ago

Now that would be an interesting idea ... but nah, I do not want to dig up my Fortran skills (I had to do a tiny bit of that back in my maths degree days). I doubt that anything would be faster than C. If I were to do it in anything else, it'd be Rust, but even then ... nah, easier just to leave it. Speed-ups via #22 are already pretty significant in some use cases. I'll still nevertheless do this Rcpp branch to investigate the original purpose, which was compilation times.

mkuehn10 commented 5 years ago

Here is a basic package that I created; it implements only the distance-matrix functionality for now, and only the haversine and Vincenty distances. I aligned the calculations to the way geodist does them, and I now get identical matrices back from the geodist function and the parallel version.

https://github.com/mkuehn10/geodistpar

asardaes commented 5 years ago

I came across this repo through a StackOverflow question, and I figured I'd give my 2 cents regarding parallelization. I've had some experience with this in my dtwclust package, because I needed cross-distance matrices for distance measures targeted at time-series data.

@mkuehn10 if I'm not mistaken, the way you're calling parallelFor is not ideal, because by specifying a grain equal to 1, a new thread is created for each distance calculation. It's much better to perform many calculations in each thread call. Also, since you don't have the Makevars files, you won't be using TBB on Windows.

How you parallelize calculations is closely related to the type of output you want. In my package I implemented pairwise, symmetric and asymmetric calculations. For the symmetric case, I didn't parallelize based on rows or columns; I considered the whole lower triangle as one long vector and divided it into chunks for each thread. I have different distance-matrix fillers for different operations. You can see an example of this logic in this answer I gave on SO.
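
To make that concrete: treating the lower triangle (without the diagonal) as one 0-based vector of length n(n-1)/2, each linear index can be converted back to an (i, j) pair in closed form. A quick R sketch of that conversion (s2d_r is my illustrative name; it mirrors the s2d() helper in the C++ code further down), checked against the column-major ordering of R's lower.tri():

s2d_r <- function(k, n) {
  # 1-based linear index k -> 1-based (row, column) of the lower triangle
  k0 <- k - 1
  j0 <- n - 2 - floor(sqrt(-8 * k0 + 4 * n * (n - 1) - 7) / 2 - 0.5)
  i0 <- k0 + j0 + 1 - n * (n - 1) / 2 + (n - j0) * (n - j0 - 1) / 2
  c(i0 + 1, j0 + 1)
}
n <- 5
ref <- which(lower.tri(matrix(0, n, n)), arr.ind = TRUE) # column-major pairs
all(t(sapply(seq_len(n * (n - 1) / 2), s2d_r, n = n)) == ref)
#> [1] TRUE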

For my use case, parallelization was very useful, and scaled even linearly in the number of cores for large matrices and some distances. My package includes some benchmarks, which you could see here if you like.

AFAIK, you could also parallelize with something like OpenMP to avoid C++ dependencies, but I have no experience with that. I believe the code can stay the same: it will be parallelized when the user has OpenMP support, and will stay sequential otherwise.

mkuehn10 commented 5 years ago

I have tried adjusting the grain size, and it only slows down the process. I thought it would actually help, but it did not seem to. I also think 1 is the default, and that's only left over from when I was playing around with that setting. Otherwise, I wouldn't even be putting that in the call to parallelFor.

https://rcppcore.github.io/RcppParallel/#tuning

For the distance matrix, you don't need larger chunks unless I'm mistaken.

I'm no expert, so everything you said could be theoretically true, but I have only seen the speedups by doing what I've currently done. I currently have this method in use in several Shiny apps that I've deployed at work and it definitely makes a difference. I started working on a project over a year ago that required some large distance calculations, and I couldn't find anything that helped, so I tinkered around and wrote my own. I finally decided to just share what I had since it seems to work very quickly.

I am trying to convert this into a more formal package, so if you have specific feedback, please feel free to leave it here: https://github.com/mkuehn10/geodistpar

asardaes commented 5 years ago

I might have time to send you a quick PR. For now, here's a quick comparison.

File haversine.cpp:

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel,RcppThread)]]

#include <cstddef> // size_t
#include <math.h> // sin, cos, sqrt, atan2, pow
#include <vector>

#include <RcppThread.h>
#include <Rcpp.h>
#include <RcppParallel.h>

using namespace std;
using namespace Rcpp;
using namespace RcppParallel;

// single to double indices for lower triangular of matrices without diagonal
void s2d(const size_t id, const size_t nrow, size_t& i, size_t& j) {
  // tmp to avoid ambiguity
  double tmp = static_cast<double>(-8 * id + 4 * nrow * (nrow - 1) - 7);
  j = nrow - 2 - static_cast<size_t>(sqrt(tmp) / 2 - 0.5);
  i = id + j + 1 - nrow * (nrow - 1) / 2 + (nrow - j) * ((nrow - j) - 1) / 2;
}

class HaversineCalculator : public Worker
{
public:
  HaversineCalculator(const NumericVector& lon, const NumericVector& lat, NumericMatrix& out)
    : lon_(lon)
    , lat_(lat)
    , cos_lat_(lon.length())
    , out_(out)
  {
    // terms for distance calculation
    for (size_t i = 0; i < cos_lat_.size(); i++) {
      cos_lat_[i] = cos(lat_[i] * 3.1415926535897 / 180);
    }
  }

  void operator()(size_t begin, size_t end) {
    double to_rad = 3.1415926535897 / 180;

    size_t i, j;
    for (size_t ind = begin; ind < end; ind++) {
      if (RcppThread::isInterrupted(ind % static_cast<int>(1e5) == 0)) return;

      s2d(ind, lon_.length(), i, j);

      // haversine distance
      double d_lon = (lon_[j] - lon_[i]) * to_rad;
      double d_lat = (lat_[j] - lat_[i]) * to_rad;
      double d_hav = pow(sin(d_lat / 2), 2) + cos_lat_[i] * cos_lat_[j] * pow(sin(d_lon / 2), 2);
      if (d_hav > 1) d_hav = 1;
      d_hav = 2 * atan2(sqrt(d_hav), sqrt(1 - d_hav)) * 6378137;

      out_(i,j) = d_hav;
      out_(j,i) = d_hav;
    }
  }

private:
  const RVector<double> lon_;
  const RVector<double> lat_;
  vector<double> cos_lat_;
  RMatrix<double> out_;
};

// [[Rcpp::export]]
SEXP haversine(const DataFrame& input, NumericMatrix output, const int nthreads) {
  NumericVector lon = input["lon"];
  NumericVector lat = input["lat"];

  HaversineCalculator hc(lon, lat, output);

  int size = lon.length() * (lon.length() - 1) / 2;
  int grain = size / nthreads;
  RcppParallel::parallelFor(0, size, hc, grain);
  RcppThread::checkUserInterrupt();

  return R_NilValue;
}

Comparison with 4 cores:

library(Rcpp)
library(RcppParallel)
library(RcppThread)
library(geodistpar)

Sys.setenv(PKG_CXXFLAGS = "-DRCPP_PARALLEL_USE_TBB=1")
sourceCpp("haversine.cpp")

df <- data.frame(lon=runif(1e3, -90, 90), lat=runif(1e3, -90, 90))
mat <- as.matrix(df)

library(microbenchmark)

foo <- function(df) {
  out <- matrix(0.0, nrow(df), nrow(df))
  haversine(df, out, RcppParallel::defaultNumThreads())
  out
}

microbenchmark(times = 30L,
               ans1 <- geodistpar(mat, mat, "haversine"),
               ans2 <- foo(df))
Unit: milliseconds
                                      expr      min       lq     mean   median       uq      max neval
 ans1 <- geodistpar(mat, mat, "haversine") 54.48880 54.87569 59.30028 55.91926 60.62092 93.68114    30
                           ans2 <- foo(df) 19.29622 19.37289 21.23298 19.63739 23.43039 26.90884    30

all.equal(ans1, ans2)
TRUE
mkuehn10 commented 5 years ago

Wow. Thanks. Definitely a bit beyond my capability or understanding at this time, but I will definitely take a look.

asardaes commented 5 years ago

Some additional considerations to keep in mind when deciding whether to use RcppParallel: it entails a dependency on Rcpp itself, and it also requires GNU make as a declared system requirement.

mpadge commented 5 years ago

Thanks @asardaes for the useful insights. I concur with the Rcpp concerns, and really would prefer to keep this package as lightweight as possible, which would not be easy if it were to use RcppParallel. In addition to the GNU make issue, I've also got some permanent memory-leak issues in another package from RcppParallel/TBB, and these have been there for years and show no signs of going away.

My current inkling is that parallelisation can and should be handled outside the core routines of geodist. This package includes local optimisations for the way various components of distance calculations are handled for most metrics (but not for geodesic). Neglecting the advantages of these, large parallel jobs can be implemented by wrapping a parallel::clusterApply or similar around the desired calculations, as sketched below. If multi-core calculations really do make a huge difference for any given job, then this kind of simple wrapping will be effective. My current assumption is that this will not be the case for most common usage of this package, and single-thread usage will suffice, leaving the package in its current light and clean state.
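
For concreteness, here is a minimal sketch of that approach (par_geodist is a hypothetical wrapper name, not part of geodist): chunk the rows of x across cores, compute each block of the matrix with an ordinary geodist() call, and bind the blocks back together.

library(parallel)

par_geodist <- function(x, y, n_cores = 2L) {
    cl <- makeCluster(n_cores)
    on.exit(stopCluster(cl))
    # one chunk of rows of x per core; each worker computes its block of rows
    idx <- splitIndices(nrow(x), n_cores)
    blocks <- parLapply(cl, idx, function(i, x, y)
        geodist::geodist(x[i, , drop = FALSE], y, measure = "haversine"),
        x = x, y = y)
    # reassemble the full nrow(x)-by-nrow(y) distance matrix
    do.call(rbind, blocks)
}

The result should match a single geodist(x, y, measure = "haversine") call, with the cluster start-up overhead meaning such a wrapper only pays off for large jobs.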

In this light, I think the real resolution of this issue will come through comparing (1) a package-external parallelisation wrapper with (2) equivalent internal RcppParallel wrappers with (3) current code. My assumption above then amounts to assuming that the first 2 of those 3 will be similar, effectively obviating any need to internalise the parallelisation within this package. I'll try to find time for that comparison within the next couple of weeks. (And note in closing that it'll only make a difference for huge jobs of at least N ~ O(10^6).)

At the very least, I suspect that this issue ought to prompt a vignette on how to effectively parallelise geodist calculations, and I am happy for it to remain open until either that is done, or until we do ultimately decide to expand the package dependencies and incorporate internal parallelisation. Thanks for the constructive input!