rspatial / geosphere

R geosphere package
GNU General Public License v3.0
34 stars 7 forks source link

Vectorized distance? #2

Closed rrmn closed 5 years ago

rrmn commented 5 years ago

Hi Robert,

I'm working with large volumes of data where I calculate geodesic distances between spatial points, usually distances from one observation (= row) to the next. Currently, the datasets amounts to ~500 million rows each.

For this type of problem, most threads on Stackoverflow recommend a rowwise (tidyverse) approach with one of geosphere's functions, or trying a simpler algorithm (e.g., by projecting the data).

Due to the need of fast performance (i.e., minutes, not days) and accurate calculation (i.e., at least geodesic) I made a quick & dirty vectorized implementation of your fantastic distGeo function (the inverse problem, that is).

https://github.com/RomanAbashin/distGeo_v

What do you think? Maybe it would make sense to think about incorporating something like this into geosphere? Or did I miss an approach that was already present?

rhijmans commented 5 years ago

Hi Roman, thank you, but distGeo is vectorized, Have you used it like below?

library(geosphere)
f_geo <- function(m) {
 distGeo(m[-nrow(m),], m[-1,])
}

n <- 1000000
m <- cbind(x = runif(n, -180, 180), y = runif(n, -60, 60))
system.time(f_geo(m))
#   user  system elapsed 
#1    1.92    0.04    1.95 

or the simplest (where the above is done automatically)

d <- distGeo(m)

There is a little overhead in distGeo mainly having to do with recycling that you may not need, and could skip, but that part is probably very fast.

rrmn commented 5 years ago

HA! That's pretty cool! Thank you a lot. If you would want to add this to a dataframe / tibble it would then be something like

m$dist <- c(0, distGeo(m))

Right?

And, to put a cherry on top, would you have an idea how to integrate this into a tidyverse pipe — maybe even with named columns?

rhijmans commented 5 years ago

If would be

m$dist <- distGeo(m)

I am not tidy-versed so I cannot help you there.

rrmn commented 5 years ago

The resulting vector is of length = nrow(m)-1, therefore you cannot straight up assign it to a column in a data frame, can you? (Also it does not work with nrow(m) <= 2, but that's probably a tiny limitation)

rhijmans commented 5 years ago

Ah, yes, I am using the github version which returns a vector of length = nrow(m) and catches zero or 1 row cases. R

On Sat, May 25, 2019 at 12:59 PM Roman Abashin notifications@github.com wrote:

The resulting vector is of length = nrow(m)-1, therefore you cannot straight up assign it to a column in a data frame, can you? (Also it does not work with nrow(m) <= 2, but that's probably a tiny limitation)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/rspatial/geosphere/issues/2?email_source=notifications&email_token=ACXTCNPDP47FWFMICPM2GFTPXGLDFA5CNFSM4HPUC3CKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWHYH2A#issuecomment-495944680, or mute the thread https://github.com/notifications/unsubscribe-auth/ACXTCNPSXKOGDDHSYZLRKV3PXGLDFANCNFSM4HPUC3CA .