rodazuero / gmapsdistance

GNU General Public License v3.0

Trying to find the distances on a Massive dataset #26

Open gndoshi opened 7 years ago

gndoshi commented 7 years ago

Hey guys, First of all, thank you for this lovely package. It's great and does the job fantastically! Great job!

Second, I'm trying to find the distances between the lat-lon pairs of taxi rides in Manhattan. As you might have guessed, this is a large data set, with over 1.4 million observations. It has 4 variables: the pick-up and drop-off latitudes and longitudes, each stored as a separate variable.

So, I'm trying to find the distances traveled by the taxis in all trips. I tried using:

train <- mutate(train,
                distance = gmapsdistance(origin = paste(train$pickup_latitude, "+", train$pickup_longitude, sep = ""),
                                         destination = paste(train$dropoff_latitude, "+", train$dropoff_longitude, sep = ""),
                                         mode = "driving")[[2]])

but it says that the vector is too large (several GB), which means I can't compute it all at once. Then I did what I would have preferred not to:

calc_distance <- NULL

for (i in 1:dim(nyctaxi)[1]) {
  calc_distance[i] <- gmapsdistance(origin = paste(nyctaxi$pickup_latitude[i], "+", nyctaxi$pickup_longitude[i], sep = ""),
                                    destination = paste(nyctaxi$dropoff_latitude[i], "+", nyctaxi$dropoff_longitude[i], sep = ""),
                                    mode = "driving")[[2]]
}

This gave some distances, but at seemingly random iterations (sometimes i=42, sometimes i=277, i=176, etc.) it would stop with this message: "Error in data$Time[i] = as(xmlChildren(results$row[[1]])$duration_in_traffic[1]$value[1]$text, : replacement has length zero"

The thing is that it worked perfectly for i=41, or 276, or whichever iteration came right before it stopped, and gave the correct distances.

Is there any way you could help me out? How would you calculate the distances for over 1.4 million rows? I would hugely appreciate any sort of help from you guys. Thank you so much for taking the time!

Best, Gautam Doshi

gndoshi commented 7 years ago

Anyone?? :( :(

diablo312 commented 7 years ago

I am facing the same issue when I run gmapsdistance in a loop. When I run it on individual rows it works perfectly fine. I have to manually restart the loop from where I got the error, which is a painful process. I also tried with a billed API key and still have the same issue. A resolution to this would definitely be helpful.
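One way to avoid restarting the loop by hand is to catch the error for each row and carry on. The following is only a minimal sketch: `safe_distances` and `fake_query` are hypothetical names, and `query_fun` stands in for whatever single-row call is being made (for example a `gmapsdistance(...)[[2]]` call); none of this is part of the package.

```r
# Hypothetical helper: call query_fun(i) for each row index and keep going on errors.
# query_fun is an assumption; in practice it would wrap a single gmapsdistance() call.
safe_distances <- function(n, query_fun) {
  out <- rep(NA_real_, n)
  failed <- integer(0)
  for (i in seq_len(n)) {
    res <- tryCatch(query_fun(i), error = function(e) NULL)
    if (is.null(res) || length(res) != 1) {
      failed <- c(failed, i)  # remember the row so it can be inspected or retried
    } else {
      out[i] <- as.numeric(res)
    }
  }
  list(distance = out, failed = failed)
}

# Stand-in query that fails on row 3, to show the loop continuing past the error.
fake_query <- function(i) {
  if (i == 3) stop("replacement has length zero")
  i * 100
}
result <- safe_distances(5, fake_query)
```

Rows listed in `result$failed` keep `NA` as their distance, so they can be checked for bad input or retried later without losing the rows that succeeded.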

raochaithanya commented 7 years ago

Hello,

I am facing the same issue too. The package works smoothly for sets of about 50 addresses, but when I increase the array size to just 100, I get this error:

Error in data$Time[i] = as(xmlChildren(results$row[[1]])$duration_in_traffic[1]$value[1]$text, : replacement has length zero

If someone has insights on how to fix this, I would really appreciate it.

Best Regards, Chaithanya

rodazuero commented 7 years ago

First of all, sorry about the late reply. Please note that there is a limit on the number of queries you can make. For instance, you can only do 2,500 queries per day, and only 100 elements per second. This is not a restriction of the package; it comes from the limits of the Google Maps Distance Matrix API. More information can be found here: https://developers.google.com/maps/documentation/distance-matrix/usage-limits

You can enable billing if you are willing to pay Google for additional usage. I hope this answers your questions. Let us know if it doesn't.

Thanks for using the package and for your comments. Once again, sorry about the late reply, but we've been busy at work.
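To put the daily limit in perspective for the original question, here is a rough back-of-the-envelope calculation. It takes the 2,500-per-day figure quoted above at face value and assumes each of the 1.4 million rows costs exactly one query; both are simplifying assumptions.

```r
n_rows <- 1.4e6      # observations mentioned in the original question
daily_quota <- 2500  # free daily limit quoted above (assumed: one query per row)
days_needed <- ceiling(n_rows / daily_quota)
print(days_needed)
```

Under those assumptions the free quota alone would take well over a year to cover the data set, which is why billing comes up as the practical option for data of this size.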

raochaithanya commented 7 years ago

Hello, thanks for getting back to us.

The error was not caused by the number of elements but by the data itself. When I passed larger arrays, some of the entries contained garbage, such as "13937+Monroe's+Business+Park+Tampa+FL+33635", or even spaces, which resulted in an error.

Thanks again! Chaithanya
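A small cleaning step before calling the package can catch this kind of input. This is only a sketch: `clean_location` is a hypothetical helper, and it assumes that stray whitespace and characters such as apostrophes are what break the request, as in the example above.

```r
# Hypothetical helper: normalise location strings before passing them on.
clean_location <- function(x) {
  x <- trimws(x)                      # drop leading and trailing spaces
  x <- gsub("[[:space:]]+", "+", x)   # join words with "+", as used in this thread
  gsub("[^[:alnum:]+.,-]", "", x)     # remove anything else, e.g. apostrophes
}

cleaned <- clean_location(" 13937 Monroe's Business Park Tampa FL 33635 ")
```

Coordinate strings such as "40.7128+-74.0060" pass through unchanged, since digits, dots, plus and minus signs are all kept.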

diablo312 commented 6 years ago

Thanks for responding. I did use the package with a billed API - still the same issue.

Demetrio92 commented 6 years ago

Someone posted somewhere that we are not using distancematrix properly. I'll keep this one open for now, and do some performance testing later.

@diablo312 your nrow(df)?

Demetrio92 commented 6 years ago

related #44

marfcg commented 5 years ago

Hi everyone! I know this is somewhat late in the game, but I thought it would be worth sharing with the community how I have handled the main usage limitations imposed by the Google Distance Matrix API, given the current state of the gmapsdistance package:

So let's say we have two sets of locations, origin and destination, populated whatever way you like, where at least one of them has more than 25 locations -- limitation 1 --, or where length(origin)*length(destination) > 100 -- limitations 2 & 3. Here goes a snippet of how to split your client-side requests in a way that guarantees compliance with the rules above -- I'm not claiming that this is the "optimal" solution, just "a" solution using the current state of development of gmapsdistance ;) -- :

library(gmapsdistance)
set.api.key("YOUR_API_KEY")  # replace with your own API key

# `origin` and `destination` are vectors of locations defined beforehand.
# Prepare auxiliary variables with the size of each set:
num.orig <- length(origin)
num.dest <- length(destination)

# Create base data frames for storage (long format, one row per origin-destination pair):
time.matrix <- data.frame(or=character(), de=character(), Time=numeric())
distance.matrix <- data.frame(or=character(), de=character(), Distance=numeric())
status.matrix <- data.frame(or=character(), de=character(), Status=character())

# Distance Matrix API has request limits that must be respected:
# - Max of 25 origins OR destinations per request;
# - Max of 100 elements per request;
# - Max of 1000 elements per second
# Therefore, we must set a loop to send requests accordingly.
# ceiling() avoids an extra, invalid chunk when a length is an exact multiple of the chunk size.
for (i in seq_len(ceiling(num.orig / 25))) {
  # Split list of origins in groups of 25, up to max.
  i.start <- (i - 1) * 25 + 1
  i.end <- min(i * 25, num.orig)
  orig.tmp <- origin[i.start:i.end]
  for (j in seq_len(ceiling(num.dest / 4))) {
    # Split list of destinations in groups of 4 (25*4=100 elements per request), up to max.
    if (j %% 10 == 0) {
      # Every 10th iteration, sleep for 1 second to ensure that we are not sending
      # more than 1000 elements/sec.
      # This should be implemented in a smarter way, taking into account time elapsed
      # and not number of requests alone...
      Sys.sleep(1)
    }
    j.start <- (j - 1) * 4 + 1
    j.end <- min(j * 4, num.dest)
    dest.tmp <- destination[j.start:j.end]
    # `res` rather than `matrix`, to avoid masking base::matrix().
    res <- gmapsdistance(origin=orig.tmp, destination=dest.tmp, mode='driving', shape="long", combinations="all")
    # Append this chunk's results; base rbind() is used so no pipe package is needed.
    time.matrix <- rbind(time.matrix, res$Time)
    distance.matrix <- rbind(distance.matrix, res$Distance)
    status.matrix <- rbind(status.matrix, res$Status)
    print(paste0('i=', i, ' ; j=', j))
  }
  # Since we are not counting final chunk sizes, let's be on the safe side and throw away
  # another sec to avoid crossing the 1000 elements/sec rule:
  Sys.sleep(1)
}
marfcg commented 5 years ago

Ideally, the implementation above (well, an optimized version of it) should be done "behind the scenes" by gmapsdistance itself. For instance, upon function call, check whether either array has more than 25 entries, or whether their product is greater than 100 and combinations=="all". If either condition is true, split the dataset to comply with Google's predefined usage limits.
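The splitting step on its own could look something like the sketch below. `chunk_indices` is a hypothetical helper, not something gmapsdistance provides; it only computes which positions go into each request, using the chunk sizes of 25 origins and 4 destinations discussed above.

```r
# Hypothetical helper: split positions 1..n into consecutive groups of at most `size`.
chunk_indices <- function(n, size) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / size)))
}

# Example: 60 origins and 10 destinations.
orig.chunks <- chunk_indices(60, 25)  # groups of 25, 25 and 10 positions
dest.chunks <- chunk_indices(10, 4)   # groups of 4, 4 and 2 positions

# Largest request that any origin/destination chunk pair can produce:
max.elements <- max(lengths(orig.chunks)) * max(lengths(dest.chunks))
```

Each pair of chunks then maps to one request, for example `origin[orig.chunks[[1]]]` with `destination[dest.chunks[[2]]]`, and no pair exceeds 100 elements.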

cuomopeter commented 4 years ago

Aaaaaaannnnyyy chance you'd be able to write this in Python?

KalyEmbTech commented 1 year ago

Hello, @marfcg many thanks for this, it saved me the time to write down this 👍 @cuomopeter here is the python version I have used in my case :) create_distance_matrix.txt