Vectorize input to geocode multiple values at once

jessesadler commented 6 years ago

In my experience, geocoding with opencage has produced useful and accurate results. However, queries have to be made one at a time right now. I propose to vectorize the input as discussed in issue #19. Vectorization of input also brings up the form of the results. A list of lists of results does not seem optimal to me, and I think it would be better to have output be in the form of a tbl_df. Vectorization of input and form of output could be dealt with in one function or separately. I am working on a pull request for this using purrr.

maelle commented 6 years ago

Fantastic, thanks!

I think it's even ok to have breaking changes for the two current functions, and to return results as a data.frame directly with the rest as its attributes, I'll try to change it soon unless this PR tackles this.

dpprdan commented 6 years ago

👍 I have also been thinking about this lately. I have made some rough notes, which might be helpful regarding some potential pitfalls.

I had two functions each for forward and reverse geocoding in mind. A "basic" function that just returns the API results, only parsed by jsonlite::fromJSON (like rmapzen::mz_search() for example). That way, the user can also convert to other data structures/formats (sf/sp). This could also have the urlonly option I mention in the notes for example. And then a oc_forward_df function that returns a dataframe (and possibly bind_cols it to the source dataframe). So basically split up the retrieve and the formatting part.

While we are at breaking changes, change the opencage_ prefix to oc_? We could also just deprecate the existing functions (build on the new "basic" function and just keep the existing formatting), so as not to break existing code (I wouldn't mind though).

maelle commented 6 years ago

Good point on return results directly. One could even returns them as JSON, in which case I'd add a short jqr example.

Changing the prefix is an excellent idea! One could even let the old functions in the package for at least a while.

dpprdan commented 6 years ago

Don't really know jqr but yeah that sounds handy (if only for me to get to know more about it 😉 ). rmapzen::mz_search() also returns a (Geo)JSON btw (or it says it does, I haven't checked).

"let the old functions in the package": That's what I meant (plus add a deprecate message).

freyfogle commented 6 years ago

Hi everyone. Ed from OpenCage here. Great to see you all working to improve this software, thank you.

Though I am not an R programmer, I have a few comments that may be helpful.

First up, please know that our decision not to build native support for "batch geocoding" is absolutely intentional. One of the key design goals of our service is ease of use for the developer, on the theory that developer time is much, much more valuable than computational time. By sticking to the single query, single response model the service is much easier to understand and debug.

That is not to suggest that we do not see value in a batch method, but I just wanted to share why we took this explicit decision.

Secondly, if someone fires up a batch service that starts hammering our servers, we will of course have to block them. That's the only way we can ensure the service remains usable for everyone (not least paying customers, without whom the service would not exist). So, while I have every faith you are designing this with the best of intentions, please ensure it does not become a tool for unintentional denial of service attacks. Please do enforce the rate limits of 1 query per second for free trial accounts. Otherwise we will have to. Those who need to geocode more quickly are absolutely welcome to - once they become paying customers.
As a developer myself, I definitely vote in favour of the url-only method to ease debugging.
Final point. Please be aware the response we return differs slightly for free-trial accounts and paying customer accounts. We see that this sometimes trips up library developers (though I believe this module has always handled it well). Specifically the rate information is missing for paying customers, as they have no limits. Please see the docs.

Let me close by thanking you all once again for your contributions to this module. Please don't hesitate to ping me or my colleagues if you have any questions. If you like we can gladly do a blog post once the new version is ready.

happy geocoding, Ed

maelle commented 6 years ago

Good points, thanks. Yes the calls would still need to happen sequentially.

I do remember the "rate" being absent for unlimited accounts because there used to be a bug in this package because of that. 😸

dpprdan commented 6 years ago

@freyfogle Thanks for your comments! I guess batch geocoding actually is a misnomer here. What I meant is that "easily geocode multiple addresses and add the returned data to the data(frame)". This is probably the most frequent use case of geocoding in R IMO (see e.g. Jesse's excellent post on Geocoding in R, using the Google API there, though). But like @maelle said, the actual geocoding would still happen sequentially.

Regarding rate-limiting: I mention that in my notes already as a further improvement for the package.

That makes me think: We probably should add a user-agent to the package, so you or your colleagues can reach out to us, if anything should not go as intended (or even notice that the package is causing it).

freyfogle commented 6 years ago

yes, user agent is a great idea. Many thanks.

jessesadler commented 6 years ago

The approach that I have is actually very similar to @dpprdan. It uses purrr::map around the current functions. This does not mess with the current JSON output style, though this could obviously be done as @maelle notes. I do not know anything about parsing JSON, but my thought would be to output data frames, which can be easily converted to either sp or sf.

I have the same problem that @dpprdan has in his notes on geocoding places that do not match, but I am working on a way around this.

dpprdan commented 6 years ago

Great! I'll try to code up a first version of a new basic opencage_forward, tonight, so you could build on that then?

So everyone's fine with adding purrr as a dependency? I certainly am. @maelle Shall we shorten all "opencage" prefixes to "oc" except for the current opencage_forward and opencage_reverse? And could you add a develop branch to which we can merge the PRs and test before merging to master? I think we might have more than just one or two PRs.

maelle commented 6 years ago

I invited you both as collaborators @dpprdan @jesssadler , this way you can create the dev branch.

Yes let's add purrr as a dependency!

maelle commented 6 years ago

And yes reg renaming

jessesadler commented 6 years ago

Thanks @maelle. Wrote up a quick gist to show the current state of my functions so you can see the implementation. https://gist.github.com/jessesadler/0aa2f4b9e067fbb391f502fdef3c4049

jessesadler commented 6 years ago

Some changes in opencage_parse function within utils.R will make vectorization of geocoding function easier and deal with problem of how to handle unidentified places.

Could use dplyr::mutate_if( is.factor, as.character) when converting to tibble. This will get rid of warnings when binding rows.
Instead of making results NULL when no location is returned, could make a data.frame with query = as.character(temp$request$query). This makes it possible to bind data frames of results to those that do not contain results.

dpprdan commented 6 years ago

So, did the renaming and refactoring, so that we have oc_forward and oc_reverse now, that just return a list. I modified opencage_forward and opencage_reverse, so that they use the new functions but return the same as before (added an internal opencage_format, which is supposed to be deprecated with the other two). oc_forward and oc_reverse still need documentation, tests, the urlonly option, rate-limiting, ..., but it's a start.

I tried to push to the main repo, but that did not work. I was able to create the devel branch though, so I created a PR (#29).

That's it from me for today.

maelle commented 6 years ago

Thanks again! I need to look into the access issue. 🤔

jessesadler commented 6 years ago

I opened a new PR (#31) with vectorized geocoding functions for forward and reverse. As I noted in the PR, they build off the old format for parsing the returned JSON that uses lapply. This is a bit at odds with the changes made by @dpprdan, but hopefully this provides an idea of what a solution might look like.

dpprdan commented 6 years ago

I think we've done all this?!

ropensci / opencage

Vectorize input to geocode multiple values at once #27