ropensci / opencage

:globe_with_meridians: R package for the OpenCage API -- both forward and reverse geocoding :globe_with_meridians:
https://docs.ropensci.org/opencage

Vectorize input to geocode multiple values at once #27

Closed jessesadler closed 6 years ago

jessesadler commented 6 years ago

In my experience, geocoding with opencage has produced useful and accurate results. However, queries currently have to be made one at a time. I propose to vectorize the input, as discussed in issue #19. Vectorizing the input also raises the question of what form the results should take. A list of lists of results does not seem optimal to me, and I think it would be better for the output to be a tbl_df. Vectorization of the input and the form of the output could be dealt with in one function or separately. I am working on a pull request for this using purrr.
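To make the proposal concrete, here is a minimal sketch of what vectorizing a one-query-at-a-time geocoder could look like. It is not the package's code: `geocode_one()` and `geocode_many()` are hypothetical names, the lookup table stands in for a real API call, and base R's `lapply()` is used where the planned pull request would use `purrr::map()`. The sketch also assumes every query finds a match.

```r
# Hypothetical stand-in for a single-query geocoder.
# A real implementation would call the API; this one uses a fixed lookup table.
geocode_one <- function(placename) {
  known <- list(
    Paris  = c(lat = 48.8566, lng = 2.3522),
    Berlin = c(lat = 52.5200, lng = 13.4050)
  )
  coords <- known[[placename]]
  data.frame(
    query = placename,
    lat = coords[["lat"]],
    lng = coords[["lng"]],
    stringsAsFactors = FALSE
  )
}

# Vectorized wrapper: one call per element, then one data frame for all results.
geocode_many <- function(placenames) {
  results <- lapply(placenames, geocode_one)
  do.call(rbind, results)
}

geocoded <- geocode_many(c("Paris", "Berlin"))
print(geocoded)
```

The result is a single data frame with one row per query rather than a list of lists, which is the output shape argued for above.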

maelle commented 6 years ago

Fantastic, thanks!

I think it's even OK to have breaking changes for the two current functions and to return results directly as a data.frame, with the rest as its attributes. I'll try to change that soon unless this PR tackles it.

dpprdan commented 6 years ago

👍 I have also been thinking about this lately. I have made some rough notes, which might be helpful regarding some potential pitfalls.

I had two functions each for forward and reverse geocoding in mind. A "basic" function that just returns the API results, only parsed by jsonlite::fromJSON (like rmapzen::mz_search() for example). That way, the user can also convert to other data structures/formats (sf/sp). This could also have the urlonly option I mention in the notes for example. And then a oc_forward_df function that returns a dataframe (and possibly bind_cols it to the source dataframe). So basically split up the retrieve and the formatting part.
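A rough sketch of that split between retrieval and formatting, under stated assumptions: `fetch_forward()`, `format_forward()` and `forward_df()` are hypothetical names, and the stubbed response is only loosely shaped like a parsed JSON reply (a `results` list whose entries carry `formatted` and `geometry`). No request is actually sent.

```r
# Retrieval step (hypothetical stub): returns a raw list, loosely shaped
# like a parsed JSON response. A real version would query the API.
fetch_forward <- function(placename) {
  list(
    results = list(
      list(
        formatted = paste0(placename, ", somewhere"),
        geometry = list(lat = 1.5, lng = 2.5)
      )
    )
  )
}

# Formatting step: turn one raw response into a one-row data frame.
format_forward <- function(raw) {
  first <- raw$results[[1]]
  data.frame(
    formatted = first$formatted,
    lat = first$geometry$lat,
    lng = first$geometry$lng,
    stringsAsFactors = FALSE
  )
}

# Data-frame variant: retrieve, format, then bind to the source data frame.
forward_df <- function(data, column) {
  raw <- lapply(data[[column]], fetch_forward)
  formatted <- do.call(rbind, lapply(raw, format_forward))
  cbind(data, formatted)
}

places <- data.frame(id = 1:2, place = c("Hamburg", "Lyon"),
                     stringsAsFactors = FALSE)
print(forward_df(places, "place"))
```

Keeping the raw list available from the retrieval step is what would let a user convert to other structures (sf/sp) without going through the data frame.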

While we are at breaking changes, change the opencage_ prefix to oc_? We could also just deprecate the existing functions (build on the new "basic" function and just keep the existing formatting), so as not to break existing code (I wouldn't mind though).

maelle commented 6 years ago

Good point on returning results directly. One could even return them as JSON, in which case I'd add a short jqr example.

Changing the prefix is an excellent idea! One could even let the old functions in the package for at least a while.

dpprdan commented 6 years ago

Don't really know jqr but yeah that sounds handy (if only for me to get to know more about it 😉 ). rmapzen::mz_search() also returns a (Geo)JSON btw (or it says it does, I haven't checked).

"let the old functions in the package": That's what I meant (plus add a deprecate message).
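For illustration, keeping an old function while steering users to its replacement could look roughly like this. The function names are hypothetical stand-ins (not the package's actual code); `.Deprecated()` is the base R helper that emits the deprecation warning.

```r
# Hypothetical new-style function (stand-in for a renamed oc_* function).
new_forward_stub <- function(placename) {
  list(query = placename)
}

# Hypothetical old-style function: kept in the package, warns on use,
# and delegates to the new function so existing code keeps working.
old_forward_stub <- function(placename) {
  .Deprecated("new_forward_stub")
  new_forward_stub(placename)
}

# Calling the old function still returns a result, plus a deprecation warning.
result <- suppressWarnings(old_forward_stub("Paris"))
print(result$query)
```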

freyfogle commented 6 years ago

Hi everyone. Ed from OpenCage here. Great to see you all working to improve this software, thank you.

Though I am not an R programmer, I have a few comments that may be helpful.

That is not to suggest that we do not see value in a batch method, but I just wanted to share why we took this explicit decision.

Let me close by thanking you all once again for your contributions to this module. Please don't hesitate to ping me or my colleagues if you have any questions. If you like we can gladly do a blog post once the new version is ready.

happy geocoding, Ed

maelle commented 6 years ago

Good points, thanks. Yes the calls would still need to happen sequentially.

I do remember the "rate" being absent for unlimited accounts because there used to be a bug in this package because of that. 😸

dpprdan commented 6 years ago

@freyfogle Thanks for your comments! I guess batch geocoding actually is a misnomer here. What I meant is "easily geocode multiple addresses and add the returned data to the data(frame)". This is probably the most frequent use case of geocoding in R, IMO (see e.g. Jesse's excellent post on Geocoding in R, which uses the Google API, though). But like @maelle said, the actual geocoding would still happen sequentially.

Regarding rate-limiting: I mention that in my notes already as a further improvement for the package.
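As a sketch of what sequential, rate-limited calls might look like (an assumption about one possible approach, not what the notes or the package specify): `geocode_sequentially()` is a hypothetical helper, and `min_interval` is a placeholder value rather than an actual API limit.

```r
# Hypothetical helper: call geocode_fun once per place, in order, and make
# sure at least min_interval seconds pass between the start of two calls.
geocode_sequentially <- function(placenames, geocode_fun, min_interval = 1) {
  results <- vector("list", length(placenames))
  for (i in seq_along(placenames)) {
    started <- Sys.time()
    results[[i]] <- geocode_fun(placenames[[i]])
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    # No need to wait after the last request.
    if (i < length(placenames) && elapsed < min_interval) {
      Sys.sleep(min_interval - elapsed)
    }
  }
  results
}

# Usage with a trivial stand-in for the geocoder and a short interval.
out <- geocode_sequentially(c("a", "b", "c"), toupper, min_interval = 0.05)
print(unlist(out))
```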

That makes me think: we should probably add a user-agent to the package, so that you or your colleagues can reach out to us if anything does not go as intended (or can even tell that the package is causing it).

freyfogle commented 6 years ago

yes, user agent is a great idea. Many thanks.

jessesadler commented 6 years ago

The approach that I have is actually very similar to @dpprdan's. It wraps purrr::map around the current functions. This leaves the current JSON output style untouched, though that could obviously be changed, as @maelle notes. I do not know anything about parsing JSON, but my thought would be to output data frames, which can easily be converted to either sp or sf.

I have the same problem that @dpprdan has in his notes on geocoding places that do not match, but I am working on a way around this.
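One general way to handle places that return no match, shown here only as a sketch and not as the workaround being developed: give every query exactly one row and fill it with `NA` when nothing comes back, so that the rows still bind into one data frame. `format_results()` and the sample data are hypothetical.

```r
# Hypothetical formatter: always returns exactly one row per query,
# with NA coordinates when the result list is empty.
format_results <- function(placename, results) {
  if (length(results) == 0) {
    return(data.frame(query = placename, lat = NA_real_, lng = NA_real_,
                      stringsAsFactors = FALSE))
  }
  first <- results[[1]]
  data.frame(query = placename, lat = first$lat, lng = first$lng,
             stringsAsFactors = FALSE)
}

# Made-up raw results: one place with a match, one without.
raw <- list(
  Paris = list(list(lat = 48.8566, lng = 2.3522)),
  Nowhereville = list()
)

rows <- Map(format_results, names(raw), raw)
combined <- do.call(rbind, rows)
rownames(combined) <- NULL
print(combined)
```

Because unmatched places still occupy a row, the output keeps the same length and order as the input, which matters when the results are bound back onto a source data frame.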

dpprdan commented 6 years ago

Great! I'll try to code up a first version of a new basic opencage_forward tonight, so you could build on that then?

So everyone's fine with adding purrr as a dependency? I certainly am. @maelle Shall we shorten all "opencage" prefixes to "oc" except for the current opencage_forward and opencage_reverse? And could you add a develop branch to which we can merge the PRs and test before merging to master? I think we might have more than just one or two PRs.

maelle commented 6 years ago

I invited you both as collaborators, @dpprdan @jessesadler; this way you can create the dev branch.

Yes let's add purrr as a dependency!

maelle commented 6 years ago

And yes regarding the renaming.

jessesadler commented 6 years ago

Thanks @maelle. Wrote up a quick gist to show the current state of my functions so you can see the implementation. https://gist.github.com/jessesadler/0aa2f4b9e067fbb391f502fdef3c4049

jessesadler commented 6 years ago

Some changes to the opencage_parse function within utils.R will make vectorization of the geocoding functions easier and deal with the problem of how to handle unidentified places.

dpprdan commented 6 years ago

So, I did the renaming and refactoring, so we now have oc_forward and oc_reverse, which just return a list. I modified opencage_forward and opencage_reverse so that they use the new functions but return the same as before (I added an internal opencage_format, which is supposed to be deprecated along with the other two). oc_forward and oc_reverse still need documentation, tests, the urlonly option, rate-limiting, ..., but it's a start.

I tried to push to the main repo, but that did not work. I was able to create the devel branch though, so I created a PR (#29).

That's it from me for today.

maelle commented 6 years ago

Thanks again! I need to look into the access issue. 🤔

jessesadler commented 6 years ago

I opened a new PR (#31) with vectorized geocoding functions for forward and reverse. As I noted in the PR, they build off the old format for parsing the returned JSON that uses lapply. This is a bit at odds with the changes made by @dpprdan, but hopefully this provides an idea of what a solution might look like.

dpprdan commented 6 years ago

I think we've done all this?!