jessecambon / tidygeocoder

Geocoding Made Easy
https://jessecambon.github.io/tidygeocoder

Local geocoding database? #39

Open mikedolanfliss opened 4 years ago

mikedolanfliss commented 4 years ago

(Ported from another issue):

While I'm suggesting things :), there was once headway in the data-science toolkit (rdstk) on using a local master address database to geocode. That is much more sustainable for big jobs: someone can, for instance, download all geocoded addresses in a state or county and work off that (the way SAS or ArcGIS can). To me that feels like the missing keystone of a complete geocoding package, though it would be a sizable lift. I'll be watching this package, and if there are ways I can contribute (ahem, after COVID, since I'm an overworked epidemiologist at the moment) I'd love to. Again, thanks for the work.

I've at times written my own somewhat lazy/hacky string match against a known census of addresses in a location, but being able to call geo() on a local geodatabase would make this package indispensable for workers with access to local spatial databases but few API credentials (since some of the open-source / free geocoders have much lower capacity than Google or Geocodio).
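For illustration, here is a minimal sketch of that kind of string match. It is written in Python rather than R and is not part of tidygeocoder. The reference table, the abbreviation list, and the names `normalize`, `build_index`, and `local_geocode` are all hypothetical, and the coordinates are made up.

```python
import re

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}

def normalize(address):
    """Lowercase, drop punctuation, collapse whitespace, abbreviate suffixes."""
    tokens = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def build_index(reference):
    """Map each normalized address to its (lat, long) from a local address table."""
    return {normalize(addr): coords for addr, coords in reference.items()}

def local_geocode(addresses, index):
    """Return (lat, long) per address, or None when there is no local match."""
    return [index.get(normalize(a)) for a in addresses]

# Made-up data standing in for a county or state address file.
reference = {
    "1 Main Street, Springfield": (39.80, -89.64),
    "22 Oak Avenue, Springfield": (39.81, -89.65),
}
index = build_index(reference)
results = local_geocode(["1 main st Springfield", "5 Elm Rd, Springfield"], index)
```

Addresses that come back as `None` are the ones that would still need to go to an external geocoder.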

Again, great work. Just logging some ideas for the future!

izahn commented 3 years ago

https://degauss.org/ might be useful, either directly or for inspiration!

ottothecow commented 2 years ago

A similar/related concept is maintaining a local geocoding database consisting of addresses you have already geocoded.

I've rolled my own simple versions of this for my own work, and I've found it very useful in several scenarios:

  1. Using a paid (or rate-limited) service: you don't have to repeat geocoding for addresses that you have previously geocoded. E.g. say you get a new list every week, but sometimes that list has addresses that were also on last week's list.
  2. Testing/modifying code. I tend to avoid building geocoding into programs and instead keep it as a standalone script, because I don't want to hit external APIs every time I rerun a program after a change that doesn't affect the geocoding results. E.g. instead of writing one simple program that geocodes some addresses and plots them on a map, I would write one program that prepares the data for geocoding, one that sends it out for geocoding, and one that reads the results in and plots them. Sometimes this makes the overall program flow awkward: data cleaning that affects the geocoding must be done first, while data cleaning and filtering that only affect the plot have to be separated out and delayed until the end.
  3. Large amounts of repeated geocoding, in unison with scenario 1. If you need to re-run large lists of addresses that overlap heavily with previous work, then even if the API is free or you aren't worried about cost, it will be significantly faster to pull in locally cached results before falling back to an external geocoder.

The cache could store the timestamp of each geocode and have a parameter for how often to refresh the result. Ditto for an option to force replacing cached results.

What I don't know is the most efficient way to do this within the R/tidygeocoder world. When I've rolled my own, I have typically made the API requests directly and simply stored the results in whatever format is most convenient in the language I am using. Caching was based on exact inputs (so "1 Main Street" and "1 Main St." each get their own result), but that takes care of most of the repetition issues.
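As a rough sketch of the idea (in Python with SQLite, not R, and not anything tidygeocoder provides), an exact-input cache with a timestamp, a refresh age, and a force option might look like the following. The table layout, the names `open_cache` and `cached_geocode`, and the parameters `max_age`, `force`, and `now` are all assumptions for illustration; `fake_geocoder` stands in for a real API call and returns made-up coordinates.

```python
import sqlite3
import time

def open_cache(path=":memory:"):
    """Open (or create) the cache; pass a file path to persist between runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS geocode_cache ("
        "address TEXT PRIMARY KEY, lat REAL, lon REAL, fetched_at REAL)"
    )
    return conn

def cached_geocode(conn, addresses, geocoder, max_age=None, force=False, now=None):
    """Geocode addresses, calling `geocoder` only for missing or stale entries.

    geocoder: callable taking a list of addresses, returning a list of (lat, lon).
    max_age:  seconds after which a cached row is refreshed (None = never).
    force:    ignore cached rows and replace every result.
    """
    now = time.time() if now is None else now
    unique = list(dict.fromkeys(addresses))  # de-duplicate, keep order
    found = {}
    if not force:
        for addr in unique:
            row = conn.execute(
                "SELECT lat, lon, fetched_at FROM geocode_cache WHERE address = ?",
                (addr,),
            ).fetchone()
            if row and (max_age is None or now - row[2] <= max_age):
                found[addr] = (row[0], row[1])
    missing = [a for a in unique if a not in found]
    if missing:
        for addr, (lat, lon) in zip(missing, geocoder(missing)):
            conn.execute(
                "INSERT OR REPLACE INTO geocode_cache VALUES (?, ?, ?, ?)",
                (addr, lat, lon, now),
            )
            found[addr] = (lat, lon)
        conn.commit()
    return [found[a] for a in addresses]

# Stand-in for an external geocoder; records each batch it is asked for.
calls = []
def fake_geocoder(batch):
    calls.append(list(batch))
    return [(float(len(a)), 0.0) for a in batch]

conn = open_cache()
week1 = cached_geocode(conn, ["1 Main Street", "2 Oak Ave"], fake_geocoder, now=0)
# Only "1 Main St." goes to the geocoder here; the key is the exact input string.
week2 = cached_geocode(conn, ["1 Main Street", "1 Main St."], fake_geocoder, now=10)
```

Keying on the exact input keeps the cache trivial to reason about, at the cost of occasionally geocoding two spellings of the same address.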