jessecambon / tidygeocoder

Geocoding Made Easy
https://jessecambon.github.io/tidygeocoder

geocode with method census returns NA if match_indicator = "Tie" #87

Closed davidkreitmeir closed 3 years ago

davidkreitmeir commented 3 years ago

This is the current code that I am running.


          city = res_city_desc,
          state = state_cd,
          postalcode = zip_code,
          method = 'census', 
          full_results = TRUE, 
          return_type = 'geographies',
          unique_only= FALSE,
          flatten = FALSE)

The issue is that when `match_indicator = "Tie"` (which seems to indicate a non-unique match), the returned result nevertheless contains only missing values (`NA`) for all geolocation fields, just as if `match_indicator = "No_Match"`.

jessecambon commented 3 years ago

@davidkreitmeir thanks, could you post a reproducible example (https://www.tidyverse.org/help/)? Also, I think the top of your code might have gotten cut off.

davidkreitmeir commented 3 years ago

@jessecambon thanks a lot for getting back to me. And sorry that part of the code got cut off.

I kept experimenting with the addresses that were originally flagged as "Tie" and realised that when the batch size is really small (i.e. just 3 addresses), 3 matches are now found, whereas before, when running the code with a batch size of 100, the addresses returned "Tie" and NA.

I attached an xlsx file (csv was somehow not allowed) with 100 addresses that were originally flagged as "Tie"; the code is below.

Note: A general issue seems to be that the geocoding is unstable. In one run it returned almost exclusively "Tie" for the xlsx file (in my original exercise, when I looped over batches of 100, all addresses in the xlsx file were classified as "Tie"), while in a different session it returned matches for half of them.

library(tidyverse)
library(tidygeocoder)
library(readxl)

example_addresses.short <- tibble(
  res_street_address = c("521 E MAIN ST", "423 E GREENSBORO-CHAPEL HILL RD", "407 S ST JOHN ST"), 
  res_city_desc = c("HAW RIVER", "SNOW CAMP", "BURLINGTON"),
  state_cd = c("NC", "NC", "NC"),
  zip_code = c("27258","27349", "27217")
) %>%
  geocode(street = res_street_address,
          city = res_city_desc,
          state = state_cd,
          postalcode = zip_code,
          method = 'census', 
          unique_only= FALSE,
          #flatten = FALSE,
          full_results = TRUE, 
          return_type = 'geographies'
  )

example_cases.long <- read_xlsx("tidygeocoder_tie_data.xlsx") %>%
  geocode(street = res_street_address,
          city = res_city_desc,
          state = state_cd,
          postalcode = zip_code,
          method = 'census', 
          unique_only= FALSE,
          #flatten = FALSE,
          full_results = TRUE, 
          return_type = 'geographies'
  )

tidygeocoder_tie_data.xlsx

Thanks a lot for your help!

jessecambon commented 3 years ago

@davidkreitmeir thanks for that. Unfortunately, it appears this is just the behavior of the geocoder service. According to the API documentation, a "tie" indicates multiple results for an address. This article mentions that ties return NA results.

I also manually did a batch query and the raw results just contain NAs. However, you can get results if you use mode = 'single' to prevent it from using batch geocoding. Hope that helps.

davidkreitmeir commented 3 years ago

@jessecambon thanks so much for going through the effort. Just so I don't misunderstand anything: no matter whether manual or mode = 'single', ties will return NAs, but using mode = 'single' will apparently return the list of ties?

(Sorry, your code did not run through: it always resulted in my R session being aborted, so I could not check the manual version.)

jessecambon commented 3 years ago

@davidkreitmeir

Try this:


library(dplyr)
library(tidygeocoder)

# these addresses should produce ties in batch mode
tie_addresses <- tibble::tribble(
  ~res_street_address, ~res_city_desc, ~state_cd, ~zip_code,
  "624 W DAVIS ST   #1D",   "BURLINGTON",      "NC",     27215,
  "201 E CENTER ST   #268",       "MEBANE",      "NC",     27302,
  "7833  WOLFE LN",    "SNOW CAMP",      "NC",     27349,
)

## Try using tidygeocoder batch --- NA return
tg_batch <- tie_addresses %>%
  geocode(street = res_street_address,
          city = res_city_desc,
          state = state_cd,
          postalcode = zip_code,
          method = 'census', 
          full_results = TRUE, 
          return_type = 'geographies'
  )

## Try using single address geocoding - lat longs returned successfully
tg_single <- tie_addresses %>%
  geocode(street = res_street_address,
          city = res_city_desc,
          state = state_cd,
          postalcode = zip_code,
          method = 'census', 
          mode = 'single',
          full_results = TRUE, 
          return_type = 'geographies'
  )

The first query uses Census batch geocoding and returns "tie" results with NA coordinates. The second query uses mode = 'single' to force single address geocoding (i.e. one query per address), and this query produces results. It seems that, for whatever reason, the Census batch geocoder just returns NA data when there are multiple possible results, while the single address geocoding mode still returns results.

Separately, I'm working on a limit argument pass through for Census single address geocoding that would allow you to control how many results you want to return, but I'm currently seeing some weird behaviour with the order of the results.

jessecambon commented 3 years ago

Also this script should work now. There was a typo that was causing an issue. Here's the raw results coming from the Census batch geocoder for those addresses:

Response [https://geocoding.geo.census.gov/geocoder/geographies/addressbatch]
  Date: 2021-03-18 20:17
  Status: 200
  Content-Type: text/plain
  Size: 159 B
"1","624 W DAVIS ST #1D, BURLINGTON, NC, 27215","Tie"
"2","201 E CENTER ST #268, MEBANE, NC, 27302","Tie"
"3","7833 WOLFE LN, SNOW CAMP, NC, 27349","Tie"

jessecambon commented 3 years ago

@davidkreitmeir FYI I reached out to the Census geocoding team and received this response:

When the batch geocoding service encounters a Tie for an address it will only return the Tie match indicator and will not include data for the multiple matched addresses. When a tie occurs, it is expected the user will use the single address call to view the multiple matches and decide on which address match is correct. Please let us know if you have any questions.
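
The two-step workflow described in that response (batch first, then single address calls for the ties) could be sketched roughly as below. This is only a minimal sketch under stated assumptions, not part of tidygeocoder: `retry_ties_single` and its `geocode_fn` argument are hypothetical names, the column names are the ones used earlier in this thread, and it assumes the batch result keeps one row per input address in the original order and that single mode returns one row per address.

```r
# Hypothetical helper (not part of tidygeocoder): batch geocode first,
# then re-query only the rows flagged "Tie" one address at a time.
# geocode_fn defaults to tidygeocoder::geocode; it is an argument only so
# the logic can be tried with a stand-in function without calling the service.
retry_ties_single <- function(addresses, geocode_fn = tidygeocoder::geocode) {
  # First pass: Census geocoding with full results so that the
  # match_indicator column is available
  batch <- geocode_fn(addresses,
                      street = res_street_address,
                      city = res_city_desc,
                      state = state_cd,
                      postalcode = zip_code,
                      method = "census",
                      full_results = TRUE)

  # Rows the batch service flagged as ties (these have NA coordinates)
  is_tie <- !is.na(batch$match_indicator) & batch$match_indicator == "Tie"
  if (!any(is_tie)) {
    return(batch)
  }

  # Second pass: one query per tied address
  single <- geocode_fn(addresses[is_tie, ],
                       street = res_street_address,
                       city = res_city_desc,
                       state = state_cd,
                       postalcode = zip_code,
                       method = "census",
                       mode = "single")

  # Copy the coordinates back; assumes `single` has one row per tied address
  batch$lat[is_tie] <- single$lat
  batch$long[is_tie] <- single$long
  batch
}
```

With the `tie_addresses` tibble from the example above, `retry_ties_single(tie_addresses)` would issue one batch query followed by single address queries for the tied rows. Only `lat` and `long` are copied over, so the other full-result columns stay NA for those rows, and since a tie means several candidate matches, the coordinates returned for them may still deserve a manual check.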

davidkreitmeir commented 3 years ago

@jessecambon thanks so much for all your efforts! In my opinion it is a weird choice by the Census geocoding team not to include the multiple results in batch geocoding as well, but I guess from a workflow perspective it is just one extra, if inefficient, step (get the results with match_indicator = "Tie" and then rerun the analysis for them with mode = "single").

Which brings me to a related point: I was trying to work out how to geocode a large number of addresses most efficiently. From timing different options (with tictoc), I found that running batch queries of size 100 in parallel seems to be a good way to go (compared with running non-batch/single queries in parallel, or with larger batch sizes). But I thought it would be wise to draw on your expertise here: do you have any experience or recommendations on how to geocode large amounts of data most efficiently with the Census API?

Thanks a lot again for all your help!

jessecambon commented 3 years ago

@davidkreitmeir you're welcome. I agree, I think that returning the best result (or just one of the results) and indicating that there are multiple results available would be desirable. I'm not sure if the Census team is looking to make adjustments to their service, but I could suggest it.

I haven't done a lot of testing with large Census batch queries, but you can find the code for the most recent testing I did here. Is the main issue you are running into the NA results with "Tie"? Or are you just looking for the most time efficient way to go? At least in theory, whether you get a "tie" result should be dependent on the address and not the size of the batch query you use.

davidkreitmeir commented 3 years ago

@jessecambon thanks for getting back to me on this. I'm currently interested merely in the most efficient way to geocode addresses (I plan to deal with the match_indicator = "Tie" issue later on). Specifically, whether parallelizing smaller batches (e.g. 100) is more efficient than, for instance, big ones (e.g. 5000). From my trial runs this seems to be the case, but there might be better ways to address this task.

jessecambon commented 3 years ago

@davidkreitmeir I haven't tested parallel batches, but I'd be curious what you find out. You could also ask the Census for guidance: geo.geocoding.services@census.gov. I've found them to be pretty responsive.
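
For anyone wanting to compare batch sizes along the lines discussed above, a minimal sketch of chunked geocoding follows. It is an illustration under assumptions rather than a tested recommendation: `geocode_in_chunks`, `chunk_size`, `geocode_fn`, and `apply_fn` are hypothetical names, the column names are those used earlier in this thread, and nothing here establishes which batch size is actually faster.

```r
# Hypothetical helper (not part of tidygeocoder): geocode a data frame
# in fixed-size chunks and stack the results in the original row order.
# apply_fn defaults to lapply (sequential); geocode_fn defaults to
# tidygeocoder::geocode and is only an argument so a stand-in can be used.
geocode_in_chunks <- function(addresses, chunk_size = 100,
                              geocode_fn = tidygeocoder::geocode,
                              apply_fn = lapply) {
  # Assign rows 1..chunk_size to chunk 1, the next chunk_size to chunk 2, ...
  chunk_id <- ceiling(seq_len(nrow(addresses)) / chunk_size)
  chunks <- split(addresses, chunk_id)

  # One geocoding call per chunk
  results <- apply_fn(chunks, function(chunk) {
    geocode_fn(chunk,
               street = res_street_address,
               city = res_city_desc,
               state = state_cd,
               postalcode = zip_code,
               method = "census",
               full_results = TRUE)
  })

  # Stack the per-chunk results back into one data frame
  do.call(rbind, results)
}

# Example timing comparison, commented out because it queries the live
# service (`addresses` is a placeholder for your own data frame):
# system.time(res_100  <- geocode_in_chunks(addresses, chunk_size = 100))
# system.time(res_5000 <- geocode_in_chunks(addresses, chunk_size = 5000))
```

On Unix-like systems `apply_fn` could be swapped for a parallel map such as `parallel::mclapply` to try the parallel variant. Whether the Census service tolerates or benefits from concurrent batch requests is not established in this thread, so that is something to measure or to ask the Census team about.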