ropensci / opencage

R package for the OpenCage API -- both forward and reverse geocoding
https://docs.ropensci.org/opencage

oc_forward parameter countrycode unexpectedly changes results #159

Closed ddunn801 closed 5 months ago

ddunn801 commented 5 months ago

Description & steps to reproduce

The results returned by oc_forward are unexpectedly different when the countrycode parameter is included, even though all results have the US country code either way.

install.packages("opencage")
library(opencage)
oc_config(key = "xxxxxxxxxx", no_record = TRUE, show_key = FALSE) # replace x's with your opencage key
toget <- "552%20NE%20Olney%20Ave%2C%2097701" # raw address: "553 NE Olney Ave, 97701"

# 2 results with confidences of 10 & 7
as.data.frame(oc_forward(placename = toget, return = "df_list"))

# 2 results with confidences of 8 & 7
as.data.frame(oc_forward(placename = toget, countrycode = "US", return = "df_list"))
as.data.frame(oc_forward(placename = toget, countrycode = "us", return = "df_list"))

Session information

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] opencage_0.2.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        rstudioapi_0.15.0 magrittr_2.0.3    R6_2.5.1          rlang_1.1.1       fastmap_1.1.0     fansi_1.0.3       tools_4.2.1      
 [9] pacman_0.5.1      utf8_1.2.2        cli_3.5.0         assertthat_0.2.1  httpcode_0.3.0    tibble_3.1.8      lifecycle_1.0.3   purrr_1.0.0      
[17] ratelimitr_0.4.1  vctrs_0.5.1       triebeard_0.3.0   curl_5.0.1        crul_1.3          memoise_2.0.1     glue_1.6.2        cachem_1.0.6     
[25] compiler_4.2.1    pillar_1.8.1      urltools_1.7.3    jsonlite_1.8.7    pkgconfig_2.0.3 

dpprdan commented 5 months ago

I can (basically) reproduce your results:

library(opencage)

addr <- "553 NE Olney Ave, 97701"

oc_forward_df(addr, limit = 10L)
#> # A tibble: 2 × 4
#>   placename               oc_lat oc_lng oc_formatted                            
#>   <chr>                    <dbl>  <dbl> <chr>                                   
#> 1 553 NE Olney Ave, 97701   44.1  -121. 553 Northeast Olney Avenue, Bend, OR 97…
#> 2 553 NE Olney Ave, 97701   44.1  -121. Deschutes County, OR 97701, United Stat…
oc_forward_df(addr, limit = 10L, countrycode = "US")
#> # A tibble: 2 × 4
#>   placename               oc_lat oc_lng oc_formatted                            
#>   <chr>                    <dbl>  <dbl> <chr>                                   
#> 1 553 NE Olney Ave, 97701   44.1  -121. NE Olney Ave, Bend, OR, United States o…
#> 2 553 NE Olney Ave, 97701   44.1  -121. Deschutes County, OR 97701, United Stat…

I’ve used the raw address you provided. Your toget contains a slightly different housenumber IIUC. BTW, {opencage} does the URL-encoding for you, so you can use the “raw” address. Note that oc_forward_df() directly returns a data.frame and not a df_list, which is more practical for this use case IMHO.

That said, the different results come from the OpenCage API and are not caused by the {opencage} package. These are the queries sent to the OpenCage API (with OPENCAGE_KEY replaced by an actual API key):

oc_forward(addr, limit = 10L, return = "url_only")
#> [[1]]
#> [1] "https://api.opencagedata.com/geocode/v1/json?q=553%20NE%20Olney%20Ave%2C%2097701&limit=10&no_annotations=1&roadinfo=0&no_dedupe=0&no_record=1&abbrv=0&address_only=0&add_request=0&key=OPENCAGE_KEY"
oc_forward(addr, limit = 10L, countrycode = "US", return = "url_only")
#> [[1]]
#> [1] "https://api.opencagedata.com/geocode/v1/json?q=553%20NE%20Olney%20Ave%2C%2097701&countrycode=us&limit=10&no_annotations=1&roadinfo=0&no_dedupe=0&no_record=1&abbrv=0&address_only=0&add_request=0&key=OPENCAGE_KEY"
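
As a quick check on the two URLs above, the following base R sketch splits each query string into its parameters and compares them. It also decodes the q parameter to confirm that the package sent the raw address URL-encoded, and decodes the pre-encoded toget string from the original report. The helper query_params() is a hypothetical name introduced here for illustration; it is not part of {opencage}.

```r
# Query URLs copied from the output above (OPENCAGE_KEY is a placeholder).
url_plain <- paste0(
  "https://api.opencagedata.com/geocode/v1/json?",
  "q=553%20NE%20Olney%20Ave%2C%2097701&limit=10&no_annotations=1",
  "&roadinfo=0&no_dedupe=0&no_record=1&abbrv=0&address_only=0",
  "&add_request=0&key=OPENCAGE_KEY"
)
url_cc <- paste0(
  "https://api.opencagedata.com/geocode/v1/json?",
  "q=553%20NE%20Olney%20Ave%2C%2097701&countrycode=us&limit=10",
  "&no_annotations=1&roadinfo=0&no_dedupe=0&no_record=1&abbrv=0",
  "&address_only=0&add_request=0&key=OPENCAGE_KEY"
)

# Hypothetical helper: return the "name=value" pairs of a URL's query string.
query_params <- function(url) {
  query <- sub("^[^?]*\\?", "", url)
  strsplit(query, "&", fixed = TRUE)[[1]]
}

# Parameters that appear in only one of the two requests.
only_in_cc <- setdiff(query_params(url_cc), query_params(url_plain))
only_in_plain <- setdiff(query_params(url_plain), query_params(url_cc))
print(only_in_cc)
#> [1] "countrycode=us"
print(only_in_plain)
#> character(0)

# Decode the q parameter that was sent for the raw address.
q_sent <- sub("^q=", "", query_params(url_plain)[1])
print(utils::URLdecode(q_sent))
#> [1] "553 NE Olney Ave, 97701"

# The pre-encoded string from the original report decodes to housenumber 552.
print(utils::URLdecode("552%20NE%20Olney%20Ave%2C%2097701"))
#> [1] "552 NE Olney Ave, 97701"
```

The only difference between the two requests is countrycode=us, which is consistent with the differing results originating on the API side.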

Some points I noticed while looking into this:

  1. The address/housenumber is apparently not tagged in OSM (at least it doesn’t show up on OSM), i.e. OSM doesn’t “know” where exactly that housenumber is, but uses a heuristic to return approximate coordinates. (I don’t know which other sources OpenCage might be using for exactly that address.) When I enter the returned coordinates into either Google, Bing (TomTom) or HERE maps, they seem to be slightly off, i.e. nearer to housenumbers 533 or 545 than to 553.
  2. The queried address is not entirely formatted according to the advice given by OpenCage, e.g. the country is missing and there are abbreviations (“NE” and “Ave”). This is probably what you’ve got to work with, but it means that OpenCage has to normalize the address first. In fact, I think that is what is happening in the first case, i.e. OpenCage is returning the normalized address. In the second case it has a countrycode as additional information, so the query is probably taking another path in the OpenCage algorithm.

Long story short: I can reproduce the results but there is nothing (sensible) I can do at the {opencage} package level to fix this. The result is a bit strange but I am not sure I would call this a bug that needs fixing at the API level. Pinging @freyfogle nevertheless.

Created on 2024-06-27 with reprex v2.1.0

freyfogle commented 5 months ago

Great diagnosis @dpprdan, very comprehensive.

Will investigate.

But yes, adding the address to OpenStreetMap is always a great idea. Here's a guide.

freyfogle commented 5 months ago

Wow, this turned into quite the interesting bug, thanks so much for posting it.

One of the big challenges we have to deal with is people sending us only partially formed addresses. Basically when countrycode is set, we do additional country-specific logic.

Say, for example, someone sends us 553 NE Olney Ave, 97701 because they can't be bothered to include the town name or the state, as is common in US addresses.

So we have a lot of logic to try to add missing information like that. This includes expanding common abbreviations, for example, the abbreviations of state codes.

As I'm sure you are aware, NE is the two-letter state code of the great state of Nebraska (The Cornhusker State). That leads to all sorts of confusion.
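
The ambiguity can be seen with a toy example. This is only an illustrative sketch in base R and makes no claim about how OpenCage actually parses addresses: it naively treats every token that matches a US state abbreviation as a state, using the state.abb and state.name vectors from R's built-in datasets package.

```r
# Toy illustration only: this is NOT OpenCage's actual parsing logic.
addr <- "553 NE Olney Ave, 97701"

# Split the address into tokens on spaces and commas.
tokens <- strsplit(addr, "[ ,]+")[[1]]

# Naive rule (an assumption for this sketch): any token that equals a
# two-letter US state abbreviation is expanded to the state name.
is_state_code <- tokens %in% datasets::state.abb
candidates <- datasets::state.name[match(tokens[is_state_code], datasets::state.abb)]

print(tokens[is_state_code])
#> [1] "NE"
print(candidates)
#> [1] "Nebraska"
```

Under such a rule the directional prefix "NE" (Northeast) in the street name is misread as Nebraska, even though the postcode 97701 belongs to Oregon.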

This is now fixed and test cases added.

dpprdan commented 5 months ago

Both queries return the same result now, indeed:

library(opencage)

addr <- "553 NE Olney Ave, 97701"

(oc1 <- oc_forward_df(addr, limit = 10L))
#> # A tibble: 2 × 4
#>   placename               oc_lat oc_lng oc_formatted                            
#>   <chr>                    <dbl>  <dbl> <chr>                                   
#> 1 553 NE Olney Ave, 97701   44.1  -121. 553 Northeast Olney Avenue, Bend, OR 97…
#> 2 553 NE Olney Ave, 97701   44.1  -121. Deschutes County, OR 97701, United Stat…
(oc2 <- oc_forward_df(addr, limit = 10L, countrycode = "US"))
#> # A tibble: 2 × 4
#>   placename               oc_lat oc_lng oc_formatted                            
#>   <chr>                    <dbl>  <dbl> <chr>                                   
#> 1 553 NE Olney Ave, 97701   44.1  -121. 553 Northeast Olney Avenue, Bend, OR 97…
#> 2 553 NE Olney Ave, 97701   44.1  -121. Deschutes County, OR 97701, United Stat…

all.equal(oc1, oc2)
#> [1] TRUE

Thanks @ddunn801 for reporting!