jessecambon / tidygeocoder

Geocoding Made Easy
https://jessecambon.github.io/tidygeocoder

Include more detail in ArcGIS results when full_results=TRUE #177

Open ottothecow opened 1 year ago

ottothecow commented 1 year ago

I noticed that the results for the ArcGIS geocoder felt a little "light" when full_results=TRUE is set.

In particular, it doesn't provide any indication of what type of match has been found. It has a 'score' variable, but there's no way to know if a score refers to a good point address match or to a match to a city center.

I did some digging in their API documentation and saw that in order to capture full address detail, you need to set outFields=*, which returns more detailed fields rather than just the defaults.

I can get this in my own code by setting something like this:

    geocode(address=full_addr,
            verbose=TRUE,
            method='arcgis',
            full_results=TRUE,
            custom_query=list(outFields='*')) 

But I would vote that outFields=* be set automatically when full_results=TRUE, as this brings it in line with the level of detail provided by other geocoders when asking for full results.

jessecambon commented 1 year ago

Hi @ottothecow, thanks for pointing this out. If you look in the API parameters table on the geocoding services page you can see which parameters have default values specified for each service. Currently, with the exception of the US Census service, the only default values being passed are to specify JSON format.

Are there any downsides to adding outFields="*" as a default? For instance, is there a significant difference in the speed of the queries? And how would a user override this default and specify that they only want the default columns?

At the very least, it would be good to add a note on this in the documentation. I just want to make sure adding a default value for a parameter like this doesn't create other complications.

ottothecow commented 1 year ago

I'm not seeing a noticeable difference in time. I drew a random sample of 50 addresses and ran them through each method 5 times:

    mbm <- microbenchmark(
      custom = samp %>% geocode(address=full_addr,
                                verbose=TRUE,
                                method='arcgis',
                                full_results=TRUE,
                                custom_query=list(outFields='*')),
      default = samp %>% geocode(address=full_addr,
                                 verbose=TRUE,
                                 method='arcgis',
                                 full_results=TRUE),
      times=5
    )

    Unit: seconds
        expr      min       lq     mean   median       uq      max neval
      custom 24.13781 24.50241 25.88267 26.63848 26.79353 27.34114     5
     default 24.83230 25.14358 26.81704 26.75759 28.60472 28.74701     5

I'd note that on a prior run of 25 addresses over 3 repetitions, the default was marginally faster, so I'm not reading anything into the fact that the outFields='*' version was faster here. For the record, when I've done geocoding within ArcGIS software in the past, it always returned the full variable set (which is how I noticed that variables seemed to be missing).

Given that I'm not seeing a difference in speed, I don't necessarily see any harm in adding it as a default. It does roughly triple the size of the resulting dataframe (and similarly increases network overhead), but I assume most users of tidygeocoder aren't using it in real-time applications or overly concerned with memory usage. There's always the risk that adding columns could break someone's code, but I would hope most users aren't writing programs that are susceptible to that, especially when using external APIs that can change.

A note in the documentation seems OK too--just needs to be clear that there's more potential information available. In my mind the crucial piece of missing information is the attributes.Addr_type variable as I typically don't want to rely upon any geocodes that don't match 'PointAddress' or 'StreetAddress' and will instead attempt to run them through another API.
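The filtering described above could look roughly like the following. This is a minimal sketch on a hand-made data frame rather than real geocoder output: the `attributes.Addr_type` and `score` column names follow the ones mentioned in this thread, while the example rows and the `accepted` / `needs_retry` names are made up for illustration.

```r
# Toy stand-in for geocode(..., method = 'arcgis', full_results = TRUE) output.
# The rows are invented; only the column names come from the thread.
results <- data.frame(
  full_addr = c("addr A", "addr B", "addr C"),
  score = c(100, 98.5, 100),
  attributes.Addr_type = c("PointAddress", "Locality", "StreetAddress"),
  stringsAsFactors = FALSE
)

# Match types considered precise enough to keep.
trusted_types <- c("PointAddress", "StreetAddress")
is_trusted <- results$attributes.Addr_type %in% trusted_types

accepted    <- results[is_trusted, ]   # keep these geocodes
needs_retry <- results[!is_trusted, ]  # send these to another geocoding service
```

Note that a high `score` alone would not have flagged the second row, which is the reason for looking at the match type.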

Also, perhaps it should be a separate feature request, but now that I am looking at the ArcGIS documentation, it looks like it does allow specifying address components. I did a quick test with the following modifications to api_parameter_reference.R:

  ########################### ArcGis #################################
  # ArcGis may not require an api key

  "arcgis", "address", "SingleLine", NA, FALSE,
  "arcgis", "street", "address", NA, FALSE,
  "arcgis", "city", "city", NA, FALSE,
  "arcgis", "state", "region", NA, FALSE,
  "arcgis", "postalcode", "postal", NA, FALSE,
  "arcgis", "country", "countryCode", NA, FALSE,
  "arcgis", "limit", "maxLocations", "1", FALSE,
  "arcgis", "format", "f", "json", TRUE,
  "arcgis", "outFields", "outFields", "*", TRUE,

It seems to work both for including the extra fields and for geocoding using address components rather than single-line.
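To illustrate what the mapping above does, here is a small sketch that renames generic address arguments to the ArcGIS parameter names listed in those rows. The `to_arcgis_query` helper and the sample address are hypothetical and only meant to show the idea; this is not how tidygeocoder builds its queries internally.

```r
# Generic argument name -> ArcGIS parameter name, copied from the rows above.
arcgis_names <- c(
  address = "SingleLine", street = "address", city = "city",
  state = "region", postalcode = "postal", country = "countryCode",
  limit = "maxLocations"
)

# Hypothetical helper: rename a list of generic arguments to ArcGIS names.
to_arcgis_query <- function(generic_args) {
  unknown <- setdiff(names(generic_args), names(arcgis_names))
  if (length(unknown) > 0) {
    stop("No ArcGIS mapping for: ", paste(unknown, collapse = ", "))
  }
  names(generic_args) <- unname(arcgis_names[names(generic_args)])
  generic_args
}

# Made-up address components, used only to exercise the helper.
query <- to_arcgis_query(list(street = "1 Main St", city = "Springfield", state = "IL"))
names(query)
```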

Overriding this is possible by adding custom_query=list(outFields=''), which might seem a little awkward and would need to be documented. Not sure if adding it to the api_options list as something like 'arcgis_fields' would be any cleaner. I could see pros/cons either way: using custom_query preserves the API's naming convention, but putting it in api_options might make it more obvious to end users that there is a choice they can make.
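As a rough illustration of the override idea, the sketch below merges a set of default parameters with a user-supplied custom_query so that the user's value wins. It assumes defaults of `f = "json"` and `outFields = "*"` as in the api_parameter_reference.R rows quoted earlier; `merge_query` is a hypothetical helper and not tidygeocoder's actual merging logic.

```r
# Assumed defaults, mirroring the "format" and "outFields" rows quoted earlier.
default_params <- list(f = "json", outFields = "*")

# Hypothetical merge step: values in custom_query replace defaults of the same name.
merge_query <- function(defaults, custom_query = list()) {
  modifyList(defaults, custom_query)
}

with_default  <- merge_query(default_params)
with_override <- merge_query(default_params, list(outFields = ""))

with_default$outFields   # "*"
with_override$outFields  # ""
```

One thing to watch with this approach: `modifyList` drops an element when the override is `NULL`, so passing an empty string and passing `NULL` would behave differently.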

jessecambon commented 1 year ago

Good catch RE the ArcGIS address component parameters. I created a separate issue for that feature here: #180

jessecambon commented 1 year ago

@ottothecow I went ahead and made outFields='*' the default like you suggested and added a note about that in the geocoding services page. That's all in the main branch along with the fix for #180 and #178. I can add you as a contributor to the DESCRIPTION file, just let me know if you want me to use your real name or your github username.

ottothecow commented 1 year ago

@jessecambon you can go ahead and list it as Otto Hansen. Thanks!

jessecambon commented 1 year ago

@ottothecow no problem. Also, let me know if you have an email and/or ORCID you want me to add.

ottothecow commented 1 year ago

@jessecambon otto@uchicago.edu works

ottothecow commented 1 year ago

@jessecambon and since you asked, I went ahead and created an ORCID: https://orcid.org/0000-0002-4618-5667