kmcdono2 closed this issue 1 year ago
It depends on the choice of geocoder. Some geocoders (e.g. arcgis) do not change their results when the city name and county name are appended, while others (e.g. google) change the results accordingly. Even then, adding the city name or county name does not guarantee that the geocoding results fall within that area (this holds for the google geocoder too): geocoders rank candidates with an internal mechanism, and a higher-ranked candidate can still lie outside the city/county boundary. A filtering step is therefore still required.
Error analysis is a good suggestion. I'll try to identify which recognized text labels lead to falsely geolocated places.
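To make the filtering step concrete, here is a minimal sketch of keeping only the candidates that fall inside a boundary polygon. It is not the pipeline's actual code: the candidate dictionaries, the `filter_candidates` and `point_in_polygon` names, and the rectangular boundary are all hypothetical, and a real run would more likely test against the county shapefiles with a geometry library.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_candidates(candidates: List[Dict], boundary: List[Point]) -> List[Dict]:
    """Keep candidates inside the boundary, preserving the geocoder's ranking order."""
    return [
        c for c in candidates
        if point_in_polygon((c["lon"], c["lat"]), boundary)
    ]


# Hypothetical data: a rough rectangle, NOT a real county boundary.
boundary = [(-119.0, 33.7), (-117.6, 33.7), (-117.6, 34.8), (-119.0, 34.8)]
candidates = [
    {"name": "Main St (top ranked, outside)", "lon": -95.37, "lat": 29.76},
    {"name": "Main St (lower ranked, inside)", "lon": -118.24, "lat": 34.05},
]
kept = filter_candidates(candidates, boundary)
```

The point of the example is that the top-ranked candidate is dropped and the lower-ranked one survives, which is the case the appended city name alone cannot catch.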
@zekun-li Great. I think going forward, knowing that there is a massive amount of duplication of city names across US states, we would want to only use a geocoder that can include these kinds of rules.
@zekun-li - We talked about this in the meeting on Thursday, but I wanted to get it in written form so you can review/discuss with @yaoyichi if needed.
I really like that you have thought about how to reduce error in the Sanborn results by including the city name. Later, in the clustering method, you use the county shapefiles to limit the results to the relevant counties (remember that Washington, DC does not have counties, and that the abbreviation for Washington County AK was for the wrong state...). I think you would get much better results in the first geocoding step if you include [text spotting result], city, state. I would guess that many of the results I see (here: file:///Users/kmcdonough/Downloads/plot_geocoding_results.html) occur because there are cities with the same name across the US and Mexico. Limiting by state at this point in the method would reduce false positives.
Would it be possible to re-run the Sanborn maps with this in mind? And could there be a notebook that shows not only the map at the continent level but also the list of geocoding outputs for each map, so that it's easy to see where falsely geolocated places may remain?
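As a small illustration of the "[text spotting result], city, state" query format, something like the sketch below could build the strings before they are sent to a geocoder. The `build_query` helper and the sample labels are hypothetical, and how the string is then passed to a particular geocoder is left out because it depends on the service.

```python
from typing import List


def build_query(label: str, city: str, state: str) -> str:
    """Join a spotted text label with city and state, skipping empty parts."""
    parts = [p.strip() for p in (label, city, state) if p and p.strip()]
    return ", ".join(parts)


# Hypothetical text spotting results for a map sheet of Washington, DC.
labels: List[str] = ["Union Station", " K St "]
queries = [build_query(lbl, "Washington", "DC") for lbl in labels]
```

Skipping empty parts means a sheet with no known city still yields a usable "label, state" query instead of a string with stray commas.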
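One possible shape for the per-map listing, as a rough sketch only: group the geocoding outputs by map, compute each output's distance from that map's known location, and flag the far ones. The record fields, `report_by_map`, the 50 km threshold, and the sample coordinates are assumptions for illustration, not the project's data or code.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def report_by_map(
    results: List[Dict],
    map_centers: Dict[str, Tuple[float, float]],
    max_km: float = 50.0,  # hypothetical threshold for "suspect" results
) -> Dict[str, List[Dict]]:
    """Group geocoding outputs by map and flag those far from the map's location."""
    report = defaultdict(list)
    for r in results:
        c_lat, c_lon = map_centers[r["map_id"]]
        d = haversine_km(c_lat, c_lon, r["lat"], r["lon"])
        report[r["map_id"]].append(
            {"label": r["label"], "distance_km": round(d, 1), "suspect": d > max_km}
        )
    # Largest errors first, so likely false geolocations are at the top.
    for rows in report.values():
        rows.sort(key=lambda row: row["distance_km"], reverse=True)
    return dict(report)


# Hypothetical inputs: one map sheet and two geocoded labels.
map_centers = {"sanborn_001": (38.90, -77.04)}
results = [
    {"map_id": "sanborn_001", "label": "Union Station", "lat": 38.897, "lon": -77.006},
    {"map_id": "sanborn_001", "label": "Main St", "lat": 19.43, "lon": -99.13},
]
report = report_by_map(results, map_centers)
```

In a notebook, each map's list could then be shown as a table next to the map, with the suspect rows at the top.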
I think taking this step would significantly improve the distance error distribution! Happy to follow up here or by email.