Open katieashaw opened 2 years ago
One nuance for this particular instance that I think would solve a lot of the issues we ended up moderating: there is a village called Hetian in Dongguan, in which there are a lot of facilities. Xinjiang also has a prefecture that can go by that name (https://en.wikipedia.org/wiki/Hotan_Prefecture). It seems the geocoding is seeing that word and putting it in Xinjiang, rather than Dongguan. Maybe there's a rougher fix for this particular issue that can be done more quickly as we look at a larger, longer-term fix for geocoding in China in general.
Noting that, of the 51 facilities that were plotting in the area search, 24 needed to be moved. Details here: Xinjiang Facilities - facilities (9).csv
One additional question in relation to this issue: it's not clear to us why running the addresses through Google Maps plots them in the correct province, whereas the OAR geocoder plotted them incorrectly in Xinjiang. If our geocoder is powered by Google, why did this discrepancy occur?
@jwalgran Notes as we look into this:
partial_match = True. We do also see this on many items with which we have had no geolocation issues, however.
In terms of handling inconclusive geolocation results, I think there are a few options we could pursue, but there’s no perfect answer.
Another example (in the opposite use case) to share for this issue is this facility: https://openapparel.org/facilities/CN2020281S5SP9P?q=CN2020281S5SP9P
It's currently plotting in Shenzhen, because that is in the name of the industrial park in the address, but it should be plotting in Xinjiang.
Some more detail on potentially using the geocoding response details to return additional information in the facility details API.
These excerpts are taken from the Google geocoding API documentation https://developers.google.com/maps/documentation/geocoding/requests-geocoding
partial_match indicates that the geocoder did not return an exact match for the original request, though it was able to match part of the requested address. You may wish to examine the original request for misspellings and/or an incomplete address.
location_type stores additional data about the specified location. The following values are currently supported:
- "ROOFTOP" indicates that the returned result is a precise geocode for which we have location information accurate down to street address precision.
- "RANGE_INTERPOLATED" indicates that the returned result reflects an approximation (usually on a road) interpolated between two precise points (such as intersections). Interpolated results are generally returned when rooftop geocodes are unavailable for a street address.
- "GEOMETRIC_CENTER" indicates that the returned result is the geometric center of a result such as a polyline (for example, a street) or polygon (region).
- "APPROXIMATE" indicates that the returned result is approximate.
If we return the values of the partial_match flag and the location_type, we can show messages on the facility details page explaining potential known inaccuracies of the coordinates.
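As a rough illustration of that idea, here is a minimal sketch in Python that reads partial_match and location_type from the top result of a stored Google geocoding response and turns them into notes for the details page. The function name, the message wording, and the choice to show nothing for ROOFTOP results are assumptions for illustration, not existing OAR code.

```python
from typing import List

# Hypothetical message wording; the real copy would be decided by the team.
PARTIAL_MATCH_NOTE = "Only part of the submitted address could be matched."
LOCATION_TYPE_NOTES = {
    "RANGE_INTERPOLATED": "Coordinates are interpolated between two known points.",
    "GEOMETRIC_CENTER": "Coordinates are the center of a street or region.",
    "APPROXIMATE": "Coordinates are approximate.",
}


def coordinate_accuracy_notes(geocode_response: dict) -> List[str]:
    """Return user-facing notes about known inaccuracies of the top result."""
    results = geocode_response.get("results") or []
    if not results:
        return []
    top = results[0]
    location_type = top.get("geometry", {}).get("location_type")
    # Assumption: a ROOFTOP result is precise enough that no note is shown.
    if location_type == "ROOFTOP":
        return []
    notes = []
    if top.get("partial_match"):
        notes.append(PARTIAL_MATCH_NOTE)
    if location_type in LOCATION_TYPE_NOTES:
        notes.append(LOCATION_TYPE_NOTES[location_type])
    return notes
```

The notes could then be returned alongside the coordinates in the facility details API so the page can render them.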
How would this work for the entries where there is a partial_match = true flag but the geocode is actually correct (as Tai references in option #1 of the above comment)? OR does that only happen in the case where partial match = true AND the geocode result is "ROOFTOP"? In which case we would see that result and ignore the partial_match flag?
My assumption is that we would not encounter both partial_match = true and a result type of ROOFTOP, since ROOFTOP means that Google was able to plot the location exactly. If this did occur for some reason, your suggestion makes sense. We would likely not show any message if we got a match type of ROOFTOP.
OK, thanks Justin. The part I'm struggling to understand is why does Tai mention:
"When the top result (and therefore presumably all the results) has the partial_match flag set for a new facility, set the has_inexact_coordinates flag on the Facility. Downside: I think this would result in that flag being set in a lot of cases where the geolocation is actually correct."
Are you saying that, instead of showing a generic "inexact coordinates" message in all these cases, we would show the geocode result, which would give the user a better sense of what is going on with the data and, hopefully, a better feeling about data reliability?
To make all the context explicit, has_inexact_coordinates refers to this checkbox that the team can set via the Django admin.
Checking that option shows explanatory text below the coordinates in the sidebar.
To date this has been a fully manual process, and the implementation option we are discussing is to add some programmatic checking of this box when Google tells us the result is a partial_match. It is quick to implement because we would just be adding automation to an existing feature, not a wholly new feature.
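A minimal sketch of what that automation could look like, assuming the stored geocoder response is available as a dict in Google's documented JSON shape. The helper name is hypothetical, and skipping ROOFTOP results follows the assumption discussed earlier in this thread rather than any existing code.

```python
def should_flag_inexact_coordinates(geocode_response: dict) -> bool:
    """Decide whether has_inexact_coordinates should be set automatically.

    Looks only at the top result, which is the one used for plotting.
    """
    results = geocode_response.get("results") or []
    if not results:
        # Nothing to judge; leave the flag for manual review.
        return False
    top = results[0]
    if not top.get("partial_match", False):
        return False
    # Assumption: a ROOFTOP location_type overrides the partial_match flag.
    location_type = top.get("geometry", {}).get("location_type")
    return location_type != "ROOFTOP"
```

The result would be assigned to has_inexact_coordinates when a new facility is created, leaving the manual checkbox in the Django admin available as an override.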
Thanks for the detail, Justin. I'm understanding the part about setting the inexact flag automatically based on the geocoder result.
What I am hung up on is this comment from Tai: "When the top result (and therefore presumably all the results) has the partial_match flag set for a new facility, set the has_inexact_coordinates flag on the Facility. Downside: I think this would result in that flag being set in a lot of cases where the geolocation is actually correct. We could cut down frequency by also only doing this on items which have an unusually large number of results (say, 5+) but that’s pretty arbitrary."
According to that comment, we have no way of knowing how frequently we will set the 'has inexact coordinates' flag when the geocode result is correct, i.e. is this a false positive 25% of the time? 50%? 75%? The alternative is that we don't do anything and let the Google geocoder get better on its own. Do you have any sense of whether we are talking about more or less than 25% of the time?
cc: @hlennett @katieashaw if we enable programmatically setting this flag, that should cut down or eliminate the need for you to manually set it. But how many false positives (the flag is set when the coordinates are correct) is too many? (e.g. how worried are you about users complaining that this message is always there)
Really interesting to read all this through, thanks everyone! Mariel raises an excellent point that we will erode the trust of our users if this flag starts to crop up too frequently on facility profiles. I wouldn't want to see it occurring more than 25% of the time, ideally.
We have recorded the full response from the Google geocoder for every item that has ever been submitted to the OAR, so it would be possible for us to analyze that data and determine the percentage of results that are partial_match and also the distribution of the different location_type values.
Whether or not a partial_match geocode result is "correct" is not something we can determine programmatically. If we do the analysis and produce a list of all the partial match results, we could spot check them to see what the results look like and make more educated assertions about how the partial_match flag corresponds to geocoding result quality.
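The analysis described above could start from something like the following sketch, which assumes the recorded responses can be loaded as a list of dicts in Google's documented JSON shape. The function name and the returned keys are hypothetical, and how the responses are loaded from the database is left out.

```python
from collections import Counter
from typing import Iterable


def summarize_geocode_responses(responses: Iterable[dict]) -> dict:
    """Summarize partial_match frequency and location_type distribution.

    Only the top result of each response is counted, since that is the
    result used to plot the facility. Responses with no results are skipped.
    """
    total = 0
    partial = 0
    location_types = Counter()
    for response in responses:
        results = response.get("results") or []
        if not results:
            continue
        top = results[0]
        total += 1
        if top.get("partial_match"):
            partial += 1
        location_type = top.get("geometry", {}).get("location_type", "UNKNOWN")
        location_types[location_type] += 1
    percent = 100.0 * partial / total if total else 0.0
    return {
        "total": total,
        "partial_match_percent": percent,
        "location_types": dict(location_types),
    }
```

The partial_match entries identified this way would still need a manual spot check, since the summary says nothing about whether each plotted location is actually correct.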
We have decided to table this issue for now and incorporate it into a larger review of the geocoding process as part of OGR planning and development. Moving this to the backlog and removing the high priority label.
Overview
Running an area search for facilities in Xinjiang currently returns 51 results.
In reviewing the data, it's clear that the vast majority are plotting incorrectly in the region. Numerous examples are from Dongguan, with others in Shenzhen and Zhejiang. Only ~16 entries have Xinjiang in the name or address.
Expected Behavior
Facilities should plot in the correct province / region.
Actual Behavior
Numerous facilities are incorrectly plotting in Xinjiang.
Demo
CSV export of the search: facilities (1).csv
Additional context
This is problematic on two counts:
1) It raises serious concerns about the accuracy of Google's geocoding in China.
2) Xinjiang is an incredibly sensitive area geopolitically, and the OAR is currently incorrectly connecting organizations to this area, which is extremely problematic.