opensupplyhub / open-apparel-registry

An application for searching, matching, uploading factories.
MIT License

Numerous geo-coding errors in China regional search: multiple facilities plotting in Xinjiang which are not in the region #1572

Open katieashaw opened 2 years ago

katieashaw commented 2 years ago

Overview

Running an area search for facilities in Xinjiang currently returns 51 results

In reviewing the data, it's clear that the vast majority of these facilities are plotting incorrectly in the region. Numerous examples are from Dongguan, with others in Shenzhen and Zhejiang. Only ~16 entries have Xinjiang in the name or address.

Expected Behavior

Facilities should plot in the correct province / region.

Actual Behavior

Numerous facilities are incorrectly plotting in Xinjiang.

Demo

CSV export of the search: facilities (1).csv

Additional context

This is problematic on two counts:

1) It raises serious concerns about the accuracy of Google's geocoding in China.

2) Xinjiang is an incredibly sensitive area geopolitically, and the OAR is currently incorrectly connecting organizations to this area, which is extremely problematic.

hlennett commented 2 years ago

One nuance for this particular instance that I think would solve a lot of the issues we ended up moderating: there is a village called Hetian in Dongguan, in which there are a lot of facilities. Xinjiang also has a prefecture that can go by that name (https://en.wikipedia.org/wiki/Hotan_Prefecture). It seems the geocoding is seeing that word and putting it in Xinjiang, rather than Dongguan. Maybe there's a rougher fix for this particular issue that can be done more quickly as we look at a larger, longer-term fix for geocoding in China in general.

katieashaw commented 2 years ago

Noting that, of the 51 facilities that were plotting in the area search, 24 needed to be moved. Details here: Xinjiang Facilities - facilities (9).csv

katieashaw commented 2 years ago

One additional question in relation to this issue: it's not clear to us why running the addresses through Google Maps plots them in the correct province, whereas the OAR geocoder plotted them incorrectly in Xinjiang. If our geocoder is powered by Google, why did this discrepancy occur?

TaiWilkin commented 2 years ago

@jwalgran Notes as we look into this:

In terms of handling inconclusive geolocation results, I think there are a few options we could pursue, but there’s no perfect answer.

hlennett commented 2 years ago

Another example (in the opposite use case) to share for this issue is this facility: https://openapparel.org/facilities/CN2020281S5SP9P?q=CN2020281S5SP9P

It's currently plotting in Shenzhen, because that is in the name of the industrial park in the address, but it should be plotting in Xinjiang.

jwalgran commented 2 years ago

Some more detail on potentially using the geocoding response details to return additional information in the facility details API.

These excerpts are taken from the Google geocoding API documentation https://developers.google.com/maps/documentation/geocoding/requests-geocoding

partial_match indicates that the geocoder did not return an exact match for the original request, though it was able to match part of the requested address. You may wish to examine the original request for misspellings and/or an incomplete address.

location_type stores additional data about the specified location. The following values are currently supported:

  • "ROOFTOP" indicates that the returned result is a precise geocode for which we have location information accurate down to street address precision.
  • "RANGE_INTERPOLATED" indicates that the returned result reflects an approximation (usually on a road) interpolated between two precise points (such as intersections). Interpolated results are generally returned when rooftop geocodes are unavailable for a street address.
  • "GEOMETRIC_CENTER" indicates that the returned result is the geometric center of a result such as a polyline (for example, a street) or polygon (region).
  • "APPROXIMATE" indicates that the returned result is approximate.

If we return the values of the partial_match flag and the location_type, we can show messages on the facility details page explaining potential known inaccuracies of the coordinates.

mariel-oar commented 2 years ago

How would this work for the entries where there is a partial_match = true flag but the geocode is actually correct (as Tai references in option #1 of the above comment)? Or does that only happen in the case where partial_match = true AND the geocode result is "ROOFTOP"? In which case we would see that result and ignore the partial_match flag?

jwalgran commented 2 years ago

My assumption is that we would not encounter both partial_match = true and a location_type of ROOFTOP, since ROOFTOP means that Google was able to plot the location exactly. If this did occur for some reason, your suggestion makes sense. We would likely not show any message if we got a location_type of ROOFTOP.
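As a rough illustration of the idea discussed above, here is a minimal sketch of turning one geocoder result into an optional message. It assumes the result is a dict shaped like an entry in the Google Geocoding API `results` array, with `partial_match` at the top level and `location_type` under `geometry`. The function name `coordinate_warning` and the message wording are hypothetical and not part of the OAR codebase.

```python
from typing import Optional

# Hypothetical message text; the real wording would be a product decision.
LOCATION_TYPE_MESSAGES = {
    "RANGE_INTERPOLATED": "The coordinates are interpolated along a road.",
    "GEOMETRIC_CENTER": "The coordinates are the center of a street or region.",
    "APPROXIMATE": "The coordinates are approximate.",
}


def coordinate_warning(result: dict) -> Optional[str]:
    """Return a message describing known inaccuracies, or None."""
    location_type = result.get("geometry", {}).get("location_type")

    # Per the discussion above: show nothing for ROOFTOP results,
    # even if partial_match happens to be set.
    if location_type == "ROOFTOP":
        return None

    parts = []
    # Google includes partial_match only when it is true.
    if result.get("partial_match", False):
        parts.append("The geocoder matched only part of the submitted address.")
    if location_type in LOCATION_TYPE_MESSAGES:
        parts.append(LOCATION_TYPE_MESSAGES[location_type])

    return " ".join(parts) or None
```

A result with no `partial_match` key and an unrecognized or missing `location_type` produces no message under this sketch.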

mariel-oar commented 2 years ago

OK, thanks Justin. The part I'm struggling to understand is why does Tai mention:

"When the top result (and therefore presumably all the results) has the partial_match flag set for a new facility, set the has_inexact_coordinates flag on the Facility. Downside: I think this would result in that flag being set in a lot of cases where the geolocation is actually correct."

Are you saying that instead of showing a generic "inexact coordinates" message in all these cases, we would instead show the geocode result, which would give the user a better sense of what is going on with the data, and hopefully a better feeling about data reliability?

jwalgran commented 2 years ago

To make all the context explicit, has_inexact_coordinates refers to this checkbox that the team can set via the Django admin:

Screen Shot 2022-02-16 at 10 27 27 AM

Checking that option shows explanatory text below the coordinate in the sidebar.

Screen Shot 2022-02-16 at 10 29 44 AM

To date this has been a fully manual process, and the implementation option we are discussing is to add some programmatic checking of this box when Google tells us it is a partial_match. It is quick to implement because we would just be adding automation to an existing feature, not a wholly new feature.

mariel-oar commented 2 years ago

Thanks for the detail, Justin. I'm understanding the part about setting the inexact flag automatically based on the geocoder result.

What I am hung up on is this comment from Tai: "When the top result (and therefore presumably all the results) has the partial_match flag set for a new facility, set the has_inexact_coordinates flag on the Facility. Downside: I think this would result in that flag being set in a lot of cases where the geolocation is actually correct. We could cut down frequency by also only doing this on items which have an unusually large number of results (say, 5+) but that’s pretty arbitrary."

According to that comment, we have no way of knowing how frequently we will set the 'has inexact coordinates' flag when the geocode result is correct, i.e. is this a false positive 25% of the time? 50%? 75%? The alternative is that we don't do anything and we let the Google geocoder get better on its own. Do you have any sense of whether we are talking about more or less than 25% of the time?

cc: @hlennett @katieashaw if we enable programmatically setting this flag, that should cut down or eliminate the need for you to manually set it. But how many false positives (the flag is set when the coordinates are correct) is too many? (e.g. how worried are you about users complaining that this message is always there)
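For reference, the rule quoted from Tai, including the optional result-count threshold, could be sketched roughly as follows. This assumes the stored geocoder response is a dict with a `results` list in the Google Geocoding API shape. The function name `should_flag_inexact` and the `min_results` parameter are hypothetical; only `has_inexact_coordinates` and `partial_match` come from the discussion.

```python
def should_flag_inexact(geocoder_response: dict, min_results: int = 1) -> bool:
    """Decide whether has_inexact_coordinates should be set automatically.

    Flags the facility when the top result is a partial match. Raising
    min_results (for example to 5) applies the stricter variant that only
    flags responses with an unusually large number of results.
    """
    results = geocoder_response.get("results", [])
    if not results:
        return False

    top_result = results[0]
    is_partial = bool(top_result.get("partial_match", False))
    return is_partial and len(results) >= min_results


# Hypothetical usage when a new facility is created:
# facility.has_inexact_coordinates = should_flag_inexact(response, min_results=5)
```

The threshold trades missed flags for fewer false positives, and as noted in the quoted comment, any particular cutoff is arbitrary.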

katieashaw commented 2 years ago

Really interesting to read all this through, thanks everyone! Mariel raises an excellent point that we will erode the trust of our users if this flag starts to crop up too frequently on facility profiles. I wouldn't want to see it occurring more than 25% of the time, ideally.

jwalgran commented 2 years ago

We have recorded the full response from the Google geocoder for every item that has ever been submitted to the OAR, so it would be possible for us to analyze that data and determine the percentage of results that are partial_match, as well as the distribution of the different location_type values.

Whether or not a partial_match geocode result is "correct" is not something we can determine programmatically. If we do the analysis and produce a list of all the partial match results, we could spot check them to see what the results look like and make more educated assertions about how the partial_match flag corresponds to geocoding result quality.
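A minimal sketch of that analysis, assuming the recorded responses can be loaded as an iterable of dicts in the Google Geocoding API shape (how they are stored and loaded is not covered here). The function name `summarize_geocoder_responses` is hypothetical, and only the top result of each response is counted.

```python
from collections import Counter
from typing import Iterable


def summarize_geocoder_responses(responses: Iterable[dict]) -> dict:
    """Summarize partial_match frequency and location_type distribution."""
    total = 0
    partial = 0
    location_types = Counter()

    for response in responses:
        results = response.get("results", [])
        if not results:
            continue  # skip responses with no geocoding result

        top_result = results[0]
        total += 1
        if top_result.get("partial_match", False):
            partial += 1
        location_type = top_result.get("geometry", {}).get("location_type", "UNKNOWN")
        location_types[location_type] += 1

    percent = 100.0 * partial / total if total else 0.0
    return {
        "total": total,
        "partial_match_percent": percent,
        "location_types": dict(location_types),
    }
```

The summary gives the overall rate only. Judging whether individual partial matches are correct would still require the manual spot check described above.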

jwalgran commented 2 years ago

We have decided to table this issue for now and incorporate it into a larger review of the geocoding process as part of OGR planning and development. Moving this to the backlog and removing the high priority label.