openaq / battuta

Reverse geocoding for air quality stations
MIT License

metadata file issue #5

Open maxgrossman opened 7 years ago

maxgrossman commented 7 years ago

@olafveerman and I have spent some time with the metadata file to see how it may be causing #4.

There are two culprits.

  1. Multiple records of the same station have lat/long values that are the same but are written with different precision. We can handle this by reading in coordinates to a set precision (6 decimals perhaps).
  2. Distinct locations share the same station id.

I'm going to work on the code to handle these issues, and while doing so I'll flag the station ids that have multiple locations and post the list here.
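The fixed-precision idea from the first point could be sketched roughly like this. It is only an illustration in Python, not the project's code: the field names `station_id`, `lat` and `lon` are assumptions about the metadata file, and coordinates are assumed to arrive as strings.

```python
from decimal import Decimal, ROUND_HALF_UP

# Six decimal places, the precision suggested in the comment above.
PRECISION = Decimal("0.000001")


def normalize(value: str) -> str:
    """Round a coordinate string to a fixed number of decimals."""
    return str(Decimal(value).quantize(PRECISION, rounding=ROUND_HALF_UP))


def unique_locations(records):
    """Keep the first record per (station id, normalized lat, normalized lon)."""
    seen = set()
    kept = []
    for rec in records:
        # Hypothetical field names; adjust to the real metadata columns.
        key = (rec["station_id"], normalize(rec["lat"]), normalize(rec["lon"]))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept
```

Working on `Decimal` rather than `float` keeps "47.0677" and "47.067700" comparing as equal strings after rounding, without binary floating-point surprises.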

maxgrossman commented 7 years ago

Regarding the 2nd issue above, I took the metadata file and grouped the records by station id, then by coordinate, to find the unique locations within each station id group. Below is just a small sample of the output (the object key here is longitude).

 {
    "47.067689": [
    ...
    ],
    "47.067691": [
    ...
    ]
  },
 {
    "35.151952": [
    ...
    ],
    "0.000316": [
    ...
    ]
  }

Most location groups look like the first object, where the coordinates differ by only a few hundred-thousandths or ten-thousandths of a degree (and as such the locations are only a few tens or hundreds of meters from one another).
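As a rough check on scale, a small offset in degrees can be converted to meters. This is a sketch under a spherical-Earth assumption (mean radius 6,371 km); the function names are made up for illustration. On that model a hundred-thousandth of a degree of latitude is roughly 1 m, a ten-thousandth roughly 11 m, and a thousandth roughly 111 m, with longitude offsets shrinking by the cosine of the latitude.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def lat_offset_m(delta_deg: float) -> float:
    """Approximate north-south distance in meters for a latitude offset."""
    return math.radians(delta_deg) * EARTH_RADIUS_M


def lon_offset_m(delta_deg: float, latitude_deg: float) -> float:
    """Approximate east-west distance in meters for a longitude offset
    at the given latitude."""
    return lat_offset_m(delta_deg) * math.cos(math.radians(latitude_deg))
```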

I'd think in either case just selecting the 1st of the unique set of coordinates among the records would be a viable solution. If we want to root out certain outliers (like the last record above), maybe we do a spatial select on all of Europe first, then reverse geocode.
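A minimal sketch of the grouping and "pick the first" approach described above, written in Python for brevity. The field names `station_id`, `lat` and `lon` are assumptions, and `EUROPE_BBOX` is a deliberately rough, hypothetical bounding box, not an official extent. A real spatial select would probably use proper country or region polygons, and a plain box like this would also drop any stations in overseas territories.

```python
from collections import defaultdict

# Hypothetical rough box: (min_lon, min_lat, max_lon, max_lat).
EUROPE_BBOX = (-31.5, 27.0, 45.0, 72.0)


def in_bbox(lat, lon, bbox=EUROPE_BBOX):
    """True if the point falls inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def group_locations(records):
    """Map station id -> {(lat, lon): [records]}, preserving file order."""
    groups = defaultdict(dict)
    for rec in records:
        coord = (float(rec["lat"]), float(rec["lon"]))
        groups[rec["station_id"]].setdefault(coord, []).append(rec)
    return groups


def pick_location(coord_groups, bbox=EUROPE_BBOX):
    """Return the first coordinate pair that lies inside the box, else None."""
    for lat, lon in coord_groups:
        if in_bbox(lat, lon, bbox):
            return (lat, lon)
    return None
```

With this, a station whose first record sits near (0, 0) would fall through to its next coordinate pair, and a station with no coordinates inside the box returns `None` so it can be flagged rather than geocoded.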

cc @olafveerman

olafveerman commented 7 years ago

Did some further digging.

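In total, there are 775 station IDs that have multiple coordinates. See the full list of station IDs with multiple coordinates (https://github.com/openaq/battuta/files/1258361/multiple-coords.txt). For the majority, the coordinates differ little (<0.001), but there are some significant differences that may lead to different outcomes of the reverse geocoding:

  - 21 stations have a cumulative difference > 0.01
  - 65 stations have a cumulative difference > 0.001

See this CSV with the results (https://gist.github.com/olafveerman/15a526fffc2059a6f18a089a6c31b9f1).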

This issue is best resolved at the source. @jflasher @RocketD0g Would it be worth sending EEA the list with these issues?
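For anyone wanting to reproduce counts like these, here is one possible way to measure how far a station's coordinates drift. It assumes a "cumulative difference" means the latitude range plus the longitude range across a station's coordinates, in degrees; the thread does not spell out the exact definition, so treat that as a guess.

```python
def cumulative_difference(coords):
    """Latitude range plus longitude range, in degrees, for one station.

    `coords` is a list of (lat, lon) tuples. This definition is an
    assumption, not necessarily the one used to build the CSV.
    """
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    return (max(lats) - min(lats)) + (max(lons) - min(lons))


def count_over(stations, threshold):
    """Count stations whose cumulative difference exceeds the threshold.

    `stations` maps a station id to its list of (lat, lon) tuples.
    """
    return sum(
        1 for coords in stations.values()
        if cumulative_difference(coords) > threshold
    )
```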

jflasher commented 7 years ago

I think it'd be great to report this back up to EEA. I can send that along with associated data if it's all ready to go?

On August 28, 2017 at 17:12:40, Olaf Veerman (notifications@github.com) wrote:

Did some further digging.

In total, there are 775 station IDs that have multiple coordinates. See full list of station id's (https://github.com/openaq/battuta/files/1258361/multiple-coords.txt) with multiple coordinates. The majority of the coordinates differ little (<0.001), but there are some significant differences that may lead to different outcomes of the reverse geocoding:

  - 21 stations have a cumulative difference > 0.01
  - 65 stations have a cumulative difference > 0.001

See this CSV with the results (https://gist.github.com/olafveerman/15a526fffc2059a6f18a089a6c31b9f1).

This issue is best resolved at the source. @jflasher @RocketD0g Would it be worth sending EEA the list with these issues?


olafveerman commented 7 years ago

@jflasher Great. There are a couple of issues with network_timezone that @maxgrossman will report back on in https://github.com/openaq/openaq-fetch/issues/298. Maybe you can bundle that up? Feel free to cc us and defer to us if they have more questions.