openaq / project-universal-stationID

This repo houses project discussions for creating a universal station ID system.

Proposal unique ID #12

Open olafveerman opened 5 years ago

olafveerman commented 5 years ago

Thanks to everybody who contributed to the discussion around incorporating metadata and a unique ID into OpenAQ. We’ve taken some time to process the different ideas. This is the first of two posts that describe our proposal, and it focuses on:

  1. the structure of the location ID; and
  2. how to determine a station’s uniqueness.

The other post (#13) contains a write-up of the technical approach, a notional work plan, and how this affects the inner workings of OpenAQ.

ID structure

The goal is to create an ID that is easy to read for machines and humans, and that is stable over time. We propose to use an ID consisting of a country code followed by a hyphen and six digits, e.g. AF-123456.

The country code quickly conveys the most basic information to users. It's unlikely that this will change over time. Six digits should provide us with enough unique IDs within a namespace.
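
As a rough illustration of the proposed format, here is a minimal Python sketch that builds and parses such IDs. The exact pattern (two uppercase letters, a hyphen, six zero-padded digits) is an assumption read off the AF-123456 example, and the function names are hypothetical, not part of OpenAQ.

```python
import re
from typing import Tuple

# Assumed pattern, inferred from the example AF-123456:
# two uppercase letters, a hyphen, exactly six digits.
STATION_ID_RE = re.compile(r"([A-Z]{2})-(\d{6})")


def format_station_id(country: str, number: int) -> str:
    """Build an ID such as AF-123456 from a country code and a number."""
    if not re.fullmatch(r"[A-Z]{2}", country):
        raise ValueError(f"unexpected country code: {country!r}")
    if not 0 <= number <= 999_999:
        raise ValueError(f"number does not fit in six digits: {number}")
    return f"{country}-{number:06d}"


def parse_station_id(station_id: str) -> Tuple[str, int]:
    """Split an ID into (country code, number); raise ValueError if malformed."""
    match = STATION_ID_RE.fullmatch(station_id)
    if match is None:
        raise ValueError(f"not a valid station ID: {station_id!r}")
    return match.group(1), int(match.group(2))
```

Six digits give 1,000,000 possible values (000000 to 999999) per country code.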

More detailed metadata in the ID

Some of the feedback from the community suggested inclusion of more detailed metadata in the ID. The metadata endpoints that will be built as part of this project already allow users to query data based on metadata. We don’t think that encoding this information in the ID would provide added value, and it would make the IDs more prone to change as metadata changes.

More precise location

Some of the comments from the community included proposals to encode more fine-grained location information in the ID, including the altitude and atmospheric exposure.

In the current set of locations, we haven’t found cases where stations are so close that they require a higher precision than 5 decimals, let alone additional contextual information like altitude. In addition, none of the current measurements or locations have this information at this stage. If this information becomes available for a given location, it can be captured in the metadata format.

We can imagine that this precision would be required when looking at low-cost sensors. With OpenAQ’s current focus on government and research sources, we think this can be dealt with at a later stage.

Determining uniqueness

The decision that will affect OpenAQ most is how to determine a station’s uniqueness. Up until now, it has been up to the adapter to produce a unique location name. This is often based on some attribute in the original data. If that original attribute changes, the location in OpenAQ changes. This occurs frequently, and 2,385 coordinates in OpenAQ currently have more than one associated location.

Unique Coordinate === Station

Given that most locations (96.4%) have coordinates, we propose to use those to determine uniqueness. This is in line with the comments from the community, most of which rely on constructing uniqueness through some form of station’s geospatial position.

By using coordinates to determine unique stations, we will be able to fix a fair number of duplicate stations. Currently OpenAQ has 2,385 unique sets of coordinates that have multiple location names. E.g. https://api.openaq.org/v1/locations?coordinates=-21.294338,55.627903&radius=10
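
To make the duplicate check concrete, here is a small Python sketch that groups location records by their exact coordinates and reports the coordinates carrying more than one name. The record shape (a `location` name plus a `coordinates` object with `latitude` and `longitude`) is an assumption about the data, and the two sample names are invented for illustration.

```python
from collections import defaultdict


def names_by_coordinates(locations):
    """Return {(lat, lon): [names]} for coordinates with more than one name."""
    groups = defaultdict(set)
    for loc in locations:
        coords = loc.get("coordinates")
        if not coords:
            continue  # locations without coordinates cannot be grouped this way
        key = (coords["latitude"], coords["longitude"])
        groups[key].add(loc["location"])
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Invented sample records; only the coordinates come from the URL above.
sample_locations = [
    {"location": "Name A", "coordinates": {"latitude": -21.294338, "longitude": 55.627903}},
    {"location": "Name B", "coordinates": {"latitude": -21.294338, "longitude": 55.627903}},
    {"location": "Name C", "coordinates": {"latitude": 10.0, "longitude": 20.0}},
    {"location": "Name D"},  # no coordinates
]

duplicates = names_by_coordinates(sample_locations)
```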

Rounding decimals

In addition to duplicate location names, there are locations whose coordinates differ only slightly. To deduplicate these locations, we propose to round the coordinates to either 4 or 5 decimals when determining uniqueness.

This translates into a precision of ~1.1 meters (5 decimals) or ~11.1 meters (4 decimals) at the equator. North-south this stays roughly constant, while east-west the same rounding step covers a shorter distance at higher latitudes.
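
The arithmetic behind these figures can be sketched as follows. This assumes a spherical Earth with the WGS84 equatorial radius, which is an approximation chosen for this example and not something specified in the proposal; the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # WGS84 equatorial radius; spherical approximation
METERS_PER_DEGREE = 2 * math.pi * EARTH_RADIUS_M / 360  # about 111,319 m


def rounding_step_m(decimals, latitude_deg=0.0):
    """Return (north_south, east_west) size in meters of one rounding step.

    One step is 10**-decimals degrees. The east-west size is scaled by
    cos(latitude) because meridians converge toward the poles.
    """
    step_deg = 10.0 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


five_decimals = rounding_step_m(5)        # roughly 1.11 m in both directions
four_decimals = rounding_step_m(4)        # roughly 11.1 m in both directions
five_at_60 = rounding_step_m(5, 60.0)     # east-west step is about half as large
```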

For more insight into duplicate coordinates, please see this sheet.

What if a coordinate changes over time?

If a station’s coordinates (rounded to 5 decimals) change, this is considered a new station.
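
A minimal sketch of this rule, assuming IDs are handed out sequentially per country the first time a rounded coordinate pair is seen. The sequential numbering, the class name, and the sample coordinates are all assumptions made for illustration; the proposal does not say how the six digits are assigned.

```python
from typing import Dict, Tuple


class StationRegistry:
    """Assign one ID per (country, rounded latitude, rounded longitude)."""

    def __init__(self, decimals: int = 5) -> None:
        self.decimals = decimals
        self._ids: Dict[Tuple[str, float, float], str] = {}
        self._next_number: Dict[str, int] = {}

    def station_id(self, country: str, latitude: float, longitude: float) -> str:
        key = (country, round(latitude, self.decimals), round(longitude, self.decimals))
        if key not in self._ids:
            # Unseen rounded coordinates count as a new station.
            number = self._next_number.get(country, 1)
            self._next_number[country] = number + 1
            self._ids[key] = f"{country}-{number:06d}"
        return self._ids[key]


registry = StationRegistry()
first = registry.station_id("AF", 34.512341, 69.178912)
same = registry.station_id("AF", 34.512339, 69.178908)   # rounds to the same pair
moved = registry.station_id("AF", 34.512500, 69.178912)  # rounds to a different pair
```

One caveat of plain rounding: two readings that are very close together can still land on either side of a rounding boundary and receive different IDs.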

What about locations without coordinates?

381 locations currently have no coordinates in OpenAQ. Of these, 155 have reported data within the past 3 months. The active stations are mostly located in Israel, Thailand, South Africa, and Australia.

Our current working assumption is that we’ll be able to continue to rely on the location name as the unique ID. If, at any stage in the future, coordinates are added for these locations, they automatically receive a station ID.

maschu09 commented 5 years ago

Sounds great! Only one question: "Our current working assumption, is that we’ll be able to continue to rely on the location name as the unique ID. If, at any stage in the future, coordinates are added for these locations, they automatically receive a station ID." This implies that the unique ID field for those stations would be empty until they provide a location? I would prefer to assign stationID to these sites immediately and simply add coordinates when they are provided. Otherwise searching for unique stationID would require two queries.

RocketD0g commented 5 years ago

I hear you @maschu09. The core problem with assigning a Station ID w/o coordinates and therefore relying on the location name is that the location name isn't stable over time. Agencies change them (sometimes doing things like changing a station called Anand Vihar to AnandVihard or Anand-Vihar) so the unique ID system would get populated with multiple Station IDs for what is physically the same station and defeat the purpose of implementing the universal Station ID.

So practically, if the no-coordinate stations were assigned a station ID based on their location name, even if you could get the info you want in one query, you'd have to worry that you're not really getting back a true set of stations, but rather a set of stations with potential duplicates in it. Does that make sense? (@olafveerman, @jflasher, feel free to chime in if you think of this differently or have additional comments.)

Pulling in @maschu09's comment here, I also agree that it'd be good to assign 'XX' (or some equivalent) to data sources where the country is not known.

maschu09 commented 5 years ago

@RocketD0g Sure, you are right. My point is not about the "correctness" of such poorly defined station IDs, but rather the completeness of a metadata record. It is the distinction between an optional (0..n) and a mandatory (1..n) metadata field. I would like to advocate that stationid as a field is mandatory. Now there are two possibilities for poorly defined stationids:

  1. always assign a "null" or "none" value if the coordinates are unknown. Advantage: you make no mistakes. Disadvantage: all stations without coordinates share the same "none" information.
  2. use the location name even if it may change occasionally. Advantage: this provides a starting point for some poor chap to sift through all of these records and try to identify sites with "similar" names so that they can be combined into one record. Disadvantage: if there is no log to distinguish "fake" ids from real ids, then it will be difficult to identify stations without a proper id in the first place.

So, maybe, we need to separate this problem and define this as a different task. Perhaps I am also not 100% sure yet how the new "locations" table is supposed to be built and what relation it shall have with the buckets. Am I hearing correctly that your intention is to have this table only loosely coupled to the buckets? I.e. you can use locations to build your own locations table, then when reading buckets you can link each data record to a location via the coordinates (which are associated with a specific stationid). If a record in the buckets has no coordinates, it cannot be associated with a stationid and thus can hardly be used for any analysis. OR are you seeing this as a stronger relation similar to a relational database where every record in the buckets is associated with exactly one entry in the locations table (N:1 relation)?

In the end this comes down to who is responsible for data (i.e. table) consistency: the service provider or the user. In the ideal world it should be the provider, but I see that with limited resources it often has to be the user. Such is life...

RocketD0g commented 5 years ago

Got you, @maschu09

For the number 2 option you lay out, if I understand right, location will already get you back that info: basically a total of all stations that have reported data to the system, with or without coordinates, duplicates or not. The unique station ID can then at least point to the stations (or 'locations' in OpenAQ data format parlance) that are concretely identified by coordinates and are not duplicates.

The number 1 option makes sense to me.

@olafveerman, would you agree with my statement?

For this piece:

So, maybe, we need to separate this problem and define this as a different task. Perhaps I am also not 100% sure yet how the new "locations" table is supposed to be built and what relation it shall have with the buckets. Am I hearing correctly that your intention is to have this table only loosely coupled to the buckets? I.e. you can use locations to build your own locations table, then when reading buckets you can link each data record to a location via the coordinates (which are associated with a specific stationid). If a record in the buckets has no coordinates, it cannot be associated with a stationid and thus can hardly be used for any analysis. OR are you seeing this as a stronger relation similar to a relational database where every record in the buckets is associated with exactly one entry in the locations table (N:1 relation)?

I'm going to have to rely on input from @olafveerman and @sethvincent on how they will technically accomplish the table piece.

jflasher commented 5 years ago

@maschu09 I'll defer to Development Seed on the implementation pieces, but I'm in favor of option 1. It essentially lets you get the behavior of option 2 client side by doing something like stationID || location anywhere you want a way to reference a station. That would preferentially use the station ID if it's there and fall back to location if needed, while keeping it more transparent as to what's happening, I think.
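
For readers working in Python, the same client-side fallback could look like the sketch below. The field names `stationID` and `location` follow the wording above, but the record shape and the function name are assumptions.

```python
def station_reference(record):
    """Prefer the station ID; fall back to the location name when it is missing.

    Mirrors `stationID || location`: None and the empty string both count
    as "missing", so the location name is used in either case.
    """
    return record.get("stationID") or record["location"]


with_id = station_reference({"stationID": "AF-123456", "location": "Some name"})
without_id = station_reference({"stationID": None, "location": "Some name"})
```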

For your second part, I've been thinking of this as you first described it. The conflation (or joining) of stationID with the record can be done by the system when the system is involved, as when using the API or something like the Athena queries. But the S3 buckets have pretty much the raw source data, which stationID is not a part of. So if you're reading from those buckets, I think it would be up to the user to match to a stationID using the coordinates. But we can try to make this as easy as possible by also storing the location data in an S3 bucket, so that nobody needs to access a database table (I'm happy to do that if it doesn't get done via this initial work).
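
A user-side match of raw records to station IDs via coordinates might be sketched like this. The field names, the 5-decimal rounding, and the flat list-of-dicts inputs are assumptions made for the example; the real bucket and locations formats may differ.

```python
def attach_station_ids(measurements, stations, decimals=5):
    """Return copies of the measurements with a `stationID` field added.

    Records are matched on coordinates rounded to `decimals` places.
    Records without coordinates, or without a matching station, get None.
    """
    index = {
        (round(s["latitude"], decimals), round(s["longitude"], decimals)): s["stationID"]
        for s in stations
    }
    matched = []
    for record in measurements:
        coords = record.get("coordinates")
        station_id = None
        if coords:
            key = (round(coords["latitude"], decimals), round(coords["longitude"], decimals))
            station_id = index.get(key)
        matched.append({**record, "stationID": station_id})
    return matched
```

Combined with a fallback to the location name, records that come back with `stationID` set to None can still be referenced, just without the deduplication guarantee.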

Though it just occurred to me that maybe this is all made easier if we use a hashing function to generate the ids? If we did that, you would not need to look up in a table what the unique id would be. Your code could generate the hash the same as the OpenAQ system and both would come up with the same unique stationID. I'm sure there are some downsides here, but I do not immediately see them?
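
A rough sketch of what such a hash-based ID could look like, using SHA-256 over the country code and the rounded coordinates. The canonical string format, the 12-character hexadecimal suffix, and the function name are all choices made for this example, not an agreed scheme.

```python
import hashlib


def hashed_station_id(country, latitude, longitude, decimals=5, length=12):
    """Derive an ID from the country code and rounded coordinates alone."""
    # Adding 0.0 turns a rounded -0.0 into 0.0 so both format identically.
    lat = round(latitude, decimals) + 0.0
    lon = round(longitude, decimals) + 0.0
    canonical = f"{country}:{lat:.{decimals}f},{lon:.{decimals}f}"
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{country}-{digest[:length]}"


example_id = hashed_station_id("AF", 34.512341, 69.178912)
```

One possible downside, offered tentatively: a digest squeezed into six decimal digits would start colliding after far fewer stations than a sequential counter, so this sketch keeps a longer hexadecimal suffix and therefore departs from the AF-123456 shape. Any client would also need to reproduce the rounding and string formatting exactly to arrive at the same ID.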