pelias / openstreetmap

Import pipeline for OSM in to Pelias
MIT License
112 stars 72 forks source link

use array offset instead of value for address ID generation #547

Closed missinglink closed 3 years ago

missinglink commented 3 years ago

while working on https://github.com/pelias/openstreetmap/pull/546 I noticed this, which might be a bug, might be intended, not sure.

a single OSM record can become multiple records in Pelias. ie. it can become a single venue and multiple address records.

so we generate a unique ID for each of those records in Pelias.

what's currently happening here is the first address has the same ID as the venue (if there is one), since its on a different layer it'll get a different GID, all good.

for subsequent address records we append the housenumber value to the ID:

Screenshot 2020-12-23 at 20 46 51

I think that is a bit dangerous since the housenumber field is a freeform OSM string which could contain all sorts of characters such as spaces, or : which is used as a delimiter in the GID scheme.

This PR changes that to use the array offset instead of the array value, a subtle difference but it'll ensure that these IDs end with a number and don't contain any punctuation or alpha chars.

I might assign a couple of reviewers here since this slightly changes how we generate IDs, I'm assuming it won't break anything?

orangejulius commented 3 years ago

Nice, yeah, we should definitely be careful with what goes in our GIDs. An incrementing counter is much better than trusting data in an OSM tag.

This will, technically, be a breaking change, but I imagine the number of addresses involved is extremely small. OSM IDs are not stable anyway, so it's probably not a big concern.

However, while we're at it, should we change the final delimiter away from :? In the past we've talked about trying to crack down on colons appearing outside of the source:layer:id boundaries.

missinglink commented 3 years ago

However, while we're at it, should we change the final delimiter away from :? In the past we've talked about trying to crack down on colons appearing outside of the source:layer:id boundaries.

Yeah that would probably be a good idea, the way it currently is the GID must be parsed as {source}:{layer}:{id}... (ie. everything after the second delim is part of the ID, regardless of whether it contains a delim or not.

It would be cleaner to be able to do .split(':')[2] and be sure that it contained the ID, which isn't currently possible and kinda non-obvious for something which is a public API which integrators have to write themselves.

That being said .split(':')[2] does actually return the correct OSM ID in this case, so maybe it's correct 🤷‍♂️

Can we discuss that in a separate Issue/PR since I could also see us hashing it out and coming to the conclusion that the GID scheme is actually {source}:{layer}:{id}[:{ord}]?

missinglink commented 3 years ago

ok ok, so literally two minutes after writing that comment I flip-flopped on my position and pushed a second commit ;)

In the example above the GID is: openstreetmap:address:way/824683453:1338 It would be cleanest if there were two steps:

  1. split the three components on :
  2. split the ID parts on / (which is also a problematic delim for URLs but what we're currently using)

So parsing it is easy:

var gid = "openstreetmap:address:way/824683453/1"
[ source, layer, id ] = gid.split(':')
[ osmType, osmId, ord ] = id.split('/')
missinglink commented 3 years ago

[Y/n]?

orangejulius commented 3 years ago

Yeah, that's probably the best we can do. It's nice that most OSM IDs line up with the URL one uses to reference that record on openstreetmap.org, but when Pelias creates multiple documents from a single OSM record we have to keep it unique somehow.

missinglink commented 3 years ago

cc/ @blackmad this change could potentially break the way the links are generated by the compare app when clicking off to openstreetmap?