ropensci / osmdata

R package for downloading OpenStreetMap data
https://docs.ropensci.org/osmdata

populate osm_id with an actual unique reference #334

Closed joostschouppe closed 1 year ago

joostschouppe commented 1 year ago

When using osmdata, a column with an osm_id value is returned. However, this value is purely numeric. OSM data is only uniquely identifiable if you also know whether it is a node, way or relation. Points will of course come from nodes, lines from ways and multipolygons from relations. Regular polygons are likely to come from closed ways, but could just as well come from a relation (usually unneeded, but it happens).

It might be better to format the osm_id column with data like node/123, relation/456. For the time being, is there any way to query which OSM data type was the source of the object in the downloaded data?

mpadge commented 1 year ago

I'm not exactly sure what you mean. The osm_id values in this package are the direct OSM values, which are in turn 64-bit integer identifiers. An osmdata_sc object inherits all of these directly, while an osmdata_sf object sometimes breaks single osm_id objects (ways or relations) into multiple sub-objects, as explained in the OSM-to-sf vignette.

Aside from that, all osm_id values remain at all times identical to OSM identifiers. If you have a more specific question, please try to reduce it to a reprex which you can paste here to help explore more.

joostschouppe commented 1 year ago

I'll try to rephrase. I now see an osm_id with the number 123456. However, in OSM there can be three objects with the number 123456: a node, a way, and a relation. So if you want to link directly to an OSM object, the osm_id should be either node/123456, way/123456 or relation/123456.

mpadge commented 1 year ago

Ah okay, I get you. I will presume you're only interested in osmdata_sf objects, right? Because SC objects already have all that information. For sf objects, the nodes are all explicitly identifiable, either as the elements of "osm_points", or the row names of all coordinate matrices within the sf objects themselves.

But the way and relation objects may indeed be mapped on to elements of (multi-)line/polygon objects, with no disambiguation of where the ID values originated. One important design choice of this package is to stay as true to the underlying data model as possible, so renaming "osm_id" values is not an option, but I imagine what could be usefully done would be to add an additional column, "osm_type", with values either "node", "way", or "relation." Would that solve your issue?

That should be pretty easy to implement, but at some stage we'll need to document motivations for this new feature, and so I'd like to ask you to please provide a concise example where such ambiguity is problematic. I guess it would suffice to demonstrate inability to use the "opq_osm_id()" function without knowing the OSM type? Feel free to paste a prototype use-case here, and I'll ping you from a PR when it's far enough.

Thanks for the constructive thoughts!

mpadge commented 1 year ago

Hmm, actually the distinction is precise: all OSM relations are mapped on to sf "multi"-type objects:

https://github.com/ropensci/osmdata/blob/38431337f71ca9fd01e528df42f33c86aefe9272/src/osmdata-sf.cpp#L498-L511

And all OSM ways are mapped on to either "line" or "polygon" objects:

https://github.com/ropensci/osmdata/blob/38431337f71ca9fd01e528df42f33c86aefe9272/src/osmdata-sf.cpp#L513-L537

So the above thought would then just reduce to a (presumably entirely redundant) insertion of "way" in "line"/"polygon" objects, and "relation" in "multi"-type objects.
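To make that mapping concrete, here is a minimal base-R sketch of what such an "osm_type" column could look like. It is not code from the package: `layer_to_type`, `add_osm_type()` and the toy data are hypothetical names and made-up values. The toy object uses plain data frames in place of real sf layers, and the lookup assumes the layer names of an osmdata_sf object.

```r
# Hypothetical lookup table: names are osmdata_sf layer names, values follow
# the mapping described above (ways -> line/polygon, relations -> "multi").
layer_to_type <- c(
  osm_points        = "node",
  osm_lines         = "way",
  osm_polygons      = "way",
  osm_multilines    = "relation",
  osm_multipolygons = "relation"
)

# Toy stand-in for an osmdata_sf result: plain data frames with made-up ids.
# Ids are kept as character because R has no native 64-bit integer type.
toy <- list(
  osm_polygons      = data.frame(osm_id = c("123456", "222"),
                                 stringsAsFactors = FALSE),
  osm_multipolygons = data.frame(osm_id = "123456",
                                 stringsAsFactors = FALSE)
)

# Add an "osm_type" column to every known layer present in the object.
add_osm_type <- function(x) {
  for (layer in intersect(names(x), names(layer_to_type))) {
    x[[layer]]$osm_type <- rep(layer_to_type[[layer]], nrow(x[[layer]]))
  }
  x
}

typed <- add_osm_type(toy)
typed$osm_polygons$osm_type       # every row is labelled "way"
typed$osm_multipolygons$osm_type  # every row is labelled "relation"
```

The two toy layers deliberately share the id "123456": with the extra column, the way and the relation are no longer confused.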

This issue then really extends back to the complex and difficult mapping between OSM data structures and simple features. Difficulties and inaccuracies with that were a large part of the motivation for developing the "osmdata_sc" class, which preserves a far more OSM-faithful representation of the data.

I'll wait for further thoughts from you before doing anything here.

joostschouppe commented 1 year ago

Hey, thanks for thinking along! The use-case is indeed just that in our final product for users, we like to provide a URI to the source of the object they see. In our processing flow, the sf objects are more than good enough. Since relations are always pushed to the multi objects (even if that happens to not be needed for every individual object), my issue is non-existent and I can just create a full OSM URL (which functions as a URI) without any issue.
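For anyone landing here later, building such a URL could be sketched roughly as follows in base R. `osm_url()` is a hypothetical helper, not part of osmdata, and the id is made up; the element type is assumed to be known already from the layer the feature came from.

```r
# Hypothetical helper: build an openstreetmap.org URL from an OSM element
# type and id. Pass ids as character: large numeric ids may otherwise be
# converted to scientific notation by as.character().
osm_url <- function(type, id) {
  stopifnot(all(type %in% c("node", "way", "relation")))
  sprintf("https://www.openstreetmap.org/%s/%s", type, as.character(id))
}

# The same (made-up) number yields three different links, one per type.
urls <- osm_url(c("node", "way", "relation"), "123456")
urls
```

For an osmdata_sf result the type would be "way" for the line and polygon layers and "relation" for the "multi" layers, as explained above.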