sharedstreets / sharedstreets-ref-system

Making maps connectable: stable, non-proprietary IDs and data standards for streets
MIT License

Using reference IDs as stable identifiers in historical time series data #23

Open schnerd opened 6 years ago

schnerd commented 6 years ago

Hey folks – wanted to get some clarification on the reliability of using SharedStreets Reference IDs as a stable identifier through time.

Each week we want to process GPS traces into speed profiles matched to the shared streets referencing system. In order to support complex filtering and aggregation across multiple years of data, these results need to be stored in a database (rather than static pbf files) with a schema like the following:

| ss_ref_id | datetime | speed_p85 |
|---|---|---|
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 07:00:00 | 34 |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 08:00:00 | 32 |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 09:00:00 | 37 |
| ... | ... | ... |
*this schema may also contain Location Reference column(s) in practice
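For reference, a minimal sketch of this table in SQLite (via Python's `sqlite3`; the table name and column types are assumptions, column names as above):

```python
import sqlite3

# Minimal in-memory sketch of the schema above; types are assumptions.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE speed_profiles (
        ss_ref_id TEXT NOT NULL,   -- SharedStreets reference ID
        datetime  TEXT NOT NULL,   -- hour bucket
        speed_p85 REAL,            -- 85th-percentile speed
        PRIMARY KEY (ss_ref_id, datetime)
    )
""")
con.execute("INSERT INTO speed_profiles VALUES (?, ?, ?)",
            ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-05 07:00:00", 34))

# The kind of aggregation we need, e.g. a speed histogram for one segment:
rows = con.execute("""
    SELECT speed_p85, COUNT(*) FROM speed_profiles
    WHERE ss_ref_id = ? GROUP BY speed_p85
""", ("45f4b95b62f28464caca1f76e48efcb3",)).fetchall()
```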

With each new week, we mapmatch data against the latest version of OSM. Since SharedStreets Reference IDs are essentially hashes of the underlying geospatial data with a tolerance of +/- 1.1m (as pointed out in https://github.com/sharedstreets/sharedstreets-js/issues/16), any OSM update that moves an intersection by more than ~1m will produce new Reference IDs:

| ss_ref_id | datetime | speed_p85 |
|---|---|---|
| ... | ... | ... |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-07 23:00:00 | 42 |

*New week (OSM updated):*

| ss_ref_id | datetime | speed_p85 |
|---|---|---|
| 763c212d53f8b4ba4fce92e884988c9e | 2018-01-08 00:00:00 | 43 |
| 763c212d53f8b4ba4fce92e884988c9e | 2018-01-08 01:00:00 | 42 |
| ... | ... | ... |

These changing IDs would prevent us from being able to easily run aggregate queries over long periods of time – for example, to create a histogram of speeds on a segment over all of 2017. It's also ambiguous which version of tiles we would load in this scenario – if we pick tiles from the end of 2017, many of our reference IDs from earlier in the year will not match up.

From https://github.com/sharedstreets/sharedstreets-js/issues/16, it sounds like these hash IDs were never meant to match up across datasets / basemap versions, and that fuzzy matching on the underlying geospatial data is instead the way to reconcile them. However, this requires a non-trivial amount of work and seems like a significant divergence from OSMLR, which used tolerance levels of ~20m to make its identifiers more stable.

From #22, it sounds like there may be ways to subscribe to changing SS References in the future, but it could be cumbersome to continuously apply these migrations to historical datasets with billions of observations.

While it's not a panacea, it seems like generating IDs using a higher tolerance level for underlying geospatial changes would increase stability and the likelihood that datasets/tiles continue matching. Is there any reason the referencing system isn't designed this way?

migurski commented 6 years ago

I’ve given this a small amount of thought, and I’m unsure that more tolerance addresses the issue. A short move across a decimal-place boundary (e.g. from 37.123999°N to 37.124000°N) would still create a new identifier, right?
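Concretely (a toy sketch in Python; this is not how SharedStreets actually derives IDs, it just illustrates the rounding-boundary effect):

```python
import hashlib

def toy_ref_id(lon: float, lat: float) -> str:
    """Toy ID: hash coordinates rounded to 6 decimal places (~0.11m).
    NOT the real ShSt algorithm, just an illustration of the problem."""
    key = f"{lon:.6f} {lat:.6f}"
    return hashlib.md5(key.encode("utf8")).hexdigest()

# A sub-centimeter move that happens to cross a rounding boundary
# still changes the ID, no matter how coarse the rounding is:
a = toy_ref_id(-122.0, 37.1239994)  # rounds to 37.123999
b = toy_ref_id(-122.0, 37.1239996)  # rounds to 37.124000
assert a != b
```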

schnerd commented 6 years ago

Ahhh, that is a great point. I guess OSMLR's "tolerance" levels actually referred to the distance between old and new coordinates, not necessarily anything to do with precision + hashing.

To rephrase the question then – is providing stable IDs across separate or evolving basemaps within the purview of SharedStreets? If not, what is the recommended process for reconciling datasets that were mapmatched to different basemaps or to sharedstreets geometry tiles generated from planet builds many months/years apart?

kpwebb commented 6 years ago

@schnerd this is a great question. The short answer is we absolutely want to provide stable IDs over multiple map versions, and see that as the core application of ShSt.

I'm just not sure fuzzing IDs alone fixes the problem.

First, tolerances are application- and location-specific. For something like urban parking we need to be much more precise than 20m, with IDs uniquely identifying street sections at ~1-2m precision, but for traffic data on a poorly mapped rural highway 1km might be fine!

Second, roads are actually being created and moved in OSM so we need to track what happened to a given ID over time and decide if we care for a given application.

For traffic you may not care if 2017 OSM mapped the road and 2018 OSM improved the alignment by moving it 25 meters. You may care a lot if 2018 OSM also made the road 30% longer, as that might indicate your 2017 traffic speeds were computed against an incorrect geometry.

Doesn't mean you can't compare them, but our view is that we should create a map of how IDs evolved and then generate application-specific translations for old data. You could create a rule-based translation for traffic IDs, e.g. 2017 traffic IDs are carried forward automatically except for roads where the 2017 and 2018 lengths differ by 25% or more...
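A toy version of that rule (the function name and 25% threshold follow the example above; this is not a real ShSt API):

```python
def carry_forward(len_2017_m: float, len_2018_m: float,
                  threshold: float = 0.25) -> bool:
    """Carry a 2017 traffic ID forward to its matched 2018 reference,
    unless the geometry's length changed by `threshold` or more."""
    if len_2017_m <= 0:
        return False
    return abs(len_2018_m - len_2017_m) / len_2017_m < threshold

# 10% longer: migrate automatically; 30% longer: flag for review.
assert carry_forward(100.0, 110.0) is True
assert carry_forward(100.0, 130.0) is False
```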

We're already processing multiple OSM planet snapshots (we've got a half-dozen versions for 2017 and 2018), but we're intending to do this ~weekly against planet, and we'd like to set up a minutely OSM changes ingest process to keep in sync. We're thinking of #22 as a way to track the modifications, so you'd have a list of everything that happened to get from the 2017 ID to the 2018 ID.

If that works for you, let's chat about a process for updating the IDs on old data at global scale.

kpwebb commented 6 years ago

Also, on your original question about tolerances, check this out:

https://beta.observablehq.com/@kpwebb/sharedstreets-api

That's a set of interactive demos backed by a global SharedStreets API applying adjustable tolerances for matching existing references to arbitrary map data (in the demos it's GIS data exported directly from Seattle DOT's open data portal).

Note that the tolerances are applied at match time. Mapzen did this association within Valhalla as it ingested OSMLR edges and linked them against the Valhalla graph -- the OSMLR docs just listed the hard-coded tolerances used during the match.

Our API is doing the same thing but allows ingest and association of any GIS data source. This process generates a SharedStreets-style reference for the geometry on import and then matches the new reference against all existing/known ShSt references. We also generate a match "score" to rank candidate edges found in ShSt; on a complex urban highway there might be several candidates considered for each geometry. At the moment the score is the RMSE of the distances from the new geometry's start and end points to their snapped locations along the matched line, plus the difference in length between the new and matched ShSt geometry.
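Roughly, the scoring looks like this (a simplified planar sketch; the actual service may combine or weight the terms differently):

```python
import math

def match_score(start_snap_dist_m: float, end_snap_dist_m: float,
                new_len_m: float, matched_len_m: float) -> float:
    """Lower is better: RMSE of the two endpoint snap distances, plus the
    absolute length difference between the new and matched geometry."""
    endpoint_rmse = math.sqrt((start_snap_dist_m ** 2 + end_snap_dist_m ** 2) / 2)
    return endpoint_rmse + abs(new_len_m - matched_len_m)

# Rank candidate ShSt edges for one input geometry by ascending score:
candidates = [("ref_a", match_score(3.0, 4.0, 100.0, 100.0)),
              ("ref_b", match_score(0.5, 0.5, 100.0, 140.0))]
best = min(candidates, key=lambda c: c[1])
```

Here `ref_b` snaps almost perfectly at its endpoints but is 40m longer than the matched geometry, so `ref_a` wins.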

We may switch to an HMM approach, like the one used in the GPS trace matcher, but when working with GIS data it didn't seem like a good fit, as we often don't know the sequence of the points and don't get time hints from GIS. Without time attached to points, the ShSt reference (start/end + length + bearing) captures most of the information needed to match one segment to another.
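i.e., something like this naive field-by-field comparison (the tolerances and data layout are illustrative placeholders, not values from the ShSt spec):

```python
import math
from dataclasses import dataclass

@dataclass
class Ref:
    start: tuple       # (x, y) in planar meters, for simplicity
    end: tuple
    length_m: float
    bearing_deg: float

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def references_match(a: Ref, b: Ref, dist_tol_m: float = 10.0,
                     len_tol: float = 0.1,
                     bearing_tol_deg: float = 15.0) -> bool:
    """Compare two segment references on endpoints, length, and bearing."""
    bearing_diff = abs((a.bearing_deg - b.bearing_deg + 180) % 360 - 180)
    return (_dist(a.start, b.start) <= dist_tol_m
            and _dist(a.end, b.end) <= dist_tol_m
            and abs(a.length_m - b.length_m) <= len_tol * max(a.length_m, b.length_m)
            and bearing_diff <= bearing_tol_deg)
```

A segment nudged a couple of meters still matches; the same segment digitized in the opposite direction (bearing off by ~180°) does not, which is the point of keeping bearing in the reference.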

Keep in mind this geometry matching API is useful for ingesting unknown GIS data, but we can avoid it for OSM-derived data sets, since we already know their way/node IDs in ShSt.

Data generated against an older map where the node IDs remain associated with the same intersections matches automatically even if the locations of the intersections or the geometry change. If we're tracking minutely updates, every node modification would be tracked and we could go just off changes using OSM IDs.

In cases when you don't know the corresponding node you can fall back to the geometry matching process for one or both of the nodes (you could pin one intersection that hasn't changed to an existing ShSt intersection ID and look for the other end...).

FWIW, we can do the same thing with proprietary non-OSM basemaps or GIS-derived data using the geometry matching API above. The Seattle example creates an automatic crosswalk between city and county centerline datasets, which are built off different GIS maps.

schnerd commented 6 years ago

Appreciate the detailed reply @kpwebb, this looks promising! Will dig into this a bit deeper over the next few days and experiment with the /api/geom API – after which it'd be great to chat to work through any unanswered questions.