Linear referencing + stable IDs for OSM edges

kpwebb commented 9 years ago

How do we solve linear referencing in OSM?

Possible paths:

OpenLR

Existing standard. Who’s using besides TomTom & INRIX?
Advantage of not requiring new IDs/curation of IDs as basemap changes
Disadvantage is that it’s completely car-centric. Do we need bike/ped LR features? Can we extend OpenLR to include them?
- Are we creating a new standard if we augment OpenLR given hardware device interoperability?
Does the OpenLR binary format work or are we better served with standards-based/future-proof formats like PBF etc.?
Telenav TTLs
http://wiki.openstreetmap.org/w/images/1/16/SotM2012_Telenav_Traffic_Locations.pdf
Not widely adopted. Is it really “open”?
Requires ID curation
Appears to be car/traffic centric
OpenStreetMapLR
Is there an opportunity to build something that takes advantage of having a truly shared/open basemap? Or taking advantage of specific OSM characteristics/features?
Do we use the ideas/features of OpenLR but build around OSM conventions?
Do we build an ID catalog system for TTLs that works with real-time map updates?
Do the upsides of building in multimodal functionality and better exchange formats outweigh the downsides of “yet another standard”? http://xkcd.com/927/
OpenStreetMapLR + OpenLR?
Can we mitigate downsides by producing both as output streams. Compliant OpenLR for backward compatibility and OSMLR for full feature set?

kpwebb commented 9 years ago

Regarding OpenLR limitations:

We may need more than just bike/ped additions. The need to understand vehicle type both in terms of speed and network restrictions could be very important in the context of urban street networks.

These networks are increasingly stratified by vehicle type (e.g. bus/taxi have different infrastructure than cars, trucks have their own set of restrictions based on height and trip type). Tolling and other restrictions can create vastly different street topologies and observed speed spending on the type of vehicle.

Doesn't need to be solved 100% but the architecture should allow for this.

NeilTaylor1982 commented 9 years ago

+1 on being able to disaggregate linear references across the types of fleet vehicles contributing data Kevin. This is important for transport planners/city authorities. Also important in terms of demonstrating how 'mixed'/mode specific the traffic data sample you are collecting really is.

Could these potentially be identified and referenced through a combination of 'known' fleet data providers (e.g. Taxi, Bus, Bikeshare operators) using TE to contribute directly, and 'unknown' providers (e.g. crowd-sourced data from the SDK code inserted into third-party apps) crowdsourcing their data ad-hoc?

kpwebb commented 9 years ago

@NeilTaylor1982 one of the places I've heard where this vehicle type issue matters is London with its Bus+Taxi only travel lanes. I'm curious, do you know of any studies showing the impact of travel time in aggregate provided by separated lanes?

Just curious if we already know just how much faster Taxis are in London (for e.g) than typical traffic flows.

abyrd commented 9 years ago

Alternative to LR or OSM IDs

We mentioned two options for associating speed data with locations. One is an ID space (probably OSM ways between two OSM nodes) and the other is OpenLR location referencing.

I just read some of the OpenLR whitepaper (http://www.openlr.org/data/docs/OpenLR-Whitepaper_v1.5.pdf), specifically definitions and binary format. While the obvious problem with OSM is IDs shifting or disappearing, OpenLR also seems sensitive to both ends having very similar routing graphs, and seems to trade quite a bit of complexity for a moderate improvement in compactness.

My first reflex on how to implement this is slightly different, though a lot of the same ingredients appear in the white paper: I would simply key all the stored observations on the geometry of the road segment itself.

By placing lower and upper bounds on how close together and far apart points are allowed to be along a segment and storing all points as deltas, each point could be stored in two bytes (one for Dx, one for Dy in some multiple of meters). Maybe less if you get fancy and do Rice coding etc.

If you are breaking everything up into tiles, say with each segment assigned to the tile in which its first waypoint falls, even the first point can be coded as a delta from the tile corner or tile center using a variable-byte system.

If you use a deterministic algorithm to split OSM ways up into road segments, these keys will rarely change, but when they do it's no big deal. Some speeds will will start being recorded with different IDs but they can still be geometrically matched.

We're talking about something like 14 bytes for a 500m road segment ID. This is comparable to what it would cost to use an OSM way ID sliced with two OSM node IDs. Maybe add a few more bits for road-type to facilitate matching.

Just to be clear, I'm not talking about storing the sequence of points in the original GPS trace. I'm talking about converting that sequence to a speed along a segment of an OSM way between intersections, then stashing that speed with a key that is just a coded form of the OSM way segment's geometry.

laurentg commented 9 years ago

I think the solution depends on the context of use: internal ID vs an exchange format with the external world. In the following I'm only talking about an internal ID system.

OpenLR has been designed to be a "map agnostic" export format (at least regarding IDs) and try to solve several things at once (it spans several layers from low to high-level description), althought sometimes not very cleanly.

It is a replacement / extension for the TMC format, and it's main use case is exchanging traffic data with memory/bandwith constrained navigation devices.
The protocol is rather complex for what it does, make many assumptions, and is not really expandable. It rely a lot on delta encoding, bit packing, etc...
The use of shortest paths has been designed initially to optimize encoding size, and is probably not very useful for internal referencing. This also has the side effect of not guaranteeing a unique reference for a same atomic element/edge (two distinct shortest paths can use common edges), detecting duplicates is probably rather hard. Ensuring ID uniqueness would probably mean dropping the shortest path feature of OpenLR.

Regarding OSM ID-based internal referencing (for example way/from-node/to-node IDs), the main issue is about backward compatibility across OSM data change. Maybe we should experiment to see what changes are really breaking references (way splitting, topology-nodes shift, ...) and how they could be mitigated (keeping an index of valid IDs alongside OSM data timestamps, migrating existing IDs...). One advantage of using from-to node IDs is that the topological information is implicit: even a simple database query can do a limited graph search for surroundings edges (could be useful for example for implementing the shortest path search when generating OpenLR data). Also, for OSM users, querying the relevant information is straight-forward.

Regarding geometry-based referencing, I'm not sure how this would solve the issues raised above (using OSM ID-space). A change in OSM data that impact OSM IDs would probably be also impacting the exact geometry and thus also break references. Also, by skipping topological information in the referencing system, we have the risk of confusing similar overlapping geometries (I'm thinking of bridges, tunnels, 2 parallels ways shifted by a few meters...). All this would be hard to migrate properly without error. By using OSM-ID referencing, we could use heuristics based on the way/node properties (name, way type, direction...) for migrating. Migrating could be a slow process that would need some off-line processing; AFAIU implicit geometry rely on "on-the-fly" migration/searches which could be costly in term of processing power.

One solution could be to mix the two latter approach: use OSM-IDs (of the "current" version of the OSM data against which the server works), and index all current IDs with a list of previous IDs, alongside the geometry (packed), a timestamp of the OSM data it was based on. We could also add a "matching factor" computed during the migration process: for example an almost exact match (way properties, geometry...) would have a factor near 1, and a more fuzzy match (way type modified, geometry moved, etc...) would have a lower factor, near 0. That way, when using historical data, one could easily use previously "partially invalid" data by weighting them with this factor.

For external data exchange, OpenLR is probably a good choice, at least among others, but the decision of picking which format to export real-time/historical data has probably few design impact so maybe could be left for later on.

abyrd commented 9 years ago

@laurentg absolutely agreed, internal storage need not be too closely coupled to external exchange format. My concerns about OpenLR were similar to those you outline above, and I appreciate the added context you gave on that format. I struggle to see much of an advantage in OpenLR-style shortest path representation in our case.

I agree that we need to carefully consider how we could mitigate OSM ID change problems. Perhaps it is possible, but even if we were able to migrate all stored observations to current IDs, that operation would not be instantaneous and we'd always be confronted with the questions: Which version of OSM is the requestor referring to? Which version is our speed database referring to?

Regarding geometries as keys, I am in fact suggesting that requests be formulated in terms of geometries (which may themselves be lifted from current OSM ways) and matched to stored observations on the fly. The datastore would be one big collection of geometry fragments with times and speeds attached to them. As observations get old, they could be merged into historical averages. With proper use of bounding boxes and indexes this may not be a particularly heavyweight operation. I say "suggesting" and not "advocating", because we'd obviously need to evaluate the continuous computational load this implies, but it might turn out to be very tolerable, a question of continuous low-level load versus occasional stop-the-world computation.

abyrd commented 9 years ago

About losing topology information: the geometries would have a sense of direction, which would be used in matching. This would usually disambiguate crossing roads, but not tunnels that are co-linear with surface roads (there are quite a few of those in Brussels). But consider our data sources (moving vehicles). We won't actually know which of the roads they were on either!

laurentg commented 9 years ago

Topological information can be useful when resolving a GPS trace to the proper segments: you can use the fact that the vehicle at the source of the trace has to follow a valid path, so for a trace that partially cover two overlapping segments the complete context can usually help distinguish which segment is used. Of course for this you need more complete or longer traces. I'm thinking of complex highway exchanges for example.

For fixed-point speed sources (induction loops, camera, etc...) this trace information is not available, but as the source can be situated at a place where the location only is not enough to properly distinguish between different streets (briges, tunnels, two ways direction), the source may have to be properly resolved to a "segment" by some static "configuration" (for a two ways street that is even necessary if you do not include a direction vector in the input).

Just as remark: If we store un-resolved geometrical information, I do not see really the difference with simply storing only a set of GPS trace? The "traffic engine" would not do much then?

Maybe data storage could be split in two: one "un-resolved low-level" (GPS trace only) with only geographic and time information; a second "resolved high-level" with appropriate mapping (OSM ID with potential versionning), and eventually appropriate averaging (spatial and time).

Anyway, as for your question of which version of OSM is a data referring to, we have to resolve this question at one point or another: a GPS trace is valid only for a "reality" which can change over time. OSM mapping is supposed to represent the reality in "real-time". For some changes (for example new traffic lights, speed bumps, speed limits, new/removed lanes, etc...) the speed profile would become invalid, even though the geometry would stays valid. To properly handle this, resolving to some OSM "model" may be necessary.

NeilTaylor1982 commented 9 years ago

@kpwebb. The London case you cite is a good example, but unfortunately I'm not aware of any studies on these average speed differentials in this context. We regularly experience a similar issue when trying to determine sample bias from 'floating vehicle' data collected through surveys in emerging cities. The main reason (as you may remember from our work in Cebu) is that there are often significant differences between informal public transit services that stop frequently, taxis that stop less often and travel much more quickly, and private car/motorcycles which are invariably somewhere in between. This is context specific though, and in Cebu I think we found the taxi speeds were ~20% faster (both in free-flow and peak periods) than for the smaller sample of mixed vehicle fleet observations we had previously recorded.

As a rule it is really useful to be able to strip public transport modes out from others, in order to understand whether variance between different motorised modes' average flow speeds/journey times is likely to be related to pubic transport operations (stopping / layover) in specific locations or other possible factors (lack of priority, highway constraints, congestion affecting all modes).

opentraffic / architecture