graphhopper / map-matching

The map matching functionality is now located in the main repository https://github.com/graphhopper/graphhopper#map-matching

suboptimal (?) transition probability metric #86

Open kodonnell opened 7 years ago

kodonnell commented 7 years ago

Currently, our transition probability is higher for transitions which are close to a straight line. I see a few problems with this:

- it's possible that the wrong transition is favoured. E.g. say I have a candidate which is 100m away, but in exactly a straight line. If I have another candidate which is only 10m away, but slightly off a straight line, then the former will actually be favoured.
- shouldn't we be using time, not distance (or at least allow the user to configure this)?
- I'm not sure we need to scale by the square of the time difference - if this scales all of the probabilities by the same factor, then won't this leave the final viterbi sequence unchanged?

That all said, I suspect most of these will only have noticeable differences when routing over larger distances (e.g. hundreds of meters).

Thoughts @stefanholder ?

stefanholder commented 7 years ago

it's possible that the wrong transition is favoured. E.g. say I have a candidate which is 100m away, but in exactly a straight line. If I have another candidate which is only 10m away, but slightly off a straight line, then the former will actually be favoured.

Not really, because the transition metric also uses the linear distance between both GPS positions (see also the discussion here and below). So the transition metric favors straight connections between both GPS coordinates. This makes sense because people usually don't make detours during a trip.
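
For anyone skimming this thread, here is a minimal, self-contained sketch of the kind of transition metric being discussed (Newson & Krumm style). The class name, the beta value and the numbers are illustrative, not the actual map-matching code:

```java
// Minimal sketch of the transition metric discussed above (illustrative names/values,
// not the actual map-matching code). The metric compares the straight-line distance
// between two GPS fixes with the length of the candidate route between them.
public class TransitionSketch {

    static final double BETA = 2.0; // exponential scale parameter, illustrative value

    static double transitionMetric(double linearDistance, double routeLength) {
        return Math.abs(linearDistance - routeLength);
    }

    // exponential probability density over the metric, in log space
    static double transitionLogProbability(double linearDistance, double routeLength) {
        return Math.log(1.0 / BETA) - transitionMetric(linearDistance, routeLength) / BETA;
    }

    public static void main(String[] args) {
        // a route close to the straight-line distance scores higher than a long detour
        System.out.println(transitionLogProbability(100, 105)); // near-straight connection
        System.out.println(transitionLogProbability(100, 300)); // detour
    }
}
```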

shouldn't we be using time, not distance (or at least allow the user to configure this)?

We are already computing fastest routes using fastest weighting instead of shortest routes. So in the map matched route, highways are actually preferred over side roads.

I'm not sure we need to scale by the square of the time difference - if this scales all of the probabilities by the same factor, then won't this leave the final viterbi sequence unchanged?

Only the transition metric is scaled by the square of the time difference, and this only affects the transition probability but not the emission probability. The intention was to compensate for the fact that the non-normalized transition metric is usually larger for larger time intervals between GPS coordinates, but in practice this is usually not needed. Hence, I would recommend to use the non-normalized metric instead.
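
For concreteness, a small sketch of the two variants described here, assuming that "scaled by the square of the time difference" means dividing the metric by dt^2 to compensate for larger time intervals; the names are made up for illustration:

```java
// Sketch of the two metric variants described above. Assumption: "scaled by the square of
// the time difference" means dividing the metric by dt^2, compensating for larger time
// intervals between GPS fixes. Names are illustrative, not the actual code.
class TransitionMetricVariants {

    static double nonNormalized(double linearDistance, double routeLength) {
        return Math.abs(linearDistance - routeLength);
    }

    static double timeNormalized(double linearDistance, double routeLength, double dtSeconds) {
        return nonNormalized(linearDistance, routeLength) / (dtSeconds * dtSeconds);
    }
}
```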

kodonnell commented 7 years ago

So the transition metric favors straight connections between both GPS coordinates.

Doesn't that illustrate my point? I.e. it favours straight connections, not the shortest/fastest connection. I'd assume that (unless they're particularly concerned about their tyres) individuals take the fastest/shortest route without regard for how straight it is.

We are already computing fastest routes using fastest weighting instead of shortest routes. So in the map matched route, highways are actually preferred over side roads.

Correct - but we compute the transition using distance metrics not time. So, while for a given candidate pair we may (depending on user configuration) pick the fastest route (i.e. highway vs side road), when comparing different candidate pairs (i.e. the transition probability in viterbi), we only use distance (so e.g. a side road may be favoured over the faster highway route).

I would recommend to use the non-normalized metric instead.

OK - but let's continue the discussion above/below first.

As a proposal, what about this? The transition probability is: higher for shorter routes.

michaz commented 7 years ago

As a proposal, what about this? The transition probability is: higher for shorter routes.

Hmm, but that's an assumption about the behavior of people (a behavioral prior), whereas the current transition probability only makes an assumption about how movement constrained by a road network looks piecewise between observations (!) compared to straight lines, i.e. they are similar in length, and more so the less time has passed.

That's a rather big theoretical difference. I'm not saying it's a bad idea, I'm just saying it opens up new issues. The argument is the same as with U-turns: People do take unlikely routes, and there's no immediate reason why map-matching should be biased towards 'likely' routes in a behavioral sense. I think the current transition probability is designed to minimize this bias.

Maybe my vehicle is a snow plough and passes through all roads with uniform prior probability. Maybe I just like driving back and forth. Maybe I do take wrong turns and turn back a lot and want to see that on the map when I get home. Maybe I'm a fire truck and don't obey traffic rules.

michaz commented 7 years ago

Maybe what we are doing is wrong and we should completely admit all kinds of U-turns (as we agree, if it's only about distance, entering a link and going right back is continuous with not doing it, it's just epsilon away), and rather do some "behavioral smoothing" as a post-processing step?

Something like, if a piece of the final account of what happened (the route) contains very little information (accounts for very little of the matching error) compared to its complexity (turning into a new street and coming right back), we drop it.

michaz commented 7 years ago

(On second thought, maybe the "penalize turns, in general" idea which someone already had would do much the same but easier.)

michaz commented 7 years ago

I just wouldn't think of it as "turns are unlikely", but as "turn instructions in the result must pay rent in increased accuracy."

karussell commented 7 years ago

Maybe a general solution is already to let the user specify the costs (time?) for u-turns at measurements. Avoiding/allowing u-turns inside a route should be considered a separate problem.

I just wouldn't think of it as "turns are unlikely", but as "turn instructions in the result must pay rent in increased accuracy."

What do you mean here?

michaz commented 7 years ago

But why? What's the real-world reason why a U-turn at a measurement should be more or less likely than anywhere else? Having that parameter would look like an artifact of the algorithm/implementation. :-(

karussell commented 7 years ago

U-turns for simple A to B routes are often meaningless or should be forbidden, because e.g. turn restrictions could be tricked and it is not easy to decide whether a u-turn makes sense or is allowed on a road. (Or to explain it differently: we just don't have the necessary data from OSM to judge this well enough, IMO.)

And for measurements or 'via points' in general it is a lot more likely to have u-turns at those points, because it could be a 'not continuing' route, as with GPS traces of taxis picking up passengers and using the opposite direction for the next route.

If we had good rules (and all the necessary good street data!) for when to introduce U-turns in the general routing, then we could probably avoid this separate handling and could even derive the more exact place where the U-turn happened, instead of assuming it from the measurements (hopefully frequent enough).

kodonnell commented 7 years ago

Hmm, but that's an assumption about the behavior of people (a behavioral prior), whereas the current transition probability only makes an assumption about how movement constrained by a road network looks piecewise between observations (!) compared to straight lines, i.e. they are similar in length, and more so the less time has passed.

I agree that we make an assumption about people (and, as above, we could allow people to choose their behaviour e.g. fastest/shortest/straightest). However, as above, I think 'people travel the fastest route' is a much more reasonable assumption than 'people travel in straight lines'. My guess is that the original intention was to favour shorter routes, but as above, this can fail (as it prefers straighter routes over shorter ones). In addition, as above, 'the less time has passed' is clearly wrong - there is no direct link between the straightness of a path (as determined this way) and the time taken to travel it.

People do take unlikely routes, and there's no immediate reason why map-matching should be biased towards 'likely' routes in a behavioral sense. I think the current transition probability is designed to minimize this bias.

I believe our algorithm should give the closest result to reality, in as many cases as possible (optionally with some configuration to specify the case, e.g. "I'm in a car", or "I prefer shortest distance" or "I'm a hyper-miler", etc.). To say that something unlikely can happen with the same likelihood as something more likely is (obviously) not going to help. Otherwise, we might as well give up - if we try to allow unlikely cases (e.g. your example of people driving back and forth just so they can see it on a map), then we'll never get anywhere. Put another way, in this context the most unbiased solution would be to return the entire road network as a solution (i.e. 'you could have gone anywhere at any time'), or even the original GPX track (i.e. 'you travelled somewhere that gave you these GPX points'). That's clearly not useful to users. As to bias - we currently have a systematic bias toward unlikely events occurring more than in reality. That's, well, biased = )

rather do some "behavioral smoothing" as a post-processing step?

We are already doing some 'behavioural smoothing' - we assume that people travel 'sensibly' on roads. It's not explicitly called that in the code, but that's the intention of most of it. That said, our smoothing is only between candidates. We can try to add some behavioural smoothing across candidates (good work @stefanholder on U-turns), though we're limited in that we're currently a first order HMM solved by standard viterbi (i.e. we only know about the previous candidate). If e.g. we were second order (i.e. know two previous candidates) then we could more easily penalize (in a configurable way) U-turns and the like.

What's the real-world reason why a U-turn at a measurement should be more or less likely than anywhere else? Having that parameter would look like an artifact of the algorithm/implementation.

@karussell's already explained this, but to continue in my terminology - we already have a systematic bias toward allowing U-turns at measurements as opposed to within a route i.e. they are already an 'artifact of the algorithm/implementation'.

stefanholder commented 7 years ago

Doesn't that illustrate my point? I.e. it favours straight connections, not the shortest/fastest connection. I'd assume that (unless they're particularly concerned about their tyres) individuals take the fastest/shortest route without regard for how straight it is.

Favoring straight connections is equivalent to favoring shortest routes because the linear distance between A and B is a lower bound for the length of any path between A and B.

Correct - but we compute the transition using distance metrics not time. So, while for a given candidate pair we may (depending on user configuration) pick the fastest route (i.e. highway vs side road), when comparing different candidate pairs (i.e. the transition probability in viterbi), we only use distance (so e.g. a side road may be favoured over the faster highway route).

When comparing different candidate pairs, we are actually comparing the distances of the fastest routes between the candidate pairs. But I agree that this is somewhat inconsistent. If we have low frequency GPS data and timestamps are given for GPS measurements (which we should not assume in general), we could compute the absolute difference between Path.getTime() of the fastest path and the timestamp difference instead. For high frequency GPS points using shortest paths is sufficient.
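
A hedged sketch of that time-based alternative; routeTimeMillis stands in for the travel time of the fastest route (what GraphHopper's Path.getTime() would provide), and the class/method names are illustrative:

```java
// Sketch of the time-based transition metric proposed above. routeTimeMillis stands in for
// the travel time of the fastest route (what Path.getTime() would provide); the method name
// and units are illustrative.
class TimeBasedTransition {

    /** absolute difference between routed travel time and observed timestamp difference, in seconds */
    static double metric(long routeTimeMillis, long fromTimestampMillis, long toTimestampMillis) {
        long observedMillis = toTimestampMillis - fromTimestampMillis;
        return Math.abs(routeTimeMillis - observedMillis) / 1000.0;
    }
}
```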

Another important point is if/how to normalize the transition metric. In this online map matching paper, the transition metric is normalized with the path length. Another alternative would be to normalize with the linear distance of the candidate pair. Either way, normalizing by a distance instead of normalizing by a time has the advantage that we don't even need timestamps. Before implementing a new transition metric, we should verify its probability distribution empirically as done by Newson & Krumm (Figure 5).
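
As a sketch only, the two normalization options mentioned here could look roughly like this (which one models the real distribution better would need the empirical check described above):

```java
// Sketch of the two normalization options mentioned above (illustrative only).
class NormalizedTransitionMetrics {

    // normalized with the path length, as in the cited online map matching paper
    static double byPathLength(double linearDistance, double routeLength) {
        return Math.abs(linearDistance - routeLength) / routeLength;
    }

    // normalized with the linear distance of the candidate pair
    static double byLinearDistance(double linearDistance, double routeLength) {
        return Math.abs(linearDistance - routeLength) / linearDistance;
    }
}
```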

Maybe a general solution is already to let the user specify the costs (time?) for u-turns at measurements.

I agree.

Maybe what we are doing is wrong and we should completely admit all kinds of U-turns (as we agree, if it's only about distance, entering a link and going right back is continuous with not doing it, it's just epsilon away), and rather do some "behavioral smoothing" as a post-processing step?

I think that the (first-order) HMM is a powerful tool, which allows us to try out all kinds of emission/transition metrics and probability distributions. So far I don't see why we should use any additional processing steps.

though we're limited in that we're currently a first order HMM solved by standard viterbi (i.e. we only know about the previous candidate). If e.g. we were second order (i.e. know two previous candidates) then we could more easily penalize (in a configurable way) U-turns and the like.

IMO a first-order HMM is sufficient to penalize U-turns or other turns but I would be happy to be proven wrong (ideally with a sample GPX trace). Moreover, using directed candidates with a first-order HMM was rather easy to do. For me the complicated part was figuring out how to use the Graphhopper internals. Having said this, I would be happy for any contributions to https://github.com/bmwcarit/hmm-lib.

karussell commented 7 years ago

For me the complicated part was figuring out how to use the Graphhopper internals.

BTW: Thanks for doing the necessary work! And always appreciated to improve stuff or make it more clear :) !

kodonnell commented 7 years ago

Favoring straight connections is equivalent to favoring shortest routes because the linear distance between A and B is a lower bound for the length of any path between A and B.

I understand that, but it only applies when you're comparing distances between fixed end points A and B. In our case we have a fixed start point but multiple (candidate) end points. See my first bullet point above for an example.

If we have low frequency GPS data and timestamps are given for GPS measurements (which we should not assume in general), we could compute the absolute difference between Path.getTime() of the fastest path and the timestamp difference instead

Except then it may become unrealistic (without traffic data). E.g. if you're on a highway and there's a traffic jam (and hence the time difference is large), it might prefer a longer side road. Again, I still feel the 'safest' is 'favour the fastest candidate pair'. This does fail (as sometimes people take slower routes) but I suspect less so than others. Of course, #89 would help.

As an aside - are we trying to cover the case of no timestamp info? If so, it seems to me it should need a different algorithm - adding timestamp info adds more information, so we should be able to do better than without it (and hence we need different approaches).

Another important point is if/how to normalize the transition metric

I'm still not fully understanding why we need a normalization - is this so that transitions probabilities are always comparable to emission probabilities, regardless of path length? (E.g. if transition was just exp(-a*pathDistance) then for long distances the transition probabilities would effectively be ignored, and it'd only be based on emission probabilities.)

Is it possible to normalize such that all transition probabilities (for a given to/from step pair) sum to one? If we also do the same with emission probabilities, then it might be OK.

So far I don't see why we should use any additional processing steps.

Agreed.

IMO a first-order HMM is sufficient to penalize U-turns or other turns but I would be happy to be proven wrong (ideally with a sample GPX trace)

I still can't figure it out, so I'll have a look at your code, and see if that makes it clearer.

Finally - does it make more sense to put this aside and work on #89 first? Otherwise all we have are our opinions (biased by our use cases etc.). In addition, other work is being done (e.g. #88) which has changed the transition metric anyway.

stefanholder commented 7 years ago

In our case we have a fixed start point but multiple (candidate) end points. See my first bullet point above for an example.

The non-normalized transition metric is Math.abs(linearDistance - routeLength), where linearDistance is the linear distance between the two GPS coordinates and routeLength is the length of the route between the two candidates (note that the linearDistance is computed outside both for loops).

Assuming that linearDistance < routeLength, which is usually the case, then for any two candidate pairs the one with shorter route length will have a higher transition probability. The same also holds for the normalized transition metric. This means that in your example the candidate that is only 10m away would be favored.
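
A quick numeric check of that claim, with made-up numbers:

```java
// Quick numeric check of the claim above (made-up numbers): with the linear distance fixed
// for a given pair of GPS fixes and linearDistance < routeLength, a shorter candidate route
// yields a smaller metric and hence a higher transition probability.
public class ShorterRouteWins {
    public static void main(String[] args) {
        double linearDistance = 500;                 // metres between the two GPS fixes
        double directRoute = 520, detourRoute = 700; // lengths of two candidate routes
        System.out.println(Math.abs(linearDistance - directRoute)); // 20.0  -> higher probability
        System.out.println(Math.abs(linearDistance - detourRoute)); // 200.0 -> lower probability
    }
}
```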

Except then it may become unrealistic (without traffic data). E.g. if you're on a highway and there's a traffic jam (and hence the time difference is large), it might prefer a longer side road. Again, I still feel the 'safest' is 'favour the fastest candidate pair'.

Not sure if I get your point, but just using the time of the fastest route or the length of the shortest route might also work.

As an aside - are we trying to cover the case of no timestamp info? If so, it seems to me it should need a different algorithm - adding timestamp info adds more information, so we should be able to do better than without it (and hence we need different approaches).

I think that the user should be able to specify whether timestamps should be used. In some cases there are just no (correct) timestamps available. The non-normalized metric (see above) actually does not require any timestamps. I don't think that we need a totally different approach if timestamps are available; we can then just perform some additional validity checks. If there are ideas for totally different approaches with timestamps, we can try them out with #89.

I'm still not fully understanding why we need a normalization - is this so that transitions probabilities are always comparable to emission probabilities, regardless of path length?

Yes but I would say regardless of the time interval between GPX entries.

Is it possible to normalize such that all transition probabilities (for a given to/from step pair) sum to one?

The problem I see with this is that transition probabilities would be higher for time steps with few transitions and lower for time steps with many transitions. This is bad because transition probabilities of different time steps also need to be balanced.

Finally - does it make more sense to put this aside and work on #89 first? Otherwise all we have are our opinions (biased by our use cases etc.)

Agreed but it was also good to do some brainstorming on possible transition metrics.

kodonnell commented 7 years ago

note that the linearDistance is computed outside both for loops

Ah, that's the key point I missed - for my use case I've had to use snapped coordinates (for other reasons), so it was my fault all along! I now (finally) see that it's effectively favouring the spatially shortest route between events. Thanks for taking the time to get it through to me.

To your point above about time - yes, we could use that, but we could also use spacetime distance. So we can give the user options, and use #89 as you say.

I don't think that we need a totally different approach if timestamps are available

My point is that, as a general rule, if you add data to a model, it gets better. So while we can have one model that works without timestamp data, I suspect a model specifically built to include timestamp data may be better.

but I would say regardless of the time interval between GPX entries

This is bad because transition probabilities of different time steps also need to be balanced

Are you saying it would give different results if e.g. the transition probabilities for A -> B were scaled by a factor of 100 (arbitrarily) vs those from B -> C? I.e. absolute size across different timesteps matters? I had assumed this wasn't the case.

kodonnell commented 7 years ago

As an aside, I'm having reasonable success with just using a scaled route distance (e.g. routeLength / 10000) as the transition metric.

To be fair, my use case is quite special - I've refactored to route using only real nodes (as mentioned here), and I generally have long separation between events (e.g. kilometres), and my linear distance is between candidates (not GPX tracks, as above). So it definitely won't apply in general. However, it's much less 'noisy' - I was seeing quite a lot of cases where e.g. a long route was preferred instead of the 'obvious' short one, or some silly back-and-forth ones, and they're largely gone.
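
For reference, a sketch of that simple metric (assuming it is still fed through the usual exponential distribution, so shorter routes get higher transition probability; the 10000 scale is the value quoted above):

```java
// Sketch of the simple metric described above: just the route length, scaled. No linear
// distance is involved. (Illustrative; assumes the metric is still fed into the usual
// exponential distribution, so shorter routes get higher transition probability.)
class ScaledRouteDistanceMetric {
    static double metric(double routeLength) {
        return routeLength / 10000.0;
    }
}
```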

stefanholder commented 7 years ago

Are you saying it would give different results if e.g. the transition probabilities for A -> B were scaled by a factor of 100 (arbitrarily) vs those from B -> C?

I was actually wrong. Transition and emission probabilities for individual time steps can be scaled arbitrarily without changing the Viterbi results. However, this also means that scaling emission/transition probabilities cannot be used to normalize the transition metric. So scaling/normalizing only makes sense for the transition metric itself but not for the transition probability.

Note that subtracting the linear distance from the route length is also a form of normalization and maybe this suffices.

my linear distance is between candidates

You wrote that you just used routeLength / 10000, so why is there linear distance involved?

kodonnell commented 7 years ago

However, this also means that scaling emission/transition probabilities cannot be used to normalize the transition metric. So scaling/normalizing only makes sense for the transition metric itself but not for the transition probability.

Sorry, I'm not following why we can't just normalize the transition probabilities (for each step), regardless of metric? Actually, should transition/emission probabilities be normalized (i.e. sum to 1) by definition? I'm not an expert, but e.g. the Wikipedia examples use this property. Anyway, the additional benefit of this is that we could (depending how we do it) remove sigma/beta parameters completely. E.g. transition probability for A -> B as

routeLength(A -> B) / sum_over_to_candidates_T(routeLength(A -> T))

And we can easily change routeLength -> routeWeight to allow general weightings. This seems to make a lot of sense to me, though I may be missing something key. And again, there's probably not much point trying until we've got #89.
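
A minimal sketch of that per-step normalization, with routeWeight standing in for routeLength or any other weighting (illustrative only, not a concrete proposal for the code):

```java
// Minimal sketch of the per-step normalization proposed above: divide each candidate's
// route weight by the sum over all "to" candidates so the values sum to one. routeWeight
// can stand in for routeLength or any other weighting. (Illustrative only.)
public class PerStepNormalization {

    static double[] normalize(double[] routeWeights) {
        double sum = 0;
        for (double w : routeWeights) {
            sum += w;
        }
        double[] normalized = new double[routeWeights.length];
        for (int i = 0; i < routeWeights.length; i++) {
            normalized[i] = routeWeights[i] / sum;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // route weights from candidate A to three "to" candidates
        System.out.println(java.util.Arrays.toString(normalize(new double[]{520, 700, 980})));
    }
}
```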

You wrote that you just used routeLength / 10000, so why is there linear distance involved?

There isn't - it was more that if I use the existing metric (and linear distance) it doesn't work as well as it does for other matching tasks (where linear distance is that between GPX points).

stefanholder commented 7 years ago

Sorry, I'm not following why we can't just normalize the transition probabilities (for each step), regardless of metric?

We could do this but this wouldn't change anything. The Viterbi algorithm finds the candidate sequence with highest probability, and the probability of a candidate sequence is defined as the product of all emission and transition probabilities on this sequence. If we multiply all transition probabilities for a given to/from time step pair by a normalizing constant, then the probability of each candidate sequence is multiplied by this normalizing constant and the order of all candidate sequences from highest to lowest probability stays the same. So the effect of this is similar to multiplying the numerator and denominator of a fraction by the same number.
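
A tiny numeric illustration of this argument (not the hmm-lib implementation): every candidate sequence picks exactly one transition from the scaled step, so all sequence scores are multiplied by the same constant and their ranking is unchanged.

```java
// Tiny numeric illustration of the argument above (not the hmm-lib implementation): scaling
// every transition probability of one time step by the same constant multiplies every
// candidate sequence's score by that constant, so the Viterbi ranking does not change.
public class PerStepScalingInvariance {

    static double sequenceScore(double emission, double transition) {
        return emission * transition; // score contribution of one time step
    }

    public static void main(String[] args) {
        double emissionA = 0.4, emissionB = 0.6;   // emission probabilities of two candidates
        double transToA = 0.010, transToB = 0.002; // transition probabilities into them
        double c = 100.0;                          // arbitrary per-step scaling constant

        boolean before = sequenceScore(emissionA, transToA) > sequenceScore(emissionB, transToB);
        boolean after = sequenceScore(emissionA, c * transToA) > sequenceScore(emissionB, c * transToB);
        System.out.println(before + " " + after); // true true -> same winner either way
    }
}
```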

Actually, should transition/emission probabilities be normalized (i.e. sum to 1) by definition?

We use the Gaussian/exponential distribution to convert the emission/transition metrics to probabilities. Since both these probability distributions have continuous random variables (the metrics), individual emission and transition probabilities (or more precisely probability densities) may even be greater than 1. Another way to see this is that we can have arbitrarily many candidates at each time step by making the search radius bigger and bigger.
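
A small sketch making this concrete: for small enough sigma/beta the densities at zero already exceed 1 (the values and units are arbitrary):

```java
// Small sketch of the point above: probability *densities* can exceed 1. The sigma/beta
// values are arbitrary and deliberately small; units are arbitrary as well.
public class DensitiesCanExceedOne {
    public static void main(String[] args) {
        double sigma = 0.2, beta = 0.1, x = 0.0;
        double gaussianDensity = Math.exp(-x * x / (2 * sigma * sigma)) / (Math.sqrt(2 * Math.PI) * sigma);
        double exponentialDensity = Math.exp(-x / beta) / beta;
        System.out.println(gaussianDensity);    // ~1.99
        System.out.println(exponentialDensity); // 10.0
    }
}
```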

I'm not an expert, but e.g. the Wikipedia examples use this property.

Not sure what you mean exactly but the Viterbi algorithm usually just uses maximum and product to compute the probability of the most likely sequence. In my Viterbi implementation I add log probabilities instead of multiplying probabilities for better numerical stability but this is just an implementation detail.
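
As a sketch of why the log-probability variant matters: products of many small probabilities underflow to zero, while sums of their logs stay finite and preserve the maximizing sequence because log is monotonic.

```java
// Sketch of the log-probability implementation detail mentioned above: products of many
// small probabilities underflow to 0.0, while sums of log probabilities stay finite and
// preserve the ordering of sequences (log is monotonic). Values are made up.
public class LogProbabilityDemo {
    public static void main(String[] args) {
        double product = 1.0, logSum = 0.0;
        for (int i = 0; i < 200; i++) {      // 200 time steps with small step probabilities
            double p = 1e-4;
            product *= p;                    // underflows to 0.0 long before the end
            logSum += Math.log(p);           // stays a perfectly usable finite number
        }
        System.out.println(product);         // 0.0
        System.out.println(logSum);          // ~ -1842.1
    }
}
```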

Anyway, the additional benefit of this is that we could (depending how we do it) remove sigma/beta parameters completely.

The purpose of the Gaussian and exponential distribution we use is not just to convert a metric into a normalized number but to model the real distribution of the emission and transition metric. In Newson & Krumm, Figure 5 you can see that the exponential distribution is indeed a good model for our transition metric.

kodonnell commented 7 years ago

We could do this but this wouldn't change anything.

What you describe is what I was hoping to hear = ) Doesn't it show that there's no need to normalize? That said, I think there is a need to normalize such that emission and transition probabilities both contribute materially to the result. (E.g. if emission were constantly near 1, and transition near 0.001, then the match could end up similar to standard waypoint graphhoppering.) Hence, if we do normalize, why not use something simple like the sum equals one? Then we can set a single parameter for deciding how much to favour emission versus transition (which currently isn't really possible).

In Newson & Krumm, Figure 5 you can see that the exponential distribution is indeed a good model for our transition metric.

I don't put too much weight in that. They've plotted a contrived variable against another contrived variable (which isn't normalized by time?), and have achieved an approximate alignment. That doesn't say much to me, especially as I'm guessing you can get very similar distributions many different ways. In addition, that only applies for their very specific data set, and one would really need to vary the data sets (and sampling rates) to decide this was best. Finally, we don't actually care about what the best model for transition/emission probabilities is - we care about what gives the best map-matching results. So, we're back to #89 - though discussing this sort of thing beforehand is good.

stefanholder commented 7 years ago

Hence, if we do normalize, why not use something simple like the sum equals one?

Depends on which sum you want to be equal to one. As I wrote before, making all transition probabilities of each to/from time step sum to one doesn't make sense as it doesn't change the final scoring of candidate sequences.

Then we can set a single parameter for deciding how much to favour emission versus transition (which currently isn't really possible).

Currently, only beta should be used for this. Sigma models the GPS noise and should be adjusted based on the GPS accuracy alone. Hence, we also have only one parameter for balancing emission vs. transition probabilities currently.

I still see your point that you want to have only one parameter left and you're welcome to show with #89 if/how this can be achieved.

kodonnell commented 7 years ago

As I wrote before, making all transition probabilities of each to/from time step sum to one doesn't make sense as it doesn't change the final scoring of candidate sequences.

That's what I mean = ) But we're repeating ourselves, so yes, let's wait until #89. @karussell is planning to merge map-matching into graphhopper soon, so I'll probably wait for that.