WIP: handling viterbi breaks as multiple sequences

kodonnell commented 7 years ago

Objective

If the viterbi sequence breaks (e.g. no candidates found for a given GPX entry, or no transitions between candidates) then instead of throwing an exception, split into a new sequence. As far as I'm aware there's currently no easy way for the user to do this manually, let alone easily.

Questions

is this sensible behaviour for the user?
should we return a single MapMatch result (with a single time/distance) - as it implemented in this first commit - or multiple MapMatch results (each within their own time/distance). I prefer the latter (with a reason why each break occurred).
specific comments below
anything else?

Cc @karussell @stefanholder

stefanholder commented 7 years ago

is this sensible behaviour for the user?

Yes, this makes sense.

should we return a single MapMatch result (with a single time/distance) - as it implemented in this first commit - or multiple MapMatch results (each within their own time/distance). I prefer the latter (with a reason why each break occurred).

I think that MatchResult should contain the matched GPS position for each original GPS position, as discussed above. Then each such entry could also contain a field indicating if/why an HMM break occurred. I think this is easier to handle by the user than multiple map match results. The user might not even be interested in HMM breaks and getting a single MapMatch result is then easier to handle.

kodonnell commented 7 years ago

I think that MatchResult should contain the matched GPS position for each original GPS position, as discussed above. Then each such entry could also contain a field indicating if/why an HMM break occurred.

Ah, that's a good idea. So, to clarify, we basically want one MapMatch result which includes the final route, as well as all the original GPX entries. Each entry can be inspected to know:

the edge (of the route) that it is on (if any)
some sort of status flag - 'start of sequence', 'single candidate', 'not used in map matching as too close to previous candidate', etc.

I'd suggest we also add a more general status flag to the MapMatch result e.g. 'contains sequence breaks', so users can easily check for this. And maybe we can warn the user too. (Otherwise if it's used blindly, we may end up with lots of issues submitted asking why there is a 'break' in the matched route!)

Anyway, with this, we should be able to describe all situations. However, as you say, most users will probably only be interested in the final route.

@stefanholder - does that sound good to you?

stefanholder commented 7 years ago

@stefanholder - does that sound good to you?

Yes, sounds good.

I would put it like this: MatchResult should contains the final route as well as a list of match entries.

Each match entry contains:

the original GPX entry
the map matched lat/lon
the edge (of the route) that it is on (if any) and the offset from the start of the edge (or alternatively the fraction as offset divided by edge length)
HMM break status: 'no break', 'no candidates', 'no transitions to this entry'
'used for map matching' or 'not used in map matching as too close to previous candidate'

kodonnell commented 7 years ago

I would put it like this:

Great, sounds good. I'll look over your other PR first, and then have a play with all of the comments above.

kodonnell commented 7 years ago

You can ignore the code for now (it's still unfinished), but I'd like some feedback on progress thus far. Specifically:

I realised I didn't want to directly add in the GPX entries not used for map matching. Why? Because if the user e.g. loops through all of the GPX entires (including those not used for map-matching) they resulting sequence may look a lot more erratic and behave unexpectedly. (E.g. if I go from A -> B, when I add in the ignored points, it might be 'go toward B, then back toward A, then toward B ...'.) I think it makes more sense to think of clustering - i.e. points that are close to each other are the 'same'. For this reason, I do not snap the GPX entries (that aren't used) to the final route - but merely associate them with their 'cluster'. This means we avoid the erratic nature, but also that we can find where any of the original GPX entries is in the final route (up to the GPS error).
- an alternative option is to leave it as is ('ignored' GPX entries are ignored from the result), but allow the user to change the filtering to prevent e.g. any filtering at all. This is simpler, but will slow down the matching.
I think the MatchSequence does most of what @stefanholder suggested above (though maybe not tidily yet). I provide a MatchResult.originalGPXMapping which allows the user to map (by index) a given original GPX entry to the matched one - and this contains information like the matched edge, how far along, the snapped point, the to/from time, etc. Alternatively one can traverse the matches directly.
- a tidier (API and code) approach would be to add properties directly to the original GPXEntry. It means not using the GH GPXEntry (just extending it) but this probably isn't a big deal.
when we have sequence breaks, should I fill the break with an 'empty' sequence break, to keep the times contiguous? I.e. that way a user should always find a sequence at a given time.

The main changes remaining are

[x] refactoring to get it nice and clean (I'll wait a few days and come back with a fresh view)
[x] make sure the to/from times are contiguous.
[x] trivial-ish refactoring of tests, etc.
[ ] sequence-specific tests.
[ ] refactoring the web GUI to handle sequences.
[ ] update all doco

stefanholder commented 7 years ago

an alternative option is to leave it as is ('ignored' GPX entries are ignored from the result), but allow the user to change the filtering to prevent e.g. any filtering at all. This is simpler, but will slow down the matching.

I would prefer this alternative because it's simpler and I think the user should have the option to disable the filtering.

a tidier (API and code) approach would be to add properties directly to the original GPXEntry

I think it's better to have a separate class (GPXMapping) for the matched GPX entry.

when we have sequence breaks, should I fill the break with an 'empty' sequence break, to keep the times contiguous?

If there are no candidates for a GPX entry then GPXMapping.matchEntry should be null (not sure if this is what you mean).

kodonnell commented 7 years ago

Thanks @stefanholder - I will follow your first two preferences.

If there are no candidates for a GPX entry then GPXMapping.matchEntry should be null (not sure if this is what you mean).

More if e.g. we have a sequence A->C then an HMM break so start a new sequence D->F. In this case, there's no (explicit) information about what happened between C and D - we could leave it to the user to interpret, or I could add an 'unknown'/'empty' sequence C->D which really only contains the information about the start/end points (including time) but e.g. the route between them will be 'unknown'. Does that make sense?

kodonnell commented 7 years ago

To keep PR from getting even bigger, I created #92 for allowing customisable filtering. Otherwise, this PR is stabilising somewhat. I've renamed a lot of the entities - the names were misleading, and were confusing me (let alone new developers). E.g. GPXExtension -> Candidate, TimeStep -> ViterbiMatchEntry. I've updated the list above with some more tasks remaining too.

should I fill the break with an 'empty' sequence break

For now I'll leave this out - it's non-essential, and isn't too hard to add later if it's desired.

stefanholder commented 7 years ago

More if e.g. we have a sequence A->C then an HMM break so start a new sequence D->F.

If an HMM break happens because there is no transition between C and D then we can set the HMM break status for the GPXMapping of D to 'no transitions to this entry', which is quite simple to do.

kodonnell commented 7 years ago

If an HMM break happens because there is no transition between C and D then we can set the HMM break status for the GPXMapping of D to 'no transitions to this entry', which is quite simple to do.

That's already done. To rephrase another way - say time at C is 10am and time at D is 11am. If the user currently goes to find 'what was the state at 10:30am', they could get confused, as they won't find anything (due to the sequence break). If we added a 'between break' sequence C->D, then they'd find that at 10:30am it was in an between-break state. In other words, should we explicitly or implicitly identify the presence of a sequence break? Put another way, each sequence should be contiguous in time, but the sequence of sequences is not, and maybe it should be.

kodonnell commented 7 years ago

OK, finally got back to this. @karussell / @stefanholder, I think the map-matching logic/API is reasonably stable, and the only remaining work is the web stuff + doco + tests - which I'll get to if you both approve of the logic/API. Quick summary of some of the major changes (which should all be in code doco):

wrap GPXEntry's as MatchEntry's to include match info. Allows one to easily see how the input GPX was used in the matching, and where it was matched to.
MatchEntry's are wrapped as ViterbiMatchEntry's to allow their use with hmm-lib.
MatchResult holds a sequence of MatchSequences which holds a sequence of MatchEdges.
Temporal gaps between sequences are filled with 'UNKNOWN' sequences (as discussed above).
LocationIndexMatch logic was moved to ViterbiMatchEntry.findCandidateLocations as it seemed to make more sense there. This requires LocationIndexTree.XFirstSearchCheck being public not protected.
Removed the logging code as I didn't want to spend time refactoring that while I was developing. Can add it back in if desired.

stefanholder commented 7 years ago

Removed the logging code as I didn't want to spend time refactoring that while I was developing. Can add it back in if desired.

I think that the logging code is desired, at least until we have ways to visualize candidates and matched paths. I strongly relied on the logging in PR #81 and PR #91 and I think that it will be useful for other map matching improvements. Maybe it's a good idea to put the logging code into separate methods to reduce the noise in the core code.

stefanholder commented 7 years ago

Seems that you need to merge master into your branch because PR #91 has been pushed into master in the meantime.

kodonnell commented 7 years ago

logging code

I'm with you - it can make debugging easier, but it can also make the core harder to develop (at least with bigger changes like this). As a side note, useful logging is often specific to the feature/bug worked on at the time, and isn't generic. I don't know what the best approach is - I'm happy to add it in, or leave it out, or refactor it into other methods (though I'm not sure how to do that).

need to merge master

Will do. I'd prefer to do the merge once I know this PR is likely to be included.

stefanholder commented 7 years ago

I'm happy to add it in, or leave it out, or refactor it into other methods (though I'm not sure how to do that).

I would then suggest that you add the logging in again and I will refactor this into separate methods in another PR. This should also make future refactoring of the map matching code easier.

kodonnell commented 7 years ago

Cool - logging added back in.

kodonnell commented 7 years ago

Sorry, I was waiting on you both but had forgotten to merge master. How does it look?

karussell commented 7 years ago

Is this still a WIP? I've not checked it closely yet, maybe @stefanholder did? The performance implications and also the renaming should be discussed (necessary? better names?)

I've also created this here and can fix if this is required https://github.com/graphhopper/graphhopper/issues/971

kodonnell commented 7 years ago

Is this still a WIP?

Yes. The actual map-matching logic is (nearly) ready for review, but there are still other tasks as per here. I'm happy to polish off the rest if there's some indication this PR is 'good'.

The performance implications ...

key	current master branch	current sequences branch
location_index_match.max	6.660339	6.278257
location_index_match.mean	0.2038874168	0.217992223999999
location_index_match.min	0.001316	0.001381
location_index_match.sum	1019.437084	1089.96112
map_match.max	1554.185032	1651.807253
map_match.mean	242.08705704	243.51960916
map_match.min	3.57688	3.188165
map_match.sum	24208.705704	24351.960916
measurement.count	5000	5000
measurement.seed	123	123
measurement.time	32932	33115
measurement.totalMB	1524	1460
measurement.usedMB	42	35

Those are from the measurement class, but still not completely robust (e.g. might be off due to competing resource, so really should be run repeatedly etc.) but give a reasonable indication that nothing's really changed (which is as I'd expect).

the renaming should be discussed (necessary? better names?)

Very open to feedback on that. I've tried to use names that represent their purpose, to make it easier for others to understand the code base.

stefanholder commented 7 years ago

I've not checked it closely yet, maybe @stefanholder did?

No, I didn't look at it closely yet either.

kodonnell commented 7 years ago

What about the remaining non-checked items?

They're yet to be completed. Are you happy with the sequence-specific stuff of this PR (logic and API)? If so, I'll work on those additional items.

stefanholder commented 7 years ago

Are you happy with the sequence-specific stuff of this PR (logic and API)?

I'm happy with it. But since the API has changed a lot, I think it's important that @karussell also has a look.

kodonnell commented 7 years ago

Great. I'll review your other comments, and work on the remaining items. When that's done - yes, I think it'd be good to have @karussell review.

kodonnell commented 7 years ago

Wow, getting the GUI to handle multiple paths was much more painful than I thought. Anyway:

That's using this gpx. I also had to tweak locationIndexTree.findNClosest to make it not always return a result (within tiles but outside radius) - otherwise I couldn't break it into sequences. (See here.)

Yes!!

As mentioned, the GUI thus far is hacky (and not fully-functioning) - I almost think it'd be better to rewrite from scratch than to continue hacking around it. However, then I'd risk making breaking changes (i.e. features I don't understand) ... so I'm not sure.

The unit tests are probably more important - I discovered some bugs even just testing this.

We're starting to build our house, so this is becoming less of a priority for me. I might still do it, but my I shouldn't make promises I can't keep ...

karussell commented 7 years ago

Cool, thanks a lot :) !

We're starting to build our house, so this is becoming less of a priority for me

Sure :) !

INRIX-Trang-Nguyen commented 6 years ago

Did these changes make into the latest code base? We're finding very few matches when inspecting GPXExensions of match result. We have hundreds of input GPX entries but GPXExtension contains in many cases only a single entry.

kodonnell commented 6 years ago

@INRIX-Trang-Nguyen not that I know of. I'm closing this as I have no intention of actively working on it.

graphhopper / map-matching

WIP: handling viterbi breaks as multiple sequences #87

Objective

Questions