I updated the scoring metric to count each unmatched segment as a miss, even if it's already been missed (e.g. if the route loops around the block). The results look similar to previous findings but more closely mimic the phenomenon identified in the Newson and Krumm paper, where higher sampling rates produce poorer results at high levels of noise. This trend inverts around 40-60 m of noise.
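To make the change concrete, here's a minimal sketch of the updated counting logic (the names and structure are hypothetical, not the actual reporter code):

```python
# Hypothetical sketch: every unmatched segment in the ground-truth route
# counts as a miss each time it is traversed, so a route that loops around
# the block can miss the same segment twice.
def count_misses(true_route_segments, matched_segments):
    """true_route_segments: ordered list of segment IDs traversed by the
    ground-truth route, duplicates preserved; matched_segments: iterable
    of segment IDs recovered by the map matcher."""
    matched = set(matched_segments)
    # Note: not len(set(true_route_segments) - matched), because repeated
    # traversals of an unmatched segment must each count as a miss.
    return sum(1 for seg in true_route_segments if seg not in matched)
```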
@mxndrwgrdnr this looks great.
How are things looking re: measuring in terms of matched distance, rather than just segment counts?
I'm increasingly convinced that the failure case we're looking for is when a mostly good GPS trace falls apart temporarily (signal loss, etc.) and the match jumps way off. Does the matched distance spike in those cases as the matcher tries to find a realistic match?
That raises two things about metrics: 1) the distance of the matched trace matters, and 2) are there ways to think about GPS perturbations that don't mess up the whole trace, but rather degrade it periodically? (We'd need to think about what GPS failure modes look like, but it may be possible to get an idea from real-world traces.)
@kpwebb Distance-traveled and, relatedly, speed comparisons are still on the to-do list. I'm holding off until we've actually tuned the map-matching HMM using the segment-match-based metric, which should happen at some point next week. It's also worth keeping in mind that the speed- and distance-based metrics will be significantly affected by the inclusion of time in the HMM, which is still on the docket. Any distance/speed-based scoring generated now won't necessarily reflect the performance of the finished product, although it will give us a good idea of where we're starting from. In any event, I'll have something for you to look at next week.
In the meantime I will keep thinking about the different failure modes of GPS as I agree that's a good way of producing more realistic traces.
Might need to pass reporter-generated segments back to Valhalla's trace_attributes endpoint in order to do the length/distance-traveled comparison. The code already does this for the sake of route visualizations, but I'm currently not saving the rest of the output, which we'd need in order to compare the relevant attributes.
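For reference, a request against the trace_attributes endpoint might look like the sketch below (assuming a local Valhalla server on localhost:8002; the function name and wrapper are mine, not the repo's code):

```python
import requests

def get_matched_length_km(coords, costing="auto"):
    """coords: list of (lat, lon) tuples from a GPS trace. Returns the total
    matched edge length so it can be compared against distance traveled."""
    payload = {
        "shape": [{"lat": lat, "lon": lon} for lat, lon in coords],
        "costing": costing,
        "shape_match": "map_snap",  # let Valhalla map-match the raw shape
    }
    resp = requests.post("http://localhost:8002/trace_attributes", json=payload)
    resp.raise_for_status()
    # Each returned edge carries a "length" attribute (kilometers by default),
    # along with the other per-edge attributes we'd want to start saving.
    return sum(edge.get("length", 0.0) for edge in resp.json().get("edges", []))
```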
Implemented a distance-traveled-based scoring metric following the method used in the Newson and Krumm paper.
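For anyone following along, the metric is essentially the route-mismatch fraction from Newson & Krumm: the route length erroneously subtracted plus the length erroneously added, divided by the length of the correct route. A sketch (hypothetical names, not the repo's exact code):

```python
def distance_error(true_segs, matched_segs, seg_lengths):
    """Newson & Krumm style route-mismatch fraction.
    true_segs, matched_segs: sets of segment IDs for the ground-truth and
    map-matched routes; seg_lengths: dict of segment ID -> length in meters."""
    d_minus = sum(seg_lengths[s] for s in true_segs - matched_segs)  # undermatched
    d_plus = sum(seg_lengths[s] for s in matched_segs - true_segs)   # overmatched
    return (d_minus + d_plus) / sum(seg_lengths[s] for s in true_segs)
```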
The results are a near mirror-image of the segment-based matching:
All distance-based metrics:
The top row of plots comprises composite metrics of both under- and overmatches (i.e. false negatives and false positives). The left column shows count-based scores and the right column distance-based ones. They all track each other nicely, at least in the test region (the San Francisco Bay Area).
One pattern that sticks out to me is that the undermatches appear to be more sensitive to sample rate at lower noise levels, while the overmatches exhibit greater differentiation at higher noise levels. Also of note: the inversion in match quality mentioned above, whereby higher sample rates produce worse matches, is more pronounced for undermatches (false negatives).
I've been exploring different metrics for speed-based matching and I think I've arrived at a useful result. The graph below shows two CDF curves, one for correctly matched segments (red) and one for incorrectly matched segments (blue), of the % error of GPS-derived speed relative to OSM speed ((GPS speed - OSM speed) / OSM speed). The results suggest a definitive breakpoint for a threshold above which we could throw out the most erroneous matches while retaining the most correct ones.

In your post above, @kpwebb, you suggested 2x as a threshold, and the graph certainly supports the notion that any derived/measured/observed speed above 2x the OSM speed is going to be a true negative. However, it also suggests that we'd still be keeping a ton of false positives at that threshold (about 70% of them). At least for this region, the SF Bay Area, we could drop the threshold to 37% above the OSM speed, which would let us retain > 90% of our good matches while discarding 60% of the false positives. Even more conservative would be a threshold around 15%, which would retain almost 80% of true positives while rejecting over 70% of the false positives.

It's worth noting, too, that this plot represents results from simulated GPS data at all sample rates and all noise levels. The threshold could easily be customized depending on sample rate and expected positional accuracy. My next move will be to see how the threshold might vary along those lines, and to compare the trend across different regions.
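The threshold trade-off described above can be read straight off the two CDFs. A sketch of the calculation (the inputs are hypothetical arrays of speed percent errors for correct and incorrect matches):

```python
import numpy as np

def retention_at_threshold(pct_err_good, pct_err_bad, threshold):
    """Share of true positives retained and false positives discarded when
    rejecting any match whose speed error exceeds `threshold` (e.g. 0.37
    for 37% above the OSM speed)."""
    tp_retained = float(np.mean(np.asarray(pct_err_good) <= threshold))
    fp_discarded = float(np.mean(np.asarray(pct_err_bad) > threshold))
    return tp_retained, fp_discarded

# Per the figures above, retention_at_threshold(good, bad, 0.37) should come
# out near (0.9, 0.6) for the SF Bay Area test data.
```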
💯 great work!
Distance-based QA metrics are available as a Python function here. Speed-based metrics are here. And the wrapper function that iterates over a number of routes and performs the calculations is here. Functions for generating the metric plots as seen in the validation notebook can be found here. The plots themselves are featured in sections 3 and 6 of the validation notebook.
Based on conversations with @drewda and @dnesbitt61 I'm outlining a few potential top-level metrics to consider including in the QA rig.
1) GPS trace linear distance vs. matched segment length. This test can be used with real and synthetic traces to detect significant undermatching/overmatching of GPS points to segments. Can serve as a crude first-pass check (see the sketch after this list).
2) Incorrectly matched (overmatched) segments as % of total, by segment count and segment length.
3) Unmatched segments as % of total, by segment count and segment length.
4) Matched segments that exceed GPS trace speed by threshold x (e.g. where matched speed is 2x GPS trace speed), by segment count and by length. Can be used with real and synthetic traces.
5) Matched segments that are slower than GPS trace speed by threshold x (e.g. where matched speed is 1/2 GPS trace speed), by segment count and by length. Can be used with real and synthetic traces.
6) Mean and distribution of speed matches, by segment count and by length. Assuming synthetic traces have a constant speed, this can be used to measure variance in speed detection over the length of a trace.
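A crude implementation of metric 1 could be as simple as the sketch below (assuming the matched segment length has already been summed in meters; the haversine helper is standard, the function names are mine):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(p1, p2):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def distance_ratio(trace_points, matched_length_m):
    """Metric 1: matched segment length over the trace's linear distance.
    Ratios far from 1.0 flag significant under- or overmatching."""
    trace_len = sum(haversine_m(a, b) for a, b in zip(trace_points, trace_points[1:]))
    return matched_length_m / trace_len
```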