opentraffic / reporter

OTv2: distributed service that matches raw GPS probe data to OSMLR segments and sends anonymized speeds to Datastore
GNU Lesser General Public License v3.0

Emission Heuristics to Maximize Ability to Measure Variance in Duration #86

Closed kevinkreiser closed 7 years ago

kevinkreiser commented 7 years ago

Currently, if the reporter gets 5 observations for a given segment-next-segment pair, it averages them all together and reports that average when it's time to emit. This means we lose some of the ability to measure variance unless we get more observations for this pair later on in wall time (but for the same point in GPS time). So we don't want to just average all the measurements together. At the point when we go to emit these measurements, we'll want to group them in such a way that we can still measure variance without skewing the averages.

Say you have 5 observations for a given segment-next-segment pair. Your privacy setting is 2, which means you have enough data to emit these observations in some form. Today we average all of these into one measurement with a count of 5. But to preserve the ability to measure variance we should probably emit 2 measurements, one with a count of 2 and one with a count of 3. We need a heuristic to do that, though. Let's say the 5 observations have durations: 10, 12, 20, 25, 65

How do we group these observations so that we most accurately represent the data?
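
To make the question concrete, here is a minimal sketch of one possible heuristic. It is not what the reporter does today. It assumes we sort the durations, try every two-way split where both groups meet the privacy threshold, and keep the split with the smallest within-group squared error. The function name `split_for_emission` and the `privacy` parameter are hypothetical, made up for illustration.

```python
def _mean(values):
    """Arithmetic mean as a float."""
    return sum(values) / len(values)


def _sse(values):
    """Sum of squared deviations from the group's own mean."""
    m = _mean(values)
    return sum((v - m) ** 2 for v in values)


def split_for_emission(durations, privacy):
    """Hypothetical heuristic: return a list of (mean_duration, count) pairs.

    Every emitted group has at least `privacy` observations. When two such
    groups are possible, pick the contiguous split of the sorted durations
    that minimizes the total within-group squared error.
    """
    xs = sorted(durations)
    n = len(xs)
    if n < privacy:
        return []  # not enough observations to emit anything
    if n < 2 * privacy:
        return [(_mean(xs), n)]  # only one group can satisfy the threshold

    best_error, best_groups = None, None
    for i in range(privacy, n - privacy + 1):
        left, right = xs[:i], xs[i:]
        error = _sse(left) + _sse(right)
        if best_error is None or error < best_error:
            best_error = error
            best_groups = [(_mean(left), len(left)), (_mean(right), len(right))]
    return best_groups


groups = split_for_emission([10, 12, 20, 25, 65], privacy=2)
print(groups)
```

With the durations above, this picks the 3/2 split ([10, 12, 20] and [25, 65]) over the 2/3 split, and the count-weighted average of the two emitted means is still the overall average (26.4), so the averages are not skewed. Whether minimizing within-group error is the right objective is an open question; it is only one candidate.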

dnesbitt61 commented 7 years ago

Why not just emit a histogram? A vector of count/duration?

kevinkreiser commented 7 years ago

@dnesbitt61 because then non-anonymised data would leave the reporter, for example when the count is 1 for a given slot in the histogram
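
As a rough illustration of the concern, here is a small sketch that buckets the example durations from above into a histogram and checks which slots fall below the privacy threshold. The 10-second bucket width and the names `duration_histogram` and `slots_below_privacy` are assumptions for the example, not anything the reporter defines.

```python
from collections import Counter


def duration_histogram(durations, bucket_width):
    """Count observations per bucket, keyed by the bucket's lower bound."""
    return Counter((d // bucket_width) * bucket_width for d in durations)


def slots_below_privacy(histogram, privacy):
    """Return the bucket keys whose count is below the privacy threshold."""
    return sorted(k for k, count in histogram.items() if count < privacy)


hist = duration_histogram([10, 12, 20, 25, 65], bucket_width=10)
print(dict(hist))
print(slots_below_privacy(hist, privacy=2))
```

Here the 65-second observation lands alone in its slot, so emitting the histogram as-is would expose a single vehicle's measurement even though the pair as a whole has 5 observations.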

kevinkreiser commented 7 years ago

just to be clear i think things would be vastly simplified if we do what @dnesbitt61 is suggesting, so yeah i hope when we get clarification the answer is make it so :smile: