Deciding what gets included in the archived traffic statistics

opentraffic / architecture

OTv1 overview


Deciding what gets included in the archived traffic statistics #7

Open Holly-Transport opened 9 years ago

Holly-Transport commented 9 years ago

In addition to storing the average travel time by road segment and by time period, it would be very useful if we could also include the number of observations behind each average, both to establish the reliability of the results and to support other applications that may rely on such data. Including observation counts makes the pool more valuable. Of course, if there is only one data contributor in a given region, publishing counts may conflict with that contributor's commercial data-security concerns.

Thus, a technical challenge may be posed. Would it be possible to make the number of observations accessible only in cases where at least two operators cover roughly the same geographic area?
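One way to picture the rule asked about above is a publish step that drops the observation count unless enough distinct contributors are present. This is only a minimal sketch: `SegmentStat`, `publish`, and `MIN_CONTRIBUTORS` are hypothetical names, and it assumes "roughly the same geographic area" is approximated by a single road segment, which the thread does not specify.

```python
from dataclasses import dataclass

# Assumption: the threshold of two operators comes from the question above.
MIN_CONTRIBUTORS = 2


@dataclass(frozen=True)
class SegmentStat:
    """Hypothetical internal record for one road segment and time period."""
    segment_id: str
    mean_travel_time_s: float
    n_observations: int
    contributor_ids: frozenset  # kept internally, never published


def publish(stat: SegmentStat) -> dict:
    """Return the public view of a statistic.

    The average is always included; the observation count is included
    only when at least MIN_CONTRIBUTORS distinct contributors supplied data.
    """
    record = {
        "segment_id": stat.segment_id,
        "mean_travel_time_s": stat.mean_travel_time_s,
    }
    if len(stat.contributor_ids) >= MIN_CONTRIBUTORS:
        record["n_observations"] = stat.n_observations
    return record
```

A real deployment would probably evaluate the contributor count over a wider area than one segment, since a single operator could otherwise be identified on any segment that only it covers.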

mattwigway commented 9 years ago

Additionally, we probably want to store statistics by time of day and day of week. Even if we were just storing an average, we need to store the number of observations internally in order to continue updating that average. Additionally, we want old observations to become less and less relevant over time. This might be a good place to apply a Bayesian method, using the previous estimate as a prior and the observation as the likelihood. Otherwise, we might need to store every observation, anonymized in some form.
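To make the two ideas in this comment concrete, here is a rough sketch. `RunningMean` shows why the count has to be stored to keep updating a plain average. `GaussianEstimate` is one possible reading of the Bayesian suggestion, assuming a normal prior and a normal likelihood with known observation variance; the `discount` parameter, which inflates the prior variance so older data carries less weight, is an assumption of this sketch and not something proposed in the thread.

```python
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Plain average that can be updated only because the count is stored."""
    mean: float = 0.0
    count: int = 0

    def update(self, x: float) -> None:
        self.count += 1
        self.mean += (x - self.mean) / self.count


@dataclass
class GaussianEstimate:
    """Normal prior over a segment's travel time (mean and variance)."""
    mean: float
    var: float

    def update(self, x: float, obs_var: float, discount: float = 1.0) -> None:
        """Combine the prior with one observation x of variance obs_var.

        discount in (0, 1] widens the prior before the update, so that
        earlier observations gradually lose influence (assumed mechanism).
        """
        prior_var = self.var / discount
        precision = 1.0 / prior_var + 1.0 / obs_var
        self.mean = (self.mean / prior_var + x / obs_var) / precision
        self.var = 1.0 / precision
```

With `discount=1.0` nothing is forgotten and the update is the standard conjugate normal update; smaller values pull the estimate more strongly toward recent observations. Neither class needs the raw observations to be retained.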

mattwigway commented 9 years ago

@kpwebb points out they used OLAP cubes before. This is a sensible approach.
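For readers unfamiliar with the idea, the core of a cube can be illustrated with a dictionary keyed by dimensions. This is not an OLAP engine and not the system @kpwebb used; the dimension choice (segment, day of week, hour) follows the earlier comment, and `cube`, `add_observation`, and `roll_up` are hypothetical names. Storing a sum and a count per cell, instead of an average, is what lets cells be combined along any dimension.

```python
from collections import defaultdict
from datetime import datetime

# Cell key: (segment_id, day_of_week, hour). Cell value: [sum_s, count].
cube = defaultdict(lambda: [0.0, 0])


def add_observation(segment_id: str, when: datetime, travel_time_s: float) -> None:
    """Add one travel-time observation to its cell."""
    cell = cube[(segment_id, when.weekday(), when.hour)]
    cell[0] += travel_time_s
    cell[1] += 1


def roll_up(segment_id: str, day_of_week=None, hour=None):
    """Aggregate cells for a segment; None means 'all values' on that dimension.

    Returns (mean_travel_time_s or None, observation_count).
    """
    total, count = 0.0, 0
    for (seg, dow, hr), (cell_sum, cell_count) in cube.items():
        if seg != segment_id:
            continue
        if day_of_week is not None and dow != day_of_week:
            continue
        if hour is not None and hr != hour:
            continue
        total += cell_sum
        count += cell_count
    return (total / count if count else None), count
```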

laurentg commented 9 years ago

To help anonymization, there is probably no need to store the contextual "path" information for each segment, i.e. the preceding / following speed profile. This may not be enough, but it should make rebuilding whole paths from the data more difficult.

Thinking about it, this contextual path information could theoretically help produce more precise data, for example when computing intersection turns. This last point may need a bit more discussion; I may open a new issue for it.

bmander commented 9 years ago

I'm seeing talk about a lot of situations in which we want to be able to provide data products that are anonymized, but are made from non-anonymized datasets. For example: rolling-average statistics, turn restrictions, recasting histograms into different axes, context-dependent speed calculations, and so on. Considering that we're proposing to create a formal organization, I propose that one of the roles of the organization is to maintain the security of slightly-less-than-anonymized data in order to retain the flexibility to innovate in the creation of fully anonymized data products at a later time.