Include a "prevalence" property in public data extracts

opentraffic / datastore

OTv2: centralized ingest and aggregation of anonymous traffic data

GNU Lesser General Public License v3.0

Datastore's internal histogram tile files store the number of vehicles/observations in each bin. Public data extracts convert these exact counts into a coarser "prevalence" property that can be shared publicly.

The goal is to share a measurement that indicates the rough confidence of speed estimates on a segment and conveys the approximate relative volume of vehicles across segments, without sharing counts so precise that competitors could use them to understand a data provider's business.

As a temporary placeholder, we round counts to the nearest 10 (see https://github.com/opentraffic/datastore/blob/308c8b48256359a0824210f16a7247962d4f87dd/scripts/make_speeds.py#L189-L190).
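
For illustration, rounding to the nearest 10 might look like the sketch below. This is not copied from make_speeds.py; the function name `round_to_nearest_ten` is hypothetical, and the sketch assumes non-negative integer counts with halves rounded up.

```python
def round_to_nearest_ten(count):
    """Round a non-negative integer count to the nearest multiple of 10.

    Hypothetical helper for illustration only. Integer arithmetic is used
    so that halves always round up (15 -> 20), which avoids the
    round-half-to-even behavior of Python 3's built-in round().
    """
    return (count + 5) // 10 * 10


if __name__ == "__main__":
    for c in (0, 4, 5, 14, 15, 123):
        print(c, round_to_nearest_ten(c))
```

One consequence of this placeholder is that counts from 1 to 4 collapse to 0, so sparsely observed segments become indistinguishable from unobserved ones, while large counts still reveal their magnitude to within 5.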

Let's consider better alternatives.

From @dnesbitt61:

I propose an integer scale from 1 to 10, which will not give any false notion of precision. I think any count > 120 readings per hour should map to 10. It seems to me that 2 probes per minute should be enough readings to have high confidence. Some other value might be used here, but the idea is that above some threshold there are enough readings, and we don't want to indicate absolute magnitude.

Maybe use sqrt of count to make the range non-linear. Not sure where to add any noise: to the initial count or to the value after sqrt?

from math import sqrt

def prevalence(count):
  # Counts above 120 readings per hour saturate at the top of the scale
  if count > 120:
    return 10
  return int(sqrt(count))
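
On the open question of where to add noise, the two options could be compared with a sketch like the one below. Everything here beyond the 120-count threshold and the square root is an assumption made for illustration: the function names, the uniform noise distribution, the `scale` and `jitter` parameters, and the clamping to the range 0 to 10 are hypothetical and are not part of the proposal.

```python
import math
import random

MAX_PREVALENCE = 10
SATURATION_COUNT = 120  # threshold taken from the proposal above


def prevalence_noise_before(count, rng, scale=0.1):
    """Option A (assumed): perturb the raw count, then take the square root.

    The noise is multiplicative, so it grows with the count.
    """
    if count > SATURATION_COUNT:
        return MAX_PREVALENCE
    noisy_count = count * (1.0 + rng.uniform(-scale, scale))
    value = int(math.sqrt(max(noisy_count, 0.0)))
    return max(0, min(MAX_PREVALENCE, value))


def prevalence_noise_after(count, rng, jitter=0.5):
    """Option B (assumed): take the square root, then perturb the result.

    The noise has the same size at every point on the output scale.
    """
    if count > SATURATION_COUNT:
        return MAX_PREVALENCE
    noisy_value = math.sqrt(count) + rng.uniform(-jitter, jitter)
    return max(0, min(MAX_PREVALENCE, int(noisy_value)))


if __name__ == "__main__":
    rng = random.Random(71)  # fixed seed so runs are repeatable
    for c in (0, 3, 30, 90, 120, 500):
        print(c, prevalence_noise_before(c, rng), prevalence_noise_after(c, rng))
```

With noise applied before the square root, low counts are barely perturbed in absolute terms, so the output for sparse segments stays close to the unperturbed value. With noise applied after, every bucket has the same chance of shifting by one step regardless of the underlying count. Note that, like the function above, both variants return 0 for a count of 0, which falls outside the proposed 1 to 10 scale.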

Include a "prevalence" property in public data extracts #71