nmalkin closed this issue 12 years ago.
@ozten, I feel like I brought this issue up at some point, but I don't remember if we decided anything about it.
@ozten suggests keeping the metric as-is, but giving it a "cartoony" name, like "Persona adoption," to emphasize that the value itself is meaningless (or, more charitably, inaccurate).
The report has been renamed as discussed (with a clarification message added to the description). If more drastic action is required, reopen this issue.
Report #1 presents the average number of sites a user logs into with Persona (as of #29, it is the mean, not the median). The value is computed as the mean of all values of the `number_sites_signed_in` KPI across all data points on a given day.

**Problem**
Consider this sequence of operations by a single user:
To compute the mean value, we will take the sum of all values for `number_sites_signed_in` (0+0+0+1+2+3=6) and divide by the total number of data points (6) to get a mean value of 1, while the correct value is, of course, 4.

In general, the problem is that multiple interactions by a single user are treated as equivalent to a single interaction by multiple users.
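A minimal sketch of the computation described above, using the six data points from the single-user example (the list literal is taken from the example; this is not the actual report code):

```python
# The report takes the mean of number_sites_signed_in over
# all data points recorded on a given day.
points = [0, 0, 0, 1, 2, 3]  # six data points, all from one user

mean = sum(points) / len(points)
print(mean)  # 1.0 -- the value the report would show
```

Because the six points are averaged as if they came from six different users, the single user's repeated interactions drag the mean down.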
One way to account for this would be to try to aggregate the data points by user (i.e., figure out which data points came from the same person) and use only the maximum value. However, this is costly and has undesirable privacy implications.
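A hypothetical sketch of that per-user aggregation, assuming data points could be attributed to users (the user IDs and second user are invented for illustration; the issue notes this attribution is exactly what is costly and privacy-problematic):

```python
from collections import defaultdict

# (user, number_sites_signed_in) pairs -- attribution is assumed here.
data_points = [
    ("user_a", 0), ("user_a", 0), ("user_a", 0),
    ("user_a", 1), ("user_a", 2), ("user_a", 3),
    ("user_b", 2),
]

# Keep only each user's maximum value...
per_user_max = defaultdict(int)
for user, value in data_points:
    per_user_max[user] = max(per_user_max[user], value)

# ...then average across users rather than across raw data points.
mean = sum(per_user_max.values()) / len(per_user_max)
print(mean)  # (3 + 2) / 2 = 2.5
```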
Another way to handle it would be to weight higher values of `number_sites_signed_in` (e.g., count a 2 with twice the weight of a 1, and so on). This is equivalent to saying, "oh, I just saw a 2. That means I also saw a 1, but that 1 shouldn't count." This is a more sensible approach, but note that it wouldn't fully correct the bias in the example above; nor (for the same reason) can it account for repeated sign-ins to the same site.

One more possibility is to do nothing, since we keep saying that this is not a very meaningful metric and we only care about its derivative. This is obviously the easiest, though we would probably have to stop calling it "average number of sites logged in."