mozilla / mozanalysis

A library for Mozilla experiments analysis
https://mozilla.github.io/mozanalysis/
Mozilla Public License 2.0

Ad click rate #179

Closed rebecca-burwei closed 1 year ago

rebecca-burwei commented 1 year ago

Ad click rate

For the Google Trending experiment, we need this metric: ad click rate = (sum of ad clicks over all clients) / (sum of SAPs over all clients)
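
For concreteness, here is a minimal sketch of the ratio-of-sums computation on a per-client dataframe (the column names are placeholders, not the real telemetry/metric-hub columns):

```python
import pandas as pd

# Illustrative per-client data; the column names are placeholders, not the
# real telemetry/metric-hub columns.
df = pd.DataFrame({
    "client_id": ["a", "b", "c"],
    "ad_clicks": [2, 0, 5],
    "sap": [10, 3, 40],
})

# Ratio of sums: aggregate the numerator and denominator over all clients
# first, then divide. This is not the same as averaging per-client ratios.
ad_click_rate = df["ad_clicks"].sum() / df["sap"].sum()
print(ad_click_rate)  # 7 / 53 ≈ 0.132
```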

This is a PR to allow Mozanalysis (and eventually Jetstream) to compute this metric.

This PR changes Mozanalysis from truncating outliers to capping outliers.

I suggest this change because truncating outliers doesn't make sense when your metric involves multiple columns in a dataframe. For example, if we truncate outliers, we have to answer awkward questions such as which column's outliers should cause a client to be dropped, and whether a client who is extreme in one column but typical in the other should be removed from both the numerator and the denominator.
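
To make the mechanical difference concrete, here is a small sketch with made-up data (this is just an illustration, not the mozanalysis implementation):

```python
import pandas as pd

# Made-up per-client data with one extreme client.
df = pd.DataFrame({
    "ad_clicks": [1, 0, 2, 3, 500],
    "sap":       [8, 5, 20, 15, 9000],
})
caps = df.quantile(0.95)  # per-column thresholds

# Truncating: drop any client above a threshold in any column, which removes
# that client from both the numerator and the denominator.
truncated = df[(df <= caps).all(axis=1)]

# Capping: keep every client, but clip each column at its threshold.
capped = df.clip(upper=caps, axis=1)

print(truncated["ad_clicks"].sum() / truncated["sap"].sum())
print(capped["ad_clicks"].sum() / capped["sap"].sum())
```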

Capping avoids these problems (we keep all clients) and has been shown to have good theoretical properties when used with a mean: see Lugosi & Mendelson. (In the paper, they call it a "trimmed mean" rather than a "capped mean", and they show that, as an estimator of the true mean of a heavy-tailed distribution, it converges at the optimal rate, i.e. no slower than any other estimator.)

It is possible that a truncated mean also has the same good theoretical properties... I just haven't read anything that proves it.
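
For intuition, here is a toy simulation of my own (not from the paper or any mozanalysis code): on a heavy-tailed sample, both capping and truncating pull the estimate down a little, but they make it far less sensitive to a handful of extreme values than the plain mean.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = 1.5                  # Lomax/Pareto II shape: finite mean, infinite variance
true_mean = 1 / (shape - 1)  # = 2.0

def capped_mean(x, q=0.99):
    """Clip values above the in-sample q-th quantile, then average (keeps every point)."""
    return np.clip(x, None, np.quantile(x, q)).mean()

def truncated_mean(x, q=0.99):
    """Drop values above the in-sample q-th quantile, then average."""
    return x[x <= np.quantile(x, q)].mean()

errors = {"plain": [], "capped": [], "truncated": []}
for _ in range(2000):
    x = rng.pareto(shape, size=500)
    errors["plain"].append(x.mean() - true_mean)
    errors["capped"].append(capped_mean(x) - true_mean)
    errors["truncated"].append(truncated_mean(x) - true_mean)

# Capping and truncating both introduce some downward bias, but they damp the
# wild swings the plain mean suffers from occasional huge values.
for name, errs in errors.items():
    errs = np.asarray(errs)
    print(f"{name:>9}: bias={errs.mean():+.3f}  spread={errs.std():.3f}")
```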

One advantage of truncation is that it may model our revenue more closely. For example, if outliers come from bots, we probably don't get paid for their searches and ad clicks, so dropping those clients entirely would align the search metrics more closely with revenue.

There is also this analysis of 100 A/A tests where truncating and capping give similar results and capping is recommended.

I think the question of whether we should change our defaults is open for debate.

bochocki commented 1 year ago

This is really cool! I didn't realize that you could cap heavy-tailed distributions and still get good estimates of the mean, but it makes sense.

danielkberry commented 1 year ago

Ok, so I like the idea of swapping from truncating to capping, but I'm wondering how this change would impact in-flight experiments. If we merged this without doing anything else, it would leave experiments in an "inconsistent" state (results already calculated would use truncation while later results would use capping).

I can think of 3 things we could do, and I'd love to get @scholtzan's and @mikewilli's thoughts here.

  1. We could merge this change without changing Jetstream. This would leave results inconsistent, but would be the least effort.
  2. We could manually rerun all live experiments to update them to the new methodology. This would be expensive both in effort (someone has to do it) and in BigQuery costs.
  3. We could prioritize development on the "versioning the metrics layer" idea, where each experiment gets annotated with the versions of mozanalysis, jetstream, and metric-hub that were live when it started (rough sketch below). Then we could decide to use the new version of mozanalysis only for experiments that launch after it's live. This would be expensive to build, but would solve this problem for future updates as well.
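
For illustration only, such an annotation might look something like this; the keys and versions are hypothetical, and nothing like this exists in the tooling today.

```python
# Hypothetical sketch of a per-experiment "version pin" captured at launch
# time; none of these keys, slugs, or version numbers are real.
version_pin = {
    "experiment_slug": "google-trending-example",  # made-up slug
    "mozanalysis": "2023.5.0",
    "jetstream": "2023.5.1",
    "metric_hub": "2023-05-01",
}
# At analysis time, Jetstream could resolve the pin and run the experiment
# with the library versions that were live when it started.
```
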
scholtzan commented 1 year ago

Ideally we'd want 3., but the implementation might take some time. Likely longer than we'd want to wait to get this merged. (I'll open a follow-up issue to get started on this and bump the priority.) I guess in the meantime we'd likely want to do a mix of 1. and 2. Experiments that are less important (if such exist) might not need to be rerun. For those where we need the new results, we could rerun them. Reruns can be triggered for specific analysis periods, so maybe just rerunning for weekly would be enough here?

danielkberry commented 1 year ago

Reruns can be triggered for specific analysis periods, so maybe just rerunning for weekly would be enough here?

That's a good call. The two daily metrics (retained and unenroll) use the Binomial statistic which, AFAICT, doesn't do any filtering. So it's really Weekly and Overall metrics that would be impacted, and only Weekly could end up inconsistent.

mikewilli commented 1 year ago

I started typing a response (below) and was interrupted, but the summary is that I agree with @scholtzan.

(3) would be great, but isn't on the current priorities list as far as I know. Even if it were bumped up, it wouldn't be available for a little while at least. Selectively rerunning experiments (and specific analysis periods) would increase the cost in effort but decrease the direct costs; I think that's a reasonable tradeoff here.

I don't like (1) because it opens us up to confusion and seems like we're just kicking the can down the road. Maybe we could get by in the short term with some version pinning, but without a semver-style versioning scheme to indicate breaking changes, that pin could be accidentally ignored and break things later.

rebecca-burwei commented 1 year ago

Experiments that are less important (if such exist) might not need to be rerun.

A lot of the live "experiments" have end dates that have already passed. @danielkberry, do you think this is an indication that we wouldn't have to update them?

Do we need to update rollouts or single-branch Nimbus deployments? I know of 6 deployments that don't need updating because they are not experiments.

danielkberry commented 1 year ago

@rebecca-burwei I'm not sure I'm following. Whenever we merge this change, there would be live experiments that would need to be updated. Right now, there are 26 live experiments. But yes, rollouts would not need to be updated (since Jetstream does not run for rollouts).

It sounds like we're aligning around option 2 right now. I think it should be straightforward to write a script to grab all live experiments and kick off a job on Argo to rerun the weekly results.
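
Roughly something like this hypothetical sketch (the Experimenter endpoint, the response fields, and how the rerun is actually triggered are all assumptions to be checked against the real tooling):

```python
# Hypothetical sketch only: the API URL, response fields, and rerun invocation
# are assumptions, not the real Experimenter/Jetstream/Argo interfaces.
import json
import urllib.request

EXPERIMENTER_API = "https://experimenter.services.mozilla.com/api/v6/experiments/"  # assumed endpoint

with urllib.request.urlopen(EXPERIMENTER_API) as resp:
    experiments = json.load(resp)

for exp in experiments:
    # Field names here ("slug", "isRollout", "status") are guesses; skip
    # rollouts since Jetstream does not run for them.
    if exp.get("isRollout") or exp.get("status") != "Live":
        continue
    # Placeholder for however the weekly rerun actually gets kicked off on
    # Argo, e.g. a jetstream CLI call restricted to the weekly analysis period.
    print(f"rerun weekly analysis for {exp['slug']}")
```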

So the plan would be:

  1. Merge this PR.
  2. Update Jetstream to pick up the new mozanalysis version.
  3. Rerun the weekly results for all live experiments.

@scholtzan @mikewilli, how does that sound?

mikewilli commented 1 year ago

Yep, that sounds right to me.

scholtzan commented 1 year ago

Yes, sounds good

rebecca-burwei commented 1 year ago

Great - thanks everyone. Can we set a date to do this? @danielkberry @mikewilli @scholtzan The experiment that needs this metric got pushed to July 11, so we have some time, but of course, that date will creep up on us sooner than we expect if we don't set deadlines. :)

scholtzan commented 1 year ago

We could merge it today (or once Daniel gives it a review). Or any time really. Updating jetstream and rerunning the live experiments will be a relatively quick thing to do.

scholtzan commented 1 year ago

Rerunning the experiments can be made cheaper and quicker by only rerunning the weekly analysis period (which is the period ad_clicks are defined for).