irrationalagent closed this issue 5 years ago
Oh, good catch. It's the `WHERE cohort <= {percent}` part, the `<=` should be `<`.
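The off-by-one can be sketched in a few lines (a hypothetical Python simulation; it assumes, as the discussion implies, that `cohort` is an integer bucket in the range 0-99):

```python
import operator

# Hypothetical cohort layout: 100 equal buckets, numbered 0 through 99.
cohorts = range(100)

def sample_size(percent, op):
    """Count cohorts selected by the predicate `cohort <op> percent`."""
    return sum(1 for c in cohorts if op(c, percent))

print(sample_size(10, operator.le))  # buggy <= : 11 cohorts, i.e. an 11% sample
print(sample_size(10, operator.lt))  # fixed <  : 10 cohorts, i.e. a true 10% sample
print(sample_size(50, operator.le))  # buggy <= : 51 cohorts, i.e. a 51% sample
```

Because cohort 0 is included, `<=` always selects one extra bucket.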
> I'm filing this issue but I'm not sure we should fix it because it would cause some data weirdness.
Right, makes sense.
> Maybe we should just rename the datasets to `*_sampled_11` and `*_sampled_51`.
Yep, happy to do this instead. I'll need to liaise with @jbuck re: pausing the import, then doing the `ALTER TABLE`, then deploying edited import scripts to use the new names, then un-pausing the import.
> Yep, happy to do this instead. I'll need to liaise with @jbuck re: pausing the import, then doing the `ALTER TABLE`, then deploying edited import scripts to use the new names, then un-pausing the import.
Yea, I think we should do this. Other people may be using this data going forward so it would be best not to confuse them. Obviously not a "needs to happen right now" sort of thing though.
CC @jmccrosky, I believe this affects you: FxA sampled_10 and sampled_50 are really sampled_11 and sampled_51. Sorry if this has caused you any inconvenience. This has been true forever and will continue to be, so you may need to update your queries if you are multiplying by 10/2.
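For anyone updating those queries, the correction amounts to swapping the naive multiplier for one based on the true sampling rate (a hedged sketch; the count value and variable names below are made up for illustration):

```python
# A "sampled_10" table actually holds an 11% sample (cohorts 0-10 of 100),
# so scaling by 10 overestimates; the correct multiplier is 100/11.
sampled_count = 1_100              # hypothetical count from a *_sampled_10 table

old_estimate = sampled_count * 10          # assumes a 10% sample: too high
new_estimate = sampled_count * 100 / 11    # actual 11% sample

print(old_estimate, round(new_estimate))   # the old estimate is 10% too high
```

The same reasoning gives 100/51 instead of 2 for the "sampled_50" tables.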
We could fix the import scripts with the less-than fix, then re-import everything too.
@jbuck any sense of how costly that would be? It would have the advantage of not having to change past queries that assume 10/50 (though there aren't that many of them that we really care about).
We're currently using 25% of disk space on the RS nodes, so we wouldn't need to scale that at all. I think it'd just be time we'd need to pay, and I think the last time I ran this it took about a week to do the full 2 years
Would re-running the import salvage any of the data from our previous two incidents in Dec and Late Jan (viewable here) or are those logs gone gone (one was due to the disk write issue, so not too optimistic there)?
If they are salvageable, I would support re-running the import with the < fix at the same time to kill two birds. If not, let's hold off on the re-import for now and just re-name the tables.
> Would re-running the import salvage any of the data from our previous two incidents in Dec and Late Jan (viewable here) or are those logs gone gone (one was due to the disk write issue, so not too optimistic there)?
My memory is sketchy but I think one of those gaps (December?) was just caused by transient import errors when @jbuck migrated the import scripts to fxa-admin, so should be salvageable. And I thought the disk write thing only affected Amplitude, so maybe both are?
@philbooth do you want to prep two patches? One for the table renames, and one for the sampling fix? We can roll with the table renames while the re-import runs in a separate database
Yes! Sorry, been meaning to get round to that, doing it right now.
Re-opening until the actual import is done.
Okay, I've started the re-import with the new docker image. I'll update this issue on a daily basis until it's done
It's updating in a separate database, so it won't be visible in Redash until I swap the two when it's done
Imported activity events up to 2018-06-14
Imported activity events up to 2017-07-20
from mtg: open until backfill is done
imported all activity events up to 2017-02-24 (all done!), now running vacuum on activity events
all activity events up to date; imported all flow events up to 2019-01-28 (2019-02-03 had some data quality issues that I needed to fix manually)
imported all flow events up to 2018-11-21
imported all flow events up to 2018-10-06
imported all flow events up to 2018-03-27
imported all flow events up to 2017-12-27
imported all flow events up to 2017-06-24
@jbuck, can we close this issue now?
@jbuck I'm only seeing activity_events up to 2019-02-11
Closing this, everything looks good now.
Got a request for data going back as far as possible. Using some cached results in Redshift it's actually possible to cobble together a time series of sampled data stretching all the way back to 2015.
Unfortunately, when comparing our unsampled data to our sampled datasets I was reminded of the fact that the sampled versions, when multiplied by the appropriate factor (x10 or x2), overestimate the data by a very consistent factor. I neglected to file a bug earlier so here I am.
The business part of sampling appears to be done around here: https://github.com/mozilla/fxa-activity-metrics/blob/4141223dfab61e4c34fb18204a6219800f41903b/import_events.py#L113-L117
Is the problem here that there is a 0 cohort that gets included in the comparison above, e.g. you end up with cohorts 0-10 or 0-50, which is 11 (or 51) cohorts? In other words, line 117 should really be `WHERE cohort < {percent}`.

I'm filing this issue but I'm not sure we should fix it because it would cause some data weirdness. Maybe we should just rename the datasets to `*_sampled_11` and `*_sampled_51`. You would then need to multiply the results by approximately 9.0909 and 1.9608, respectively, if you want a more accurate estimate of the true volume.

CC @davismtl I believe this to be one source of the mismatch between re:dash and amplitude that we've observed when trying to get a long-run sense of YoY changes.
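The quoted correction factors follow directly from the actual sample sizes; a quick check, assuming the 100-cohort layout discussed above:

```python
# `cohort <= 10` over cohorts 0-99 keeps 11 of 100 cohorts (an 11% sample),
# and `cohort <= 50` keeps 51 of 100 (a 51% sample), so the true
# scale-up multipliers are 100/11 and 100/51 rather than 10 and 2.
factor_10 = 100 / 11  # multiplier for the *_sampled_11 data (naively x10)
factor_50 = 100 / 51  # multiplier for the *_sampled_51 data (naively x2)

print(round(factor_10, 4), round(factor_50, 4))  # 9.0909 1.9608
```

This also explains the consistent overestimate: multiplying an 11% sample by 10 inflates counts by 10%, and multiplying a 51% sample by 2 inflates them by 2%.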