irrationalagent closed this issue 5 years ago
Oh, good catch. It's the `WHERE cohort <= {percent}` part, the `<=` should be `<`.
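The off-by-one can be sketched in a few lines (a hypothetical Python simulation; it assumes, as the discussion implies, that `cohort` is an integer bucket in the range 0-99):

```python
import operator

# Hypothetical cohort layout: 100 equal buckets, numbered 0 through 99.
cohorts = range(100)

def sample_size(percent, op):
    """Count cohorts selected by the predicate `cohort <op> percent`."""
    return sum(1 for c in cohorts if op(c, percent))

print(sample_size(10, operator.le))  # buggy <= : 11 cohorts, i.e. an 11% sample
print(sample_size(10, operator.lt))  # fixed <  : 10 cohorts, i.e. a true 10% sample
print(sample_size(50, operator.le))  # buggy <= : 51 cohorts, i.e. a 51% sample
```

Because cohort 0 is included, `<=` always selects one extra bucket.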
> I'm filing this issue but I'm not sure we should fix it because it would cause some data weirdness.
Right, makes sense.
> Maybe we should just rename the datasets to `*_sampled_11` and `*_sampled_51`.
Yep, happy to do this instead. I'll need to liaise with @jbuck re: pausing the import, then doing the `ALTER TABLE`, then deploying edited import scripts to use the new names, then un-pausing the import.
> Yep, happy to do this instead. I'll need to liaise with @jbuck re: pausing the import, then doing the `ALTER TABLE`, then deploying edited import scripts to use the new names, then un-pausing the import.
Yea, I think we should do this. Other people may be using this data going forward so it would be best not to confuse them. Obviously not a "needs to happen right now" sort of thing though.
CC @jmccrosky, I believe this affects you: FxA sampled_10 and sampled_50 are really sampled_11 and sampled_51. Sorry if this has caused you any inconvenience. This has been true forever and will continue to be, so you may need to update your queries if you are multiplying by 10/2.
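For anyone updating those queries, the correction amounts to swapping the naive multiplier for one based on the true sampling rate (a hedged sketch; the count value and variable names below are made up for illustration):

```python
# A "sampled_10" table actually holds an 11% sample (cohorts 0-10 of 100),
# so scaling by 10 overestimates; the correct multiplier is 100/11.
sampled_count = 1_100              # hypothetical count from a *_sampled_10 table

old_estimate = sampled_count * 10          # assumes a 10% sample: too high
new_estimate = sampled_count * 100 / 11    # actual 11% sample

print(old_estimate, round(new_estimate))   # the old estimate is 10% too high
```

The same reasoning gives 100/51 instead of 2 for the "sampled_50" tables.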
We could fix the import scripts with the less-than fix, then re-import everything too.
@jbuck any sense of how costly that would be? It would have the advantage of not having to change past queries that assume 10/50 (though there aren't that many of them that we really care about).
We're currently using 25% of disk space on the RS nodes, so we wouldn't need to scale that at all. I think it'd just be time we'd need to pay, and I think the last time I ran this it took about a week to do the full 2 years
Would re-running the import salvage any of the data from our previous two incidents in Dec and Late Jan (viewable here) or are those logs gone gone (one was due to the disk write issue, so not too optimistic there)?
If they are salvageable, I would support re-running the import with the < fix at the same time to kill two birds. If not, let's hold off on the re-import for now and just re-name the tables.
> Would re-running the import salvage any of the data from our previous two incidents in Dec and Late Jan (viewable here) or are those logs gone gone (one was due to the disk write issue, so not too optimistic there)?
My memory is sketchy but I think one of those gaps (December?) was just caused by transient import errors when @jbuck migrated the import scripts to fxa-admin, so should be salvageable. And I thought the disk write thing only affected Amplitude, so maybe both are?
@philbooth do you want to prep two patches? One for the table renames, and one for the sampling fix? We can roll with the table renames while the re-import runs in a separate database
Yes! Sorry, been meaning to get round to that, doing it right now.
Re-opening until the actual import is done.
Okay, I've started the re-import with the new docker image. I'll update this issue on a daily basis until it's done
It's updating in a separate database, so it won't be visible in Redash until I swap the two when it's done
Imported activity events up to 2018-06-14
Imported activity events up to 2017-07-20
from mtg: open until backfill is done
imported all activity events up to 2017-02-24 (all done!), now running vacuum on activity events
all activity events up to date; imported all flow events up to 2019-01-28 (2019-02-03 had some data quality issues that I needed to fix manually)
imported all flow events up to 2018-11-21
imported all flow events up to 2018-10-06
imported all flow events up to 2018-03-27
imported all flow events up to 2017-12-27
imported all flow events up to 2017-06-24
@jbuck, can we close this issue now?
@jbuck I'm only seeing activity_events up to 2019-02-11
Closing this, everything looks good now.
Got a request for data going back as far as possible. Using some cached results in Redshift it's actually possible to cobble together a time series of sampled data stretching all the way back to 2015.
Unfortunately, when comparing our unsampled data to our sampled datasets I was reminded of the fact that the sampled versions, when multiplied by the appropriate factor (x10 or x2), overestimate the data by a very consistent factor. I neglected to file a bug earlier so here I am.
The business part of sampling appears to be done around here: https://github.com/mozilla/fxa-activity-metrics/blob/4141223dfab61e4c34fb18204a6219800f41903b/import_events.py#L113-L117
Is the problem here that there is a 0 cohort that gets included in the comparison above, e.g. you end up with cohorts 0-10 or 0-50, which is 11 (or 51) cohorts? In other words, line 117 should really be `WHERE cohort < {percent}`.

I'm filing this issue but I'm not sure we should fix it because it would cause some data weirdness. Maybe we should just rename the datasets to `*_sampled_11` and `*_sampled_51`. You would then need to multiply the results by approximately 9.0909 and 1.9608, respectively, if you want a more accurate estimate of the true volume.

CC @davismtl I believe this to be one source of the mismatch between re:dash and amplitude that we've observed when trying to get a long-run sense of YoY changes.
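The quoted correction factors follow directly from the actual sample sizes; a quick check, assuming the 100-cohort layout discussed above:

```python
# `cohort <= 10` over cohorts 0-99 keeps 11 of 100 cohorts (an 11% sample),
# and `cohort <= 50` keeps 51 of 100 (a 51% sample), so the true
# scale-up multipliers are 100/11 and 100/51 rather than 10 and 2.
factor_10 = 100 / 11  # multiplier for the *_sampled_11 data (naively x10)
factor_50 = 100 / 51  # multiplier for the *_sampled_51 data (naively x2)

print(round(factor_10, 4), round(factor_50, 4))  # 9.0909 1.9608
```

This also explains the consistent overestimate: multiplying an 11% sample by 10 inflates counts by 10%, and multiplying a 51% sample by 2 inflates them by 2%.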