Taking the WIP off this and calling it ready for review.

It does two things:

- Refactors common code from `import_activity_events.py` and `import_flow_events.py` into `import_events.py` (rough sketch below).
- Adds a new `flow_experiments` table (and downsampled versions thereof).

(There's also another change to tighten up the cleaning script, but hopefully that's not too contentious.)
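To give a rough idea of the shape I'm aiming for with the refactoring, here's a minimal sketch of the kind of shared entry point I mean; `run_import`, its parameters, the bucket name and the COPY details are all illustrative, not the actual interface in `import_events.py`:

```python
# Hypothetical sketch only; the real import_events.py interface differs.

def run_import(db, s3_prefix, table, columns, days):
    """Shared per-day import loop used by both event importers.

    `db` is assumed to expose execute(sql) against redshift, and `days`
    is an iterable of YYYY-MM-DD strings naming the exported CSVs.
    """
    column_list = ", ".join(columns)
    for day in days:
        # COPY the day's CSV from S3 into redshift; credentials and
        # error handling are omitted for brevity.
        db.execute(
            "COPY {table} ({columns}) "
            "FROM 's3://{prefix}/events-{day}.csv' "
            "FORMAT AS CSV".format(
                table=table, columns=column_list, prefix=s3_prefix, day=day
            )
        )

# import_activity_events.py and import_flow_events.py would then reduce
# to calls like:
#
#   run_import(db, "example-bucket", "activity_events",
#              ["timestamp", "uid", "type"], days)
```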
Here's some sample experiment data:
```
fxa=# select experiment, cohort, "timestamp", substring(flow_id, 0, 8) as flow_id, substring(uid, 0, 8) as uid, export_date
fxa-# from flow_experiments_sampled_10
fxa-# limit 3;
      experiment      |  cohort   |      timestamp      | flow_id |   uid   | export_date
----------------------+-----------+---------------------+---------+---------+-------------
 connectAnotherDevice | treatment | 2017-03-24 23:59:17 | 1438838 |         | 2017-03-25
 connectAnotherDevice | treatment | 2017-03-24 23:58:28 | 68a209a | f87458a | 2017-03-25
 connectAnotherDevice | treatment | 2017-03-24 23:59:06 | 31f273c | f87458a | 2017-03-25
(3 rows)
```
I'm interested in feedback on the new schema, the SQL, and the Python itself.
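For concreteness, this is roughly the shape of table I have in mind, inferred from the sample rows above; the column sizes, dist key and sort key below are assumptions for illustration rather than the exact DDL in the migration:

```python
# Illustrative only: column sizes and dist/sort keys here are assumptions
# based on the sample rows above, not necessarily the real DDL.
Q_CREATE_FLOW_EXPERIMENTS_TABLE = """
    CREATE TABLE IF NOT EXISTS flow_experiments (
        experiment VARCHAR(40) NOT NULL,
        cohort VARCHAR(40) NOT NULL,
        "timestamp" TIMESTAMP NOT NULL SORTKEY,
        flow_id VARCHAR(64) NOT NULL DISTKEY,
        uid VARCHAR(64),        -- can be blank, as in the first sample row
        export_date DATE NOT NULL
    );
"""
```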
There's still scope for more refactoring in the future:

- We can eliminate duplicated logic for the `flow_metadata` and `flow_experiments` data sets.
- I'd like to invoke the summary script directly from `import_activity_events.py`, so that we can eliminate the separate `make summarize` step in cron (see the sketch after this list).
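For that summary step, I'm picturing something along these lines, assuming the summary logic can be exposed as an importable entry point (the `summarize_events` module and its `summarize` function are hypothetical names):

```python
# Hypothetical sketch: import_activity_events.py calling the summary
# logic itself, so cron no longer needs a separate `make summarize` step.
import import_events
import summarize_events  # hypothetical module wrapping the summary script

def main():
    days = import_events.run_import(...)  # shared import loop (args elided)
    summarize_events.summarize(days)      # only summarize the days we touched

if __name__ == "__main__":
    main()
```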
I figured it made sense to get the changes here reviewed before setting off down those roads, though.
There were some question marks about import slowness with these changes, which I suspected was caused by excessive load from insane redash scheduling. To prove that, I set up a fresh redshift cluster with the same settings and ran the full import there. Within 10 hours it had overtaken the import job on our current cluster, which had been running for 4 days at that point, and it finished the entire import in less than two days (not sure of the exact figure because I didn't time it and it finished while I was asleep). As things stand, the main import still has about 90 days of data left to import and we are approaching the 5-day mark.
Sadly, the redash scheduler is still going nuts, rendering our main cluster mostly useless at the moment.

I wonder, would it be okay to keep my "other" redshift cluster running to act as a kind of fast replica for dev/debugging work? Waiting for stuff to finish on the main cluster is pretty frustrating at the moment. It's called `fxa-pb-proof`, in case anyone wants to play with it (but let's not schedule anything from redash to run there 😄).
Oh, I forgot to mention the third thing it does! It also down-samples the flow data, whoops! 😊
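In case it helps reviewers, the down-sampling is conceptually just keeping a deterministic slice of flows, along the lines of the illustrative query below; the modulus predicate is one possible way to pick a stable 10% of flow ids, not necessarily the exact SQL used here:

```python
# Illustrative only: one way to populate a 10% sampled table by keeping a
# deterministic slice of flow ids; the real sampling queries may differ.
Q_POPULATE_FLOW_EXPERIMENTS_SAMPLED_10 = """
    INSERT INTO flow_experiments_sampled_10
    SELECT *
    FROM flow_experiments
    -- Bucket on the first 8 hex digits of flow_id so that every row for
    -- a sampled flow lands in the sampled table together.
    WHERE STRTOL(SUBSTRING(flow_id, 1, 8), 16) % 10 = 0;
"""
```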
from mtg: @rfk to take a look at this
Fixes #61.
Not opening this for comment yet (although comments are still welcome, of course), more to raise visibility. This is currently deployed to the redshift helper, so if the imports suddenly fail there, it's likely because of something I've done wrong here.

As I write this, the only change here is to refactor `import_activity_events.py` so that most of the code is pulled out to a generic `import_events.py`. The next step is to use the refactored code to add automatic expiration to the flow import.

Obviously, everything related to `flow_metadata` will remain specific to the flow import, which may mean looping twice, first through the CSVs and then through the imported days in `flow_events`. Or, more likely, I'll pass some kind of command function into `import_events.py` so that it can return control to the calling script on each iteration.

Anyway, the point is that I'm expecting what's here to evolve rapidly.
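To make the "command function" idea concrete, this is roughly the shape I'm imagining, with all of the names below hypothetical:

```python
# Hypothetical sketch of the "command function" approach: import_events
# drives the per-day loop and hands control back to the calling script
# for anything specific to it, e.g. the flow_metadata handling.

def run_import(db, table, days, after_day=None):
    for day in days:
        copy_day_into_redshift(db, table, day)  # shared COPY-from-S3 step
        if after_day is not None:
            # Give the calling script a chance to do its own per-day work,
            # e.g. import_flow_events.py updating flow_metadata.
            after_day(db, day)

def copy_day_into_redshift(db, table, day):
    # Placeholder for the shared COPY logic; omitted in this sketch.
    pass
```

`import_flow_events.py` would pass its `flow_metadata` handling as `after_day`, while `import_activity_events.py` could omit it entirely.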