Taking the WIP off this and calling it ready for review.

It does two things:

- Refactors common code from `import_activity_events.py` and `import_flow_events.py` into `import_events.py` (rough sketch below).
- Adds a new `flow_experiments` table (and downsampled versions thereof).

(There's also another change to tighten up the cleaning script, but hopefully that's not too contentious.)
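To give a rough idea of the shape I'm aiming for with the refactoring, here's a minimal sketch of the kind of shared entry point I mean; `run_import`, its parameters, the bucket name and the COPY details are all illustrative, not the actual interface in `import_events.py`:

```python
# Hypothetical sketch only; the real import_events.py interface differs.

def run_import(db, s3_prefix, table, columns, days):
    """Shared per-day import loop used by both event importers.

    `db` is assumed to expose execute(sql) against redshift, and `days`
    is an iterable of YYYY-MM-DD strings naming the exported CSVs.
    """
    column_list = ", ".join(columns)
    for day in days:
        # COPY the day's CSV from S3 into redshift; credentials and
        # error handling are omitted for brevity.
        db.execute(
            "COPY {table} ({columns}) "
            "FROM 's3://{prefix}/events-{day}.csv' "
            "FORMAT AS CSV".format(
                table=table, columns=column_list, prefix=s3_prefix, day=day
            )
        )

# import_activity_events.py and import_flow_events.py would then reduce
# to calls like:
#
#   run_import(db, "example-bucket", "activity_events",
#              ["timestamp", "uid", "type"], days)
```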
Here's some sample experiment data:
```
fxa=# select experiment, cohort, "timestamp", substring(flow_id, 0, 8) as flow_id, substring(uid, 0, 8) as uid, export_date
fxa-# from flow_experiments_sampled_10
fxa-# limit 3;
      experiment      |  cohort   |      timestamp      | flow_id |   uid   | export_date
----------------------+-----------+---------------------+---------+---------+-------------
 connectAnotherDevice | treatment | 2017-03-24 23:59:17 | 1438838 |         | 2017-03-25
 connectAnotherDevice | treatment | 2017-03-24 23:58:28 | 68a209a | f87458a | 2017-03-25
 connectAnotherDevice | treatment | 2017-03-24 23:59:06 | 31f273c | f87458a | 2017-03-25
(3 rows)
```
I'm interested in feedback on the new schema, the SQL, and the Python itself.
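For concreteness, this is roughly the shape of table I have in mind, inferred from the sample rows above; the column sizes, dist key and sort key below are assumptions for illustration rather than the exact DDL in the migration:

```python
# Illustrative only: column sizes and dist/sort keys here are assumptions
# based on the sample rows above, not necessarily the real DDL.
Q_CREATE_FLOW_EXPERIMENTS_TABLE = """
    CREATE TABLE IF NOT EXISTS flow_experiments (
        experiment VARCHAR(40) NOT NULL,
        cohort VARCHAR(40) NOT NULL,
        "timestamp" TIMESTAMP NOT NULL SORTKEY,
        flow_id VARCHAR(64) NOT NULL DISTKEY,
        uid VARCHAR(64),        -- can be blank, as in the first sample row
        export_date DATE NOT NULL
    );
"""
```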
There's still scope for more refactoring in the future:

- We can eliminate duplicated logic for the `flow_metadata` and `flow_experiments` data sets.
- I'd like to invoke the summary script directly from `import_activity_events.py`, so that we can eliminate the separate `make summarize` step in cron (see the sketch after this list).
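For that summary step, I'm picturing something along these lines, assuming the summary logic can be exposed as an importable entry point (the `summarize_events` module and its `summarize` function are hypothetical names):

```python
# Hypothetical sketch: import_activity_events.py calling the summary
# logic itself, so cron no longer needs a separate `make summarize` step.
import import_events
import summarize_events  # hypothetical module wrapping the summary script

def main():
    days = import_events.run_import(...)  # shared import loop (args elided)
    summarize_events.summarize(days)      # only summarize the days we touched

if __name__ == "__main__":
    main()
```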
I figured it made sense to get the changes here reviewed before setting off down those roads, though.
There were some question marks about import slowness with these changes, which I suspected was caused by excessive load from insane redash scheduling. To prove that, I set up a fresh redshift cluster with the same settings and ran the full import there. Within 10 hours it had overtaken the import job on our current cluster, which had been running for 4 days at that point, and it finished the entire import in less than two days (not sure of the exact figure because I didn't time it and it finished while I was asleep). As things stand, the main import still has about 90 days of data left to import and we are approaching the 5-day mark.
Sadly, the redash scheduler is still going nuts, rendering our main cluster mostly useless at the moment.

I wonder, would it be okay to keep my "other" redshift cluster running to act as a kind of fast replica for dev/debugging work? Waiting for stuff to finish on the main cluster is pretty frustrating at the moment. It's called `fxa-pb-proof`, in case anyone wants to play with it (but let's not schedule anything from redash to run there 😄).
Oh, I forgot to mention the third thing it does! It also down-samples the flow data, whoops! 😊
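In case it helps reviewers, the down-sampling is conceptually just keeping a deterministic slice of flows, along the lines of the illustrative query below; the modulus predicate is one possible way to pick a stable 10% of flow ids, not necessarily the exact SQL used here:

```python
# Illustrative only: one way to populate a 10% sampled table by keeping a
# deterministic slice of flow ids; the real sampling queries may differ.
Q_POPULATE_FLOW_EXPERIMENTS_SAMPLED_10 = """
    INSERT INTO flow_experiments_sampled_10
    SELECT *
    FROM flow_experiments
    -- Bucket on the first 8 hex digits of flow_id so that every row for
    -- a sampled flow lands in the sampled table together.
    WHERE STRTOL(SUBSTRING(flow_id, 1, 8), 16) % 10 = 0;
"""
```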
from mtg: @rfk to take a look at this
Fixes #61.
Not opening this for comment yet (although comments are still welcome, of course), more to raise visibility. This is currently deployed to the redshift helper, so if the imports suddenly fail there, it's likely because of something I've done wrong here.

As I write this, the only change here is to refactor `import_activity_events.py` so that most of the code is pulled out to a generic `import_events.py`. The next step is to use the refactored code to add automatic expiration to the flow import.

Obviously, everything related to `flow_metadata` will remain specific to the flow import, which may mean looping twice, first through the CSVs and then through the imported days in `flow_events`. Or, more likely, I'll pass some kind of command function into `import_events.py` so that it can return control to the calling script on each iteration.

Anyway, the point is that I'm expecting what's here to evolve rapidly.
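To make the "command function" idea concrete, this is roughly the shape I'm imagining, with all of the names below hypothetical:

```python
# Hypothetical sketch of the "command function" approach: import_events
# drives the per-day loop and hands control back to the calling script
# for anything specific to it, e.g. the flow_metadata handling.

def run_import(db, table, days, after_day=None):
    for day in days:
        copy_day_into_redshift(db, table, day)  # shared COPY-from-S3 step
        if after_day is not None:
            # Give the calling script a chance to do its own per-day work,
            # e.g. import_flow_events.py updating flow_metadata.
            after_day(db, day)

def copy_day_into_redshift(db, table, day):
    # Placeholder for the shared COPY logic; omitted in this sketch.
    pass
```

`import_flow_events.py` would pass its `flow_metadata` handling as `after_day`, while `import_activity_events.py` could omit it entirely.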