Note that when we come to doing this for real, we'll need similar scripts for the activity and email events too. I figured it made sense to work everything out for one import first, and the flow import is the most complicated, so it seemed like the best test candidate.
Tangential fix alert.
While doing some other analysis I noticed that our import scripts currently store all of the raw `flow.continued.{flow_id}` events in the `flow_events` table. As well as wasting space, it makes the result set of `SELECT DISTINCT type FROM flow_events` really, really unwieldy.

Fixing it is easier in this PR's codebase than it is in `master` because of all the refactoring that's happened over there, so if we're ultimately ditching those scripts anyway, I didn't see the point of doing it in `master`. Hence 273a56f.
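For the avoidance of doubt, the fix just coalesces those per-flow event types into one type at import time. Something like this rough sketch (the helper name and regex are hypothetical, the real change is in 273a56f):

```python
import re

# Hypothetical sketch: collapse per-flow "flow.continued.{flow_id}" event
# types into a single "flow.continued" type before rows are written to
# flow_events, assuming flow ids are hex strings. This keeps the result
# set of SELECT DISTINCT type small and readable.
FLOW_CONTINUED = re.compile(r'^flow\.continued\.[0-9a-f]+$')

def normalise_event_type(event_type):
    if FLOW_CONTINUED.match(event_type):
        return 'flow.continued'
    return event_type
```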
from mtg: closing
Related to #99. Not for merging, just for discussion / ongoing work.
I'm opening this for when @jbuck gets back from holiday; hopefully it's sufficient to unblock him on the mozlog 2 stuff. (I disappear on a holiday of my own shortly after he gets back.)
What I've done here is break the flow import apart into two separate scripts:

* `kinesis_flow_events_1.py`, which you can mostly ignore. This one is just here so I can set up and populate our temporary / raw data table for the other script to read from. The work in this script will ultimately be done by Kinesis Firehose, when we have the real mozlog 2 pipeline up and running. It can also serve as a reference when setting up the Kinesis stuff, for what the Redshift schema needs to be and what the expected CSV format is (there's a rough sketch of the load step after this list).
* `kinesis_flow_events_2.py`, which contains the good stuff. This script reads data from the temporary / raw data table and makes no assumptions about the length of time that table covers. For more about why we want to do this, see the discussion in #99.
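To give a flavour of what that load step looks like, here's a minimal sketch. The table and column names, the S3 path and the IAM role are all made up for illustration; the real details live in `kinesis_flow_events_1.py`:

```python
import psycopg2

# Hypothetical sketch only: the actual schema, S3 location and credentials
# will differ from these placeholders.
COPY_RAW_EVENTS = """
    COPY kinesis_flow_events_raw (event_time, event_type, flow_id)
    FROM 's3://example-bucket/flow-events/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/example'
    CSV TRUNCATECOLUMNS;
"""

def populate_raw_table(dsn):
    # Load the raw CSV data into the temporary table that
    # kinesis_flow_events_2.py reads from.
    with psycopg2.connect(dsn) as connection:
        with connection.cursor() as cursor:
            cursor.execute(COPY_RAW_EVENTS)
```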
Something that we lose with these changes is straightforward error correction. Because the existing scripts do everything in atomic units of a day, it's easy to delete and re-import specific days in the event of something going wrong, as in the sketch below. But hopefully our shiny new pipeline will be so perfect that we don't have to worry about that too much. (This is why there are no `export_date` columns in the new schema.)
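For comparison, the per-day correction that the existing scripts make possible is roughly this (the `export_date` column is from the old schema; the helper is hypothetical):

```python
# Hypothetical sketch of the old-style, per-day error correction that the
# new schema gives up: delete the bad day by export_date, then re-run that
# day's import.
DELETE_DAY = "DELETE FROM flow_events WHERE export_date = %s;"

def reimport_day(cursor, day):
    cursor.execute(DELETE_DAY, (day,))
    # ...then re-run the existing import script for the same date.
```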
The other side-effect of the loss of per-day semantics is that we now populate `flow_metadata` first, so that we can use `MAX(begin_time)` when drawing the line for data expiry. That seemed like the right thing to do, although maybe it's not that important really.
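Concretely, the expiry step ends up looking something like this sketch (the 30-day window, the `event_time` column and the table names are illustrative, not what the script actually uses):

```python
# Hypothetical sketch of drawing the expiry line from MAX(begin_time) in
# flow_metadata, rather than from an export_date.
EXPIRE_EVENTS = """
    DELETE FROM kinesis_flow_events
    WHERE event_time < (
        SELECT DATEADD(day, -30, MAX(begin_time))
        FROM kinesis_flow_metadata
    );
"""

def expire_old_events(cursor):
    cursor.execute(EXPIRE_EVENTS)
```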
If anyone wants to muck about with these scripts for real, they're in `~/kinesis-flow-events` on our Redshift helper EC2 instance. There's also a week of data currently imported to `kinesis_`-prefixed table names in Redshift, which I imported (and tested) just to make sure the code actually works.