mozilla / fxa-activity-metrics

A server for managing the Firefox Accounts metrics database and pipeline

wip: break up the flow import for kinesis integration #100

Closed philbooth closed 6 years ago

philbooth commented 6 years ago

Related to #99. Not for merging, just for discussion / ongoing work.

I'm opening this for when @jbuck gets back from holiday; hopefully it is sufficient to unblock him on the mozlog 2 stuff. (I disappear off on a holiday of my own shortly after he gets back.)

What I've done here is break apart the flow import into 2 separate scripts:

Something that we lose with these changes is straightforward error correction. Because the existing scripts do everything in atomic units of a day, it's easy to delete and re-import specific days in the event of something going wrong. But hopefully our shiny new pipeline will be so perfect that we don't have to worry about that too much. (This is why there are no export_date columns in the new schema.)

The other side-effect of the loss of per-day semantics is that we now populate flow_metadata first, so that we can use MAX(begin_time) when drawing the line for data expiry. That seemed like the right thing to do, although maybe it's not that important really.
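As a rough illustration of the expiry idea above, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for Redshift. Only flow_metadata, flow_events and MAX(begin_time) come from the description; the column layout, begin_time being epoch seconds, the 90-day retention window and the function name expire_old_flows are all assumptions made for the example, not the contents of the real scripts.

```python
import sqlite3

# Hypothetical retention window; the real value is not stated in this thread.
RETENTION_SECONDS = 90 * 24 * 60 * 60


def expire_old_flows(conn, retention_seconds=RETENTION_SECONDS):
    """Expire flows relative to the newest imported flow, not the wall clock.

    Assumes flow_metadata(flow_id, begin_time) with begin_time in epoch
    seconds, and flow_events(flow_id, ...). Returns the cutoff that was
    applied, or None if flow_metadata is empty.
    """
    (newest,) = conn.execute(
        "SELECT MAX(begin_time) FROM flow_metadata"
    ).fetchone()
    if newest is None:
        return None  # nothing imported yet, so nothing to expire

    cutoff = newest - retention_seconds

    # Remove events first, while the metadata rows that identify the
    # expired flows still exist.
    conn.execute(
        "DELETE FROM flow_events WHERE flow_id IN "
        "(SELECT flow_id FROM flow_metadata WHERE begin_time < ?)",
        (cutoff,),
    )
    conn.execute("DELETE FROM flow_metadata WHERE begin_time < ?", (cutoff,))
    return cutoff
```

Because the cutoff depends on the newest begin_time, flow_metadata has to be populated before the expiry step runs, which matches the ordering described above.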

If anyone wants to muck about with these scripts for real, they're in ~/kinesis-flow-events on our redshift helper EC2 instance. There's also a week of data in kinesis_-prefixed tables in Redshift, which I imported (and tested) just to make sure the code actually works.

philbooth commented 6 years ago

Note that when we come to doing this for real, we'll need similar scripts for the activity and email events too. I figured it made sense to work everything out for one import first, and the flow import is the most complicated, so it seemed like the best test candidate.

philbooth commented 6 years ago

Tangential fix alert.

While doing some other analysis I noticed that our import scripts currently store all of the raw flow.continued.{flow_id} events in the flow_events table. As well as wasting space, it makes the result set of SELECT DISTINCT type FROM flow_events really, really unwieldy.
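A minimal sketch of the kind of filtering described above, not the actual change in 273a56f: it drops any event whose type is the flow.continued. prefix followed by an identifier. The event shape (a dict with a "type" key) and the helper names are assumptions for illustration.

```python
CONTINUED_PREFIX = "flow.continued."


def is_raw_continued_event(event_type):
    """True for types of the form flow.continued.{flow_id}."""
    return (
        event_type.startswith(CONTINUED_PREFIX)
        and len(event_type) > len(CONTINUED_PREFIX)
    )


def drop_continued_events(events):
    """Return the events that are safe to store in flow_events.

    `events` is assumed to be an iterable of dicts with a "type" key.
    """
    return [e for e in events if not is_raw_continued_event(e["type"])]
```

Since each flow_id is distinct, every stored flow.continued.{flow_id} row adds a new value to SELECT DISTINCT type, which is why filtering before the import keeps that result set manageable.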

Fixing it is easier in this PR's codebase than it is in master because of all the refactoring that's happened over there, so if we're ultimately ditching those scripts anyway I didn't see the point of doing it in master.

Hence 273a56f.

vladikoff commented 6 years ago

from mtg: closing