vbudhram closed this pull request 7 years ago.
@vbudhram, code looks good!
I'm taking the liberty of running an import now to try it out, hope that's cool with you. You're welcome to `DROP` those tables and re-import yourself later, if you want to get a feel for how it runs / what the logs look like / how long it takes / whatever.
And of course, if you have any questions about running the import jobs, just shout. (the readme for this repo has some useful info but I just noticed it needs to be brought up to date a bit, will do that at some point).
> I'm taking the liberty of running an import now to try it out...

...aaaand it revealed a problem in `import_events.py`. I'll open a PR to fix that separately.
@vbudhram, an error message that you'll become quite familiar with if you spend any length of time importing data into Redshift is this one:
```
Traceback (most recent call last):
  File "import_email_events.py", line 21, in <module>
    perm_columns=COLUMNS)
  File "/home/ec2-user/fxa-activity-metrics/import_events.py", line 233, in run
    import_day(day)
  File "/home/ec2-user/fxa-activity-metrics/import_events.py", line 179, in import_day
    **CONFIG))
  File "/usr/lib/python2.7/site-packages/postgres/__init__.py", line 374, in run
    cursor.run(sql, parameters)
  File "/usr/lib/python2.7/site-packages/postgres/cursors.py", line 92, in run
    self.execute(sql, parameters)
  File "/usr/lib64/python2.7/site-packages/psycopg2/extras.py", line 288, in execute
    return super(NamedTupleCursor, self).execute(query, vars)
psycopg2.InternalError: Load into table 'temporary_raw_email_data' failed. Check 'stl_load_errors' system table for details.
```
I got this just now, trial-running the email event import.
When you encounter it, you want to query the table it mentions, `stl_load_errors`, to see what the problem is. The error information in there is pretty useful in my experience. You can either query it from the Redshift console, or I also have a Redash query set up that takes a date as the query param, e.g.:
https://sql.telemetry.mozilla.org/queries/1753/source?p_date=2017-04-13
(the date is the date the error occurred on, not the date of the file being imported)
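If you'd rather poke at it from a script, here's a minimal sketch using the same `postgres` library the import code runs on. The connection string is a placeholder, and the selected columns are from Redshift's documented `stl_load_errors` table:

```python
from postgres import Postgres

# Placeholder DSN -- substitute the real cluster credentials.
db = Postgres("postgres://user:password@your-cluster:5439/dbname")

# stl_load_errors records one row per failed load; we filter on the
# date the error occurred (not the date of the file being imported).
rows = db.all("""
    SELECT starttime, filename, line_number, colname, err_reason
      FROM stl_load_errors
     WHERE starttime::date = %(date)s
  ORDER BY starttime DESC
""", {"date": "2017-04-13"})

for row in rows:
    print(row)
```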
In this case, the error we have is `Extra column(s) found` on the first line of `email-events-2017-04-12.csv`, which is the first file that the import tried to load (it runs in reverse chronological order).
So the problem here is that the columns in the `temp_schema` and `temp_columns` arguments need to exactly match what's in the CSV. Here, your code has six columns (five explicit plus the implied `timestamp` column), whereas the CSV contains eight, so we need to define all eight columns in your script. Also, the order of `temp_columns` needs to exactly match the order of the columns in the CSV. The easiest way to see all the CSV column names is in the lua script, since it can be hard to find a line with every column populated in the CSV itself.
So, all of that was a very long-winded way of saying we probably want `temp_columns` to be `"flow_id, domain, template, type, bounced, complaint, locale"`, and then `temp_schema` needs to have the same columns (but not necessarily in the same order there).
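To make that concrete, here's a rough sketch of what those two arguments might end up looking like. The types and lengths below are guesses for illustration only, not the real schema:

```python
# Rough sketch only -- column types/lengths are illustrative guesses.
# temp_columns must list the CSV columns in exactly the CSV's order
# (timestamp is implied, so it isn't listed here).
TEMP_COLUMNS = "flow_id, domain, template, type, bounced, complaint, locale"

# temp_schema defines the same columns; order doesn't matter here.
TEMP_SCHEMA = """
    flow_id VARCHAR(64),
    domain VARCHAR(255),
    template VARCHAR(64),
    type VARCHAR(64),
    bounced BOOLEAN,
    complaint BOOLEAN,
    locale VARCHAR(32)
"""
```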
Make sense?
And sorry, I realise I've failed to document any of that. When I update the readme here, all this will have to go in.
@philbooth Nope, that makes sense. I didn't include those fields because I didn't think they were needed in the table. Will update!
@philbooth Tested the import script and wrote some queries for it: https://sql.telemetry.mozilla.org/queries/4236. Looks like it's working as expected. r?
@rfk Since phil is out, mind an r? We would also have to hook this up to cron, along the lines of the sketch below.
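A hypothetical crontab entry might look something like this. The repo path matches the traceback above, but the schedule and log path are placeholders, not what's actually deployed:

```
# Hypothetical crontab entry -- schedule and log path are placeholders.
# Run the email event import once a day, after the day's CSV lands.
30 8 * * * cd /home/ec2-user/fxa-activity-metrics && python import_email_events.py >> /var/log/fxa-email-import.log 2>&1
```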
@vbudhram Looks like the table schema changed when you swapped "bounce" and "complaint". Should we drop the tables and re-import when this merges to master, to ensure we get clean data?
> should we drop the tables and re-import when this merges to master to ensure we get clean data
Done!
This PR adds the script to import email events into Redshift.