vbudhram closed this pull request 7 years ago.
@vbudhram, code looks good!
I'm taking the liberty of running an import now to try it out, hope that's cool with you. You're welcome to `DROP` those tables and re-import yourself later, if you want to get a feel for how it runs / what the logs look like / how long it takes / whatever.
And of course, if you have any questions about running the import jobs, just shout. (the readme for this repo has some useful info but I just noticed it needs to be brought up to date a bit, will do that at some point).
> I'm taking the liberty of running an import now to try it out...

...aaaand it revealed a problem in `import_events.py`. I'll open a PR to fix that separately.
@vbudhram, an error message that you'll become quite familiar with if you spend any length of time importing data into Redshift is this one:
```
Traceback (most recent call last):
  File "import_email_events.py", line 21, in <module>
    perm_columns=COLUMNS)
  File "/home/ec2-user/fxa-activity-metrics/import_events.py", line 233, in run
    import_day(day)
  File "/home/ec2-user/fxa-activity-metrics/import_events.py", line 179, in import_day
    **CONFIG))
  File "/usr/lib/python2.7/site-packages/postgres/__init__.py", line 374, in run
    cursor.run(sql, parameters)
  File "/usr/lib/python2.7/site-packages/postgres/cursors.py", line 92, in run
    self.execute(sql, parameters)
  File "/usr/lib64/python2.7/site-packages/psycopg2/extras.py", line 288, in execute
    return super(NamedTupleCursor, self).execute(query, vars)
psycopg2.InternalError: Load into table 'temporary_raw_email_data' failed. Check 'stl_load_errors' system table for details.
```
I got this just now, trial-running the email event import.
When you encounter it, you want to query the table it mentions, `stl_load_errors`, to see what the problem is. The error information in there is pretty useful in my experience. You can either query it from the Redshift console, or I also have a Redash query set up that takes a date as the query param, e.g.:
https://sql.telemetry.mozilla.org/queries/1753/source?p_date=2017-04-13
(the date is the date the error occurred on, not the date of the file being imported)
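If you'd rather poke at it from a script, here's a minimal sketch using the same `postgres` library the import code runs on. The connection string is a placeholder, and the selected columns are from Redshift's documented `stl_load_errors` table:

```python
from postgres import Postgres

# Placeholder DSN -- substitute the real cluster credentials.
db = Postgres("postgres://user:password@your-cluster:5439/dbname")

# stl_load_errors records one row per failed load; we filter on the
# date the error occurred (not the date of the file being imported).
rows = db.all("""
    SELECT starttime, filename, line_number, colname, err_reason
      FROM stl_load_errors
     WHERE starttime::date = %(date)s
  ORDER BY starttime DESC
""", {"date": "2017-04-13"})

for row in rows:
    print(row)
```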
In this case, the error we have is `Extra column(s) found` on the first line of `email-events-2017-04-12.csv`, which is the first file that the import tried to load (it runs in reverse chronological order).
So the problem here is that the columns in the `temp_schema` and `temp_columns` arguments need to exactly match what's in the CSV. Here, your code has six columns (five explicit plus the implied `timestamp` column), whereas the CSV contains eight, so we need to define all eight columns in your script. Also, the order of `temp_columns` needs to exactly match the order of the columns in the CSV. The easiest way to see all the CSV column names is in the lua script, since it can be hard to find a line with every column populated in the CSV itself.
So, all of that was a very long-winded way of saying we probably want `temp_columns` to be `"flow_id, domain, template, type, bounced, complaint, locale"`, and then `temp_schema` needs to have the same columns (but not necessarily in the same order there).
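To make that concrete, here's a rough sketch of what those two arguments might end up looking like. The types and lengths below are guesses for illustration only, not the real schema:

```python
# Rough sketch only -- column types/lengths are illustrative guesses.
# temp_columns must list the CSV columns in exactly the CSV's order
# (timestamp is implied, so it isn't listed here).
TEMP_COLUMNS = "flow_id, domain, template, type, bounced, complaint, locale"

# temp_schema defines the same columns; order doesn't matter here.
TEMP_SCHEMA = """
    flow_id VARCHAR(64),
    domain VARCHAR(255),
    template VARCHAR(64),
    type VARCHAR(64),
    bounced BOOLEAN,
    complaint BOOLEAN,
    locale VARCHAR(32)
"""
```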
Make sense?
And sorry, I realise I've failed to document any of that. When I update the readme here, all this will have to go in.
@philbooth Nope, that makes sense. I didn't include those fields because I didn't think they were needed in the table. Will update!
@philbooth Tested the import script and wrote some queries for it: https://sql.telemetry.mozilla.org/queries/4236. Looks like it's working as expected. r?
@rfk Since phil is out, mind an r? We would also have to hook this up to cron, along the lines of the sketch below.
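A hypothetical crontab entry might look something like this. The repo path matches the traceback above, but the schedule and log path are placeholders, not what's actually deployed:

```
# Hypothetical crontab entry -- schedule and log path are placeholders.
# Run the email event import once a day, after the day's CSV lands.
30 8 * * * cd /home/ec2-user/fxa-activity-metrics && python import_email_events.py >> /var/log/fxa-email-import.log 2>&1
```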
@vbudhram Looks like the table schema changed when you swapped "bounce" and "complaint". Should we drop the tables and re-import when this merges to master, to ensure we get clean data?
> should we drop the tables and re-import when this merges to master to ensure we get clean data
Done!
This PR adds the script to import email events into Redshift.