Standardize dataset format

What is the optimal way to serve the dataset(s)?

The size of the Seattle dataset makes processing on my local machine pretty resource intensive. I'm also not sure how to do network centrality calculations in a parallel/distributed way.

Here are the three approaches I see, but I haven't looked at Houston's data yet.

Standard (3.2GB unzipped):
- one csv with one email per row
- recipient columns are semicolon-separated lists
- include source file and department as columns
Optimized for networking:
- two csvs, one of nodes and one of edges
- node csv columns: message_id, datetime, source_file, department
- edge csv columns: sender, recipient, field_type (to, cc, bcc), message_id
Optimized for speed:
- same as above
- but with a person csv with columns: person_id, person_name
- and the edge csv has only person_id columns, not character strings
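
A minimal sketch of how the standard csv could be split into the speed-optimized layout, to make the three tables concrete. The input column names (`message_id`, `datetime`, `sender`, `to`, `cc`, `bcc`, `source_file`, `department`) are assumptions about the standard csv, `normalize` is a hypothetical helper, and the sample rows are made up. It uses only the standard library and reads rows one at a time, though the output tables are still held in memory here.

```python
import csv
import io

RECIPIENT_FIELDS = ("to", "cc", "bcc")  # assumed recipient column names
NODE_FIELDS = ("message_id", "datetime", "source_file", "department")


def normalize(rows):
    """Split one-email-per-row records into node, edge, and person tables."""
    person_ids = {}  # person_name -> person_id, in order of first appearance
    nodes, edges = [], []

    def pid(name):
        # setdefault assigns a new integer id only the first time a name is seen
        return person_ids.setdefault(name, len(person_ids))

    for row in rows:
        nodes.append({key: row[key] for key in NODE_FIELDS})
        sender = pid(row["sender"].strip())
        for field in RECIPIENT_FIELDS:
            # recipient columns are semicolon-separated lists; skip empty entries
            for name in (row.get(field) or "").split(";"):
                name = name.strip()
                if name:
                    edges.append((sender, pid(name), field, row["message_id"]))

    people = [(person_id, name) for name, person_id in person_ids.items()]
    return nodes, edges, people


# Made-up sample in the assumed standard layout
sample = io.StringIO(
    "message_id,datetime,sender,to,cc,bcc,source_file,department\n"
    "m1,2017-01-02T09:00:00,alice@example.gov,bob@example.gov; carol@example.gov,,,a.mbox,Parks\n"
    "m2,2017-01-02T10:00:00,bob@example.gov,alice@example.gov,carol@example.gov,,a.mbox,Parks\n"
)
nodes, edges, people = normalize(csv.DictReader(sample))
```

Each edge comes out as (sender person_id, recipient person_id, field_type, message_id), so the edge table holds small integers instead of repeated address strings. For the full 3.2GB file, the appends would probably need to be replaced with `csv.writer` calls so that only the person lookup stays in memory.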
