The 'json2csv' tool converts Twitter enriched native (EN) and Gnip Activity Stream (AS) JSON to the comma-separated values (CSV) format. Tweet attributes of interest are specified by referencing a Tweet Template of your choice. If the Tweet Template contains an attribute, it is written to the output CSV files; if it does not, the attribute is dropped and not written. You can design your own Tweet Template, or use one of the provided example Templates.
This tool pulls JSON Tweets from an input folder and attempts to convert all .json and .json.gz files it finds there, writing the resulting CSV files to an output folder. This tool works with Activity Stream Tweet JSON produced with Gnip Full-Archive Search, 30-Day Search, and Historical PowerTrack. This tool was designed to convert JSON Tweets in bulk, and retains the JSON filename, e.g. MyTweets.json --> MyTweets.csv.
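Here is a rough sketch of that flow, purely for illustration (this is not the tool's actual code; it only prints what would be converted):

```ruby
require 'zlib'

# Illustration only: walk the inbox, handle gzipped and plain JSON files, and
# show the output name each file would get (MyTweets.json --> MyTweets.csv).
Dir.glob('./inbox/*.{json,json.gz}').each do |path|
  contents = if path.end_with?('.gz')
               Zlib::GzipReader.open(path) { |gz| gz.read }
             else
               File.read(path)
             end
  csv_name = File.basename(path).sub(/\.json(\.gz)?\z/, '.csv')
  puts "Would convert #{path} (#{contents.bytesize} bytes) to ./outbox/#{csv_name}"
end
```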
The json2csv tool is configured with a single YAML file and provides basic logging. This tool is written in Ruby and references a few basic gems (json, csv, and logging).
One of the first steps is to 'design' (or choose from our examples) a Tweet Template which identifies all the Tweet attributes that you are interested in. The conversion process uses this template and creates a CSV file with a column for every attribute in the template. The conversion process represents an opportunity to 'tune' what you want to export. For example, the standard Twitter metadata includes the numeric character position of hashtags in a tweet message. You may decide that you do not need this information, and therefore can omit those details from your Tweet template.
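For instance, hashtag entities in the full payload carry an 'indices' array with those character positions; a trimmed template entry could keep only the hashtag text (a hypothetical fragment, not a complete template):

```json
{
  "twitter_entities": {
    "hashtags": [
      {"text": "HashTag1"}
    ]
  }
}
```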
Before deciding to perform this type of conversion, you should consider the following trade-offs:
This tool defaults to ./config/config.yaml for its configuration/settings file. Below are all of the default settings. If the configured directories do not exist they will be automatically created.
```yaml
json2csv:
  activity_template: ./templates/tweet_standard.json
  inbox: ./inbox
  outbox: ./outbox
  save_json: true
  processed_box: ./input/processed
  compress_csv: false #TODO: conventions? retain compression?
  arrays_to_collapse: hashtags,user_mentions,twitter_entities.urls,gnip.urls,matching_rules,topics
  header_overrides: actor.location.objectType,actor.location.displayName
  header_mappings: {}

logging:
  name: json2csv.log
  log_path: ./log/
  warn_level: info
  size: 10 #MB
  keep: 2
```
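The settings file is plain YAML, so it can be read with Ruby's standard library. The sketch below shows the idea only; it is not how the tool itself loads its configuration:

```ruby
require 'yaml'

# Sketch: load the default configuration file and read a few settings.
config = YAML.load_file('./config/config.yaml')

template = config['json2csv']['activity_template']   # ./templates/tweet_standard.json
inbox    = config['json2csv']['inbox']               # ./inbox
log_name = config['logging']['name']                 # json2csv.log

puts "Template: #{template}, inbox: #{inbox}, log: #{log_name}"
```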
A Tweet Template is an example tweet payload in JSON that contains all the fields you want to export to the CSV files. Social activities, such as tweets, are dynamic in nature, and the payloads from one tweet to another are sure to differ. One could be a geo-tagged tweet with several hashtags and mentions, while the next is a retweet with a URL.
This example tweet is referred to as the conversion 'tweet template.' The conversion process loads this template and then tours each of your historical tweets and exports all metadata that is specified in the template. Here is a short example template that would export the bare minimum of metadata:
{ "id": "tag:search.twitter.com,2005:418130988250570752", "actor": { "id": "id:twitter.com:17200003", "preferredUsername": "jimmoffitt", }, "verb": "post", "postedTime": "2013-12-31T21:26:10.000Z", "body": "Example Tweet #HashTag1 #HashTag2" "twitter_entities": { "hashtags": [ {"text": "HashTag1"}, {"text": "HashTag2"} ] }
This would be represented in a CSV file as:
```
id,actor.id,preferredUsername,verb,postedTime,body,hashtags
418130988250570752,17200003,jimmoffitt,post,2013-12-31T21:26:10.000Z,Example Tweet #HashTag1 #HashTag2,"HashTag1,HashTag2"
```
A couple things to note about this JSON to CSV conversion:
It can be difficult and time-consuming to find the perfect Tweet 'in the wild', an actual Tweet that encapsulates all of the metadata you care about, so you may need to 'hand build' your own template Tweet. This means assembling a JSON object by picking and choosing the fields you want and copying them into a JSON file. When doing this, keep the following details in mind:
* `gnip.matching_rules.0.value` != `gnip.matching_rules.value`. The matching_rules attribute is an array, so `gnip.matching_rules.0.value` will match, while `gnip.matching_rules.value` will not.
Here are several pre-built examples:
CSV column names are generated by referencing the JSON object names and using dot notation to indicate levels and arrays of attributes. For example, here are some JSON attributes:
{ "id": "tag:search.twitter.com,2005:418130988250570752", "actor": { "id": "id:twitter.com:17200003" }, "twitter_entities": { "hashtags": [ { "text": "HappyNewYear", } ] } }
Using dot notation, these attributes would be represented with the following header:
```
id, actor.id, twitter_entities.hashtags.0.text
```
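The flattening behind those names can be pictured with a short sketch (not the tool's actual code): walk the parsed Tweet and join nested object keys and array indexes with dots.

```ruby
require 'json'

# Sketch: flatten nested JSON into dot-notation keys, indexing array elements.
def flatten_keys(node, prefix = '')
  case node
  when Hash
    node.flat_map { |k, v| flatten_keys(v, prefix.empty? ? k.to_s : "#{prefix}.#{k}") }
  when Array
    node.each_with_index.flat_map { |v, i| flatten_keys(v, "#{prefix}.#{i}") }
  else
    [[prefix, node]]
  end
end

tweet = JSON.parse('{"actor":{"id":"id:twitter.com:17200003"},"twitter_entities":{"hashtags":[{"text":"HappyNewYear"}]}}')
flatten_keys(tweet).each { |key, value| puts "#{key} = #{value}" }
# actor.id = id:twitter.com:17200003
# twitter_entities.hashtags.0.text = HappyNewYear
```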
Yes, Tweet and Actor IDs are handled specially. For these IDs, the string component is stripped off and only the numeric part is retained. So this JSON...
{ "id": "tag:search.twitter.com,2005:418130988250570752", "actor": { "id": "id:twitter.com:17200003" } }
...gets converted to this CSV:
```
id,actor.id
418130988250570752,17200003
```
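Since these IDs are colon-delimited, one simple way to picture the stripping (a sketch, not necessarily the tool's exact approach) is to keep the last segment:

```ruby
# Sketch: keep only the numeric portion of the colon-delimited Tweet and Actor IDs.
tweet_id = 'tag:search.twitter.com,2005:418130988250570752'.split(':').last
actor_id = 'id:twitter.com:17200003'.split(':').last

puts "#{tweet_id},#{actor_id}"   # 418130988250570752,17200003
```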
With dot notation and the many levels of JSON attributes, the auto-generated names can become quite long, so you may want to override these defaults with shortened versions. For example, because CSV header/column names are auto-generated from the JSON structure, the header for hashtags is 'twitter_entities.hashtags.0.text', when a more convenient name would be simply 'hashtags'. For that reason there is support for overriding the default names. The conversion process already has some 're-mappings' built in, and others can be added when wanted.
Here are some examples:
```
twitter_entities.hashtags.0.text              --> hashtags
twitter_entities.urls.0.expanded_url          --> twitter_expanded_urls
twitter_entities.user_mentions.0.screen_name --> user_mention_screen_names
gnip.matching_rules.0.value                   --> rule_values
```
The following Ruby method from the AppConfig class in ./lib/config.rb encapsulates all of the 'header overrides'. These default overrides are stored in a hash of (key, value) pairs, where the key is the header that is 'organically' generated as the JSON attributes are converted to dot-notation names, and the value is what is used instead in the CSV header.
If you want to extend this hash, you can either edit the `generate_special_header_mappings` method, or configure them in the `config.yaml` configuration file under the `header_mappings:` section:

```yaml
header_mappings: {"actor.summary": "profile_bio"}
```
```ruby
def generate_special_header_mappings
  mappings = Hash.new
  mappings['twitter_entities.hashtags.0.text'] = 'hashtags'
  mappings['twitter_entities.urls.0.url'] = 'twitter_urls'
  mappings['twitter_entities.urls.0.expanded_url'] = 'twitter_expanded_urls'
  mappings['twitter_entities.urls.0.display_url'] = 'twitter_display_urls'
  mappings['twitter_entities.user_mentions.0.screen_name'] = 'user_mention_screen_names'
  mappings['twitter_entities.user_mentions.0.name'] = 'user_mention_names'
  mappings['twitter_entities.user_mentions.0.id'] = 'user_mention_ids'
  mappings['gnip.matching_rules.0.value'] = 'rule_values'
  mappings['gnip.matching_rules.0.tag'] = 'rule_tags'
  mappings['gnip.language.value'] = 'gnip_lang'

  # Geographical metadata labels.
  mappings['location.geo.coordinates.0.0.0'] = 'box_sw_long'
  mappings['location.geo.coordinates.0.0.1'] = 'box_sw_lat'
  mappings['location.geo.coordinates.0.1.0'] = 'box_nw_long'
  mappings['location.geo.coordinates.0.1.1'] = 'box_nw_lat'
  mappings['location.geo.coordinates.0.2.0'] = 'box_ne_long'
  mappings['location.geo.coordinates.0.2.1'] = 'box_ne_lat'
  mappings['location.geo.coordinates.0.3.0'] = 'box_se_long'
  mappings['location.geo.coordinates.0.3.1'] = 'box_se_lat'
  mappings['geo.coordinates.0'] = 'point_long'
  mappings['geo.coordinates.1'] = 'point_lat'

  # These Klout topics need some help.
  mappings['gnip.klout_profile.topics.0.klout_topic_id'] = 'klout_topic_id'
  mappings['gnip.klout_profile.topics.0.display_name'] = 'klout_topic_name'
  mappings['gnip.klout_profile.topics.0.link'] = 'klout_topic_link'

  mappings
end
```
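In use, the hash simply substitutes the friendlier name whenever one exists; here is a sketch of the idea (not the class's actual code):

```ruby
# Sketch: swap in an override when one exists, otherwise keep the
# auto-generated dot-notation header.
mappings = { 'twitter_entities.hashtags.0.text' => 'hashtags',
             'gnip.matching_rules.0.value'      => 'rule_values' }

generated = ['id', 'actor.id', 'twitter_entities.hashtags.0.text', 'gnip.matching_rules.0.value']
header    = generated.map { |name| mappings.fetch(name, name) }

puts header.join(',')   # id,actor.id,hashtags,rule_values
```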
Twitter metadata includes many attributes that hold a variable-length list of values. Examples include hashtags, user mentions, and URLs. These attributes are stored in a JSON object named 'twitter_entities', and each is an array of length zero or more. In CSV, such an array is stored as a comma-delimited list of values wrapped in double quotes, so it can be thought of as a list within another list.
Using the hashtag example, multiple hashtags would be inserted into a single column with the 'hashtags' column header (relying on the built-in header override discussed above):
```
id,actor.id,hashtags
418130988250570752,17200003,"HashTag1,HashTag2"
```
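The double quotes are what the csv gem adds automatically when a field itself contains the delimiter; here is a sketch of collapsing a hashtag array into one field:

```ruby
require 'csv'

# Sketch: join the array values with commas; CSV quotes the field because it
# contains the delimiter.
hashtags = ['HashTag1', 'HashTag2']
row = ['418130988250570752', '17200003', hashtags.join(',')]

puts CSV.generate_line(row)   # 418130988250570752,17200003,"HashTag1,HashTag2"
```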
It depends on how many files are being processed, how many Tweets are being converted, and how many attributes are included in the template Tweet. If there are 10 million Tweets, and 200 Tweet attributes in the template, there are 2 billion attributes to process.
Using a [standard Tweet template](https://github.com/jimmoffitt/json2csv/blob/master/templates/tweet_standard.json), approximately 5 million Tweets can be processed per hour, so massive datasets can take hours to process. I wonder how fast it would run if written in Python...
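At that rate, the 10-million-Tweet example above would take roughly two hours to convert.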
File handling:
Logging: