jimmoffitt / json2csv

A Ruby app for converting Tweet JSON to CSV.

json2csv

JSON to CSV is a one-way street...

Introduction

The 'json2csv' tool manages the conversion of Twitter enriched native (EN) and Gnip Activity Stream (AS) JSON to the comma-separated values (CSV) format. Tweet attributes of interest are indicated by referencing a Tweet Template of your choice. If the Tweet Template contains an attribute, it is written to the output CSV files; if it does not, the attribute is dropped. You can design your own Tweet Template, or use one of the provided example Templates.

This tool pulls JSON Tweets from an input folder and attempts to convert all .json and .json.gz files it finds there, writing the resulting CSV files to an output folder. This tool works with Activity Stream Tweet JSON produced with Gnip Full-Archive Search, 30-Day Search, and Historical PowerTrack. This tool was designed to convert JSON Tweets in bulk, and retains the JSON filename, e.g. MyTweets.json --> MyTweets.csv.

The json2csv tool is configured with a single YAML file and provides basic logging. This tool is written in Ruby and references a few basic gems (json, csv, and logging).

One of the first steps is to 'design' (or choose from our examples) a Tweet Template which identifies all the Tweet attributes that you are interested in. The conversion process uses this template and creates a CSV file with a column for every attribute in the template. The conversion process represents an opportunity to 'tune' what you want to export. For example, the standard Twitter metadata includes the numeric character position of hashtags in a tweet message. You may decide that you do not need this information, and therefore can omit those details from your Tweet template.

Before deciding to perform this type of conversion, you should consider the following trade-offs:

  1. JSON data is multi-dimensional, with multiple levels of nested data. However, CSVs are two dimensional. Converting from JSON to CSV means that you are sacrificing detail and flexibility in the data by either flattening it, or discarding some fields from the data.
  2. If you are consuming the data into a custom app, retaining the data in JSON provides a level of flexibility not available with CSVs. For example, field order does not matter in JSON, whereas column order in CSVs matters greatly (and is therefore arguably more fragile).
  3. It is recommended to save the source JSON. It represents all the available metadata. You may decide to go back and convert metadata you did not include the first time.
Some things this tool does not do

Getting Started

Installing tool

Configuring json2csv

This tool defaults to ./config/config.yaml for its configuration/settings file. Below are all of the default settings. If the configured directories do not exist they will be automatically created.

json2csv:
  activity_template: ./templates/tweet_standard.json
  inbox: ./inbox
  outbox: ./outbox
  save_json: true
  processed_box: ./input/processed
  compress_csv: false #TODO: conventions? retain compression?

  arrays_to_collapse: hashtags,user_mentions,twitter_entities.urls,gnip.urls,matching_rules,topics
  header_overrides: actor.location.objectType,actor.location.displayName

header_mappings: {}

logging:
  name: json2csv.log
  log_path: ./log/
  warn_level: info
  size: 10 #MB
  keep: 2
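As an illustrative sketch of how these settings might be consumed (the tool's actual loader lives in ./lib/config.rb; this is an assumption for illustration, not its real code), the YAML can be parsed with the standard yaml gem and the configured directories created up front:

```ruby
require 'yaml'
require 'fileutils'

# Illustrative only: parse the same settings shown above. In practice the tool
# would read them from ./config/config.yaml instead of an inline string.
yaml = <<~YAML
  json2csv:
    inbox: ./inbox
    outbox: ./outbox
    processed_box: ./input/processed
    save_json: true
YAML

settings = YAML.safe_load(yaml)['json2csv']

# If the configured directories do not exist, create them automatically.
[settings['inbox'], settings['outbox'], settings['processed_box']].each do |dir|
  FileUtils.mkdir_p(dir)
end
```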

Tweet Templates 

A Tweet Template is an example tweet payload in JSON that contains all the fields you want to export to the CSV files. Social activities, such as tweets, are dynamic in nature, and the payloads from one tweet to another are sure to be different. One could be a geo-tagged tweet with several hashtags and mentions, while the next is a retweet with a URL.

This example tweet is referred to as the conversion 'tweet template.' The conversion process loads this template and then walks through each of your historical tweets, exporting all metadata that is specified in the template. Here is a short example template that would export the bare minimum of metadata:

{
  "id": "tag:search.twitter.com,2005:418130988250570752",
  "actor": {
    "id": "id:twitter.com:17200003",
    "preferredUsername": "jimmoffitt"
  },
  "verb": "post",
  "postedTime": "2013-12-31T21:26:10.000Z",
  "body": "Example Tweet #HashTag1 #HashTag2",
  "twitter_entities": {
    "hashtags": [
      {"text": "HashTag1"},
      {"text": "HashTag2"}
    ]
  }
}

This would be represented in a CSV file as:

id,actor.id,actor.preferredUsername,verb,postedTime,body,hashtags
418130988250570752,17200003,jimmoffitt,post,2013-12-31T21:26:10.000Z,Example Tweet #HashTag1 #HashTag2,"HashTag1,HashTag2"
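To make the template-driven selection concrete, here is a hedged sketch (the `filter_by_template` method name is hypothetical, not the tool's actual API) of keeping only the Tweet attributes that also appear in the template:

```ruby
require 'json'

# Hypothetical helper: recursively keep only the attributes that also appear
# in the Tweet Template; everything else in the Tweet is dropped.
def filter_by_template(tweet, template)
  template.each_with_object({}) do |(key, template_value), kept|
    next unless tweet.key?(key)
    kept[key] = if template_value.is_a?(Hash) && tweet[key].is_a?(Hash)
                  filter_by_template(tweet[key], template_value)
                else
                  tweet[key]
                end
  end
end

template = JSON.parse('{"id": "", "actor": {"id": ""}}')
tweet    = JSON.parse('{"id": "t1", "actor": {"id": "a1", "summary": "bio"}, "verb": "post"}')

filter_by_template(tweet, template)
# => {"id"=>"t1", "actor"=>{"id"=>"a1"}}  ("summary" and "verb" are dropped)
```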

A couple things to note about this JSON to CSV conversion:

Example Tweet Templates

It can be difficult and time-consuming to find the perfect Tweet 'in the wild', an actual Tweet that encapsulates all the metadata you care about. So you may need to 'hand build' your own template Tweet. This means assembling a JSON object by picking and choosing the fields you want and copying them into a JSON file. When doing this, keep the following details in mind:

Here are several pre-built examples:

Use-case Examples

Details, Details, Details

How are CSV column names determined?

CSV column names are generated by referencing the JSON object names and using dot notation to indicate levels and arrays of attributes. For example, here are some JSON attributes:

{
    "id": "tag:search.twitter.com,2005:418130988250570752",
    "actor": {
        "id": "id:twitter.com:17200003"
    },
    "twitter_entities": {
        "hashtags": [
            {
                "text": "HappyNewYear",
            }
        ]
    }
}

Using dot notation, these attributes would be represented with the following header:

id, actor.id, twitter_entities.hashtags.0.text
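A minimal sketch of such a flattener (an assumption for illustration, not the tool's actual code) that walks nested hashes and arrays and builds the dot-notation names:

```ruby
# Hypothetical flattener: build dot-notation column names from nested JSON,
# using numeric indices for array elements (e.g. twitter_entities.hashtags.0.text).
def flatten_attributes(obj, prefix = nil, out = {})
  case obj
  when Hash
    obj.each { |key, value| flatten_attributes(value, prefix ? "#{prefix}.#{key}" : key.to_s, out) }
  when Array
    obj.each_with_index { |value, i| flatten_attributes(value, "#{prefix}.#{i}", out) }
  else
    out[prefix] = obj
  end
  out
end

tweet = { 'id' => 't1', 'actor' => { 'id' => 'a1' },
          'twitter_entities' => { 'hashtags' => [{ 'text' => 'HappyNewYear' }] } }

flatten_attributes(tweet).keys
# => ["id", "actor.id", "twitter_entities.hashtags.0.text"]
```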
Is there any special parsing of JSON values?

Yes, Tweet and Actor IDs are handled specially. For these IDs, the string component is stripped off and only the numeric part is retained. So this JSON...

{
    "id": "tag:search.twitter.com,2005:418130988250570752",
    "actor": {
        "id": "id:twitter.com:17200003"
    }
}

...gets converted to this CSV:

id, actor.id
418130988250570752,17200003
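A one-line sketch of this trimming (the `numeric_id` helper is hypothetical, not the tool's actual method): both ID styles end with the numeric part after the final colon, so keeping the trailing segment is enough.

```ruby
# Hypothetical helper: strip the string prefix from a Tweet or Actor ID,
# keeping only the numeric part after the final colon.
def numeric_id(id_string)
  id_string.split(':').last
end

numeric_id('tag:search.twitter.com,2005:418130988250570752')  # => "418130988250570752"
numeric_id('id:twitter.com:17200003')                         # => "17200003"
```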
What if I want special names for certain columns?

With dot notation and the many levels of JSON attributes, the auto-generated names can become quite long. For example, the auto-generated header for hashtags is 'twitter_entities.hashtags.0.text', when a more convenient name would be simply 'hashtags'. Therefore there is support for overriding the default names. The conversion process has some 're-mappings' built in, and others can be added as needed.

Here are some examples:

twitter_entities.hashtags.0.text               --> hashtags
twitter_entities.urls.0.expanded_url           --> twitter_expanded_urls
twitter_entities.user_mentions.0.screen_name   --> user_mention_screen_names
gnip.matching_rules.0.value                    --> rule_values

The following Ruby method from the ./lib/config.rb AppConfig class encapsulates all the 'header overrides'. These default overrides are stored in a hash of (key, value) pairs, where the key is the header that is 'organically' generated as the JSON attributes are converted to dot-notation names, and the value is what is written to the CSV header instead.

If you want to extend this hash, you can either edit the generate_special_header_mappings method, or configure them in the config.yaml configuration file under the header_mappings: section:

   header_mappings: {"actor.summary": "profile_bio"}

def generate_special_header_mappings
  mappings = Hash.new

  mappings['twitter_entities.hashtags.0.text'] = 'hashtags'
  mappings['twitter_entities.urls.0.url'] = 'twitter_urls'
  mappings['twitter_entities.urls.0.expanded_url'] = 'twitter_expanded_urls'
  mappings['twitter_entities.urls.0.display_url'] = 'twitter_display_urls'
  mappings['twitter_entities.user_mentions.0.screen_name'] = 'user_mention_screen_names'
  mappings['twitter_entities.user_mentions.0.name'] = 'user_mention_names'
  mappings['twitter_entities.user_mentions.0.id'] = 'user_mention_ids'
  mappings['gnip.matching_rules.0.value'] = 'rule_values'
  mappings['gnip.matching_rules.0.tag'] = 'rule_tags'
  mappings['gnip.language.value'] = 'gnip_lang'

  # Geographical metadata labels.
  mappings['location.geo.coordinates.0.0.0'] = 'box_sw_long'
  mappings['location.geo.coordinates.0.0.1'] = 'box_sw_lat'
  mappings['location.geo.coordinates.0.1.0'] = 'box_nw_long'
  mappings['location.geo.coordinates.0.1.1'] = 'box_nw_lat'
  mappings['location.geo.coordinates.0.2.0'] = 'box_ne_long'
  mappings['location.geo.coordinates.0.2.1'] = 'box_ne_lat'
  mappings['location.geo.coordinates.0.3.0'] = 'box_se_long'
  mappings['location.geo.coordinates.0.3.1'] = 'box_se_lat'
  mappings['geo.coordinates.0'] = 'point_long'
  mappings['geo.coordinates.1'] = 'point_lat'

  # These Klout topics need some help.
  mappings['gnip.klout_profile.topics.0.klout_topic_id'] = 'klout_topic_id'
  mappings['gnip.klout_profile.topics.0.display_name'] = 'klout_topic_name'
  mappings['gnip.klout_profile.topics.0.link'] = 'klout_topic_link'

  mappings
end
How are arrays of metadata handled?

Twitter metadata includes many attributes that hold a variable-length list of values. Examples include hashtags, user mentions, and URLs. These metadata are stored in a JSON object named 'twitter_entities', and each attribute is an array of length zero or more. Arrays in CSV are stored as a comma-delimited list of values wrapped in double quotes, so these arrays can be thought of as a list within another list.

Using the hashtag example, multiple hashtags would be inserted into a single column with the 'hashtags' column header (relying on the built-in header override discussed above):

id,actor.id,hashtags
418130988250570752,17200003,"HashTag1,HashTag2"
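Ruby's standard csv library handles the quoting automatically when a cell value itself contains commas; a sketch of producing a row like the one above:

```ruby
require 'csv'

# The collapsed hashtag list becomes a single cell; CSV.generate_line adds the
# surrounding double quotes because the joined value contains commas.
hashtags = %w[HashTag1 HashTag2]
row = ['418130988250570752', '17200003', hashtags.join(',')]

CSV.generate_line(row)
# => "418130988250570752,17200003,\"HashTag1,HashTag2\"\n"
```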

How long does the conversion process take?

It depends on how many files are being processed, how many Tweets are being converted, and how many attributes are included in the template Tweet. If there are 10 million Tweets and 200 Tweet attributes in the template, there are 2 billion attributes to process.

Using a [standard Tweet template](https://github.com/jimmoffitt/json2csv/blob/master/templates/tweet_standard.json), approximately 5 million Tweets can be processed per hour. Massive datasets can take hours to process. I wonder how fast it would run if written in Python...

Some coding conventions/details...

File handling:

Logging: