Data conversions and examples for generating reports from twarc collections using tools such as D3.js
These utilities accept a Twitter json file (as fetched by twarc), analyze it in various ways, and output a json or csv file. The initial purpose is to feed data into D3.js for various visualizations, but the intention is to make the outputs generic enough to serve other uses as well. Each utility has a D3 example template, which it can use to generate a self-contained html file. It can also generate csv or json output, and there is a worked example of how to use csv in a pre-existing D3 chart.
The d3graph.py utility was originally added to the twarc repo as directed.py but is moving here for consistency.
All requirements may be installed with pip install -r requirements.txt:

- python-dateutil
- pytz
- tzlocal
- pysparklines
- requests_oauthlib
Install twarc according to its instructions, i.e. with pip install twarc. Run twarc.py once so that it can ask for your access token etc. (see twarc's readme). Make sure that twarc-archive.py is on the system path.
To set up a harvest:

- Create a projects subdirectory under twarc-report.
- Create a project directory under projects, named appropriately.
- Copy metadata.json into the project directory and fill in the search you want to track.
- Run ./harvest.py projects/[yourproject] to harvest your tweets (this may take some time - hours or days for very large searches).
- Run ./reportprofile.py projects/[yourproject] to see a summary of your harvest.
- Run ./harvest.py projects/[yourproject] whenever you want to update your harvest.

Note that only tweets from the last 7 days or so are available from Twitter at any given time, so be sure to update your harvest accordingly to avoid gaps.

The resulting directory layout looks like this:
twarc-report/ # local clone
projects/
assets/ # copy of twarc-report/assets/
projectA/
data/ # created by harvest.py
tweets/ # populated with tweet*.json files by harvest.py
metadata.json
timeline.html # generated by a twarc-report script
...
projectB/
...
Metadata about the project, including the search query, is kept in metadata.json. This file is created by the user and contains metadata for the harvest. It should be in this form:
{"search": "#ferguson",
"title": "Ferguson Tweets",
"creator": "Peter Binkley"}
(Currently only the search value is used, but other metadata fields will be used to populate HTML output in future releases.)
The harvested tweets and other source data are stored in the data subdirectory, with the tweets going in the tweets directory. These directories are created by harvest.py if they don't exist.
Generated HTML files use relative paths like ../assets/d3.v3.min.js to load shared libraries from the assets directory. They can be created in the project directories (projectA etc.). This allows you to publish the output by syncing the project and assets directories to a web server while excluding the data subdirectory. You can also run Python's SimpleHTTPServer in the projects directory to load examples you've created in the project directories:

python -m SimpleHTTPServer 8000

and then visit e.g. http://localhost:8000/projectA/projectA-timebar.html.
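Under Python 3, the equivalent built-in server is:

python -m http.server 8000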
The script harvest.py uses twarc's twarc-archive.py to start or update a harvest for a given search, storing the results in a given directory. The directory path is passed as the only parameter:

./harvest.py projects/ProjectA

The search is read from the project's metadata.json file, and tweets are stored in data/tweets.
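For illustration, here is a minimal sketch of that flow in Python. It is not harvest.py's actual implementation; the twarc-archive.py invocation (search string and archive directory as positional arguments) is assumed from twarc's readme.

import json, os, subprocess, sys

# read the search query from the project's metadata.json
project = sys.argv[1]  # e.g. projects/ProjectA
with open(os.path.join(project, "metadata.json")) as f:
    search = json.load(f)["search"]

# create data/tweets if it doesn't exist, then hand off to twarc-archive.py
tweetsdir = os.path.join(project, "data", "tweets")
if not os.path.isdir(tweetsdir):
    os.makedirs(tweetsdir)
subprocess.call(["twarc-archive.py", search, tweetsdir])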
Running reportprofile.py on a tweet collection with the flag -o text will generate a summary profile of the collection, with some basic stats (number of tweets, retweets, users, etc.) and some possibly interesting sparklines:
Count: 25100
Users: 5779
User percentiles: █▂▁▁▁▁▁▁▁▁
[62, 12, 6, 5, 3, 2, 2, 2, 2, 2]
That indicates that the top 10% of users accounted for 62% of the tweets, while the bottom 10% accounted for only 2%. This gives a quick sense of whether the collection is dominated by a few voices or has broad participation. The profile also includes the top 10 users and top 10 shared urls, with similar sparklines.
Note: the sparklines are generated by pysparklines, using Unicode block characters. If they have an uneven baseline, it's the fault of the font. On a Mac, I find that Menlo Regular gives a good presentation in the terminal.
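For illustration, here is a minimal sketch of how such a decile breakdown and sparkline can be computed (the decile_shares helper is hypothetical; sparkify is pysparklines' own function):

import sparkline  # provided by pysparklines

def decile_shares(counts):
    # counts: number of tweets per user; returns the percentage of all
    # tweets contributed by each decile, most active decile first
    counts = sorted(counts, reverse=True)
    total = float(sum(counts))
    n = len(counts)
    return [int(round(100 * sum(counts[n * i // 10 : n * (i + 1) // 10]) / total))
            for i in range(10)]

shares = decile_shares([120, 80, 5, 3, 2, 1, 1, 1, 1, 1, 1, 1])
print(sparkline.sparkify(shares))  # e.g. █▂▁▁▁▁▁▁▁▁
print(shares)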
Some utilities to generate D3.js visualizations of aspects of a collection of tweets are provided. Use "--output=json" or "--output=csv" to output the data for use with other D3 examples, or "--help" for other options.
A directed graph of mentions or retweets, in which nodes are users and arrows point from the original user to the user who mentions or retweets them:
% d3graph.py --mode mentions projects/nasa > projects/nasa/nasa-directed-mentions.html
% d3graph.py --mode retweets projects/nasa > projects/nasa/nasa-directed-retweets.html
% d3graph.py --mode replies projects/nasa > projects/nasa/nasa-directed-replies.html
An undirected graph of co-occurring hashtags:
% d3cotag.py projects/nasa > projects/nasa/nasa-cotags.html
A threshold can be specified with "-t": hashtags whose number of occurrences falls below this will not be linked. Instead, if "-k" is set, they will be replaced with the pseudo-hashtag "-OTHER". Hashtags can be excluded with "-e" (takes a comma-delimited list). If the tweets were harvested by a search for a single hashtag then it's a good idea to exclude that tag, since every other tag will link to it.
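The thresholding amounts to collapsing rare tags before counting co-occurrence edges. A rough sketch of that logic (a hypothetical helper, not d3cotag.py's actual code; it assumes standard Twitter json with entities.hashtags):

from collections import Counter
from itertools import combinations

def cotag_edges(tweets, threshold=2, keep_other=False, exclude=()):
    # tweets: a list of tweet dicts; returns co-occurrence counts per tag pair
    exclude = set(t.lower() for t in exclude)
    counts = Counter()
    for tweet in tweets:
        tags = set(h["text"].lower() for h in tweet["entities"]["hashtags"])
        counts.update(tags - exclude)
    edges = Counter()
    for tweet in tweets:
        tags = set(h["text"].lower() for h in tweet["entities"]["hashtags"]) - exclude
        # drop below-threshold tags, or map them to "-OTHER" if keep_other is set
        kept = set(t if counts[t] >= threshold else "-OTHER" for t in tags
                   if keep_other or counts[t] >= threshold)
        for pair in combinations(sorted(kept), 2):
            edges[pair] += 1
    return edges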
A bar chart timeline with arbitrary intervals, here five minutes:
% d3times.py -a -t local -i 5M projects/nasa > projects/nasa/nasa-timebargraph.html
The output timezone is specified by "-t"; the interval is specified by "-i", using the standard abbreviations: seconds = S, minutes = M, hours = H, days = d, months = m, years = Y. The example above uses five-minute intervals. Output may be aggregated using "-a": each row has a time value and a count. Note that if you are generating the html example, you must use "-a".
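For illustration, a sketch of the underlying aggregation for "-a -i 5M" (a hypothetical helper, not d3times.py's actual code):

from collections import Counter
from dateutil import parser, tz

def timebars(tweets, minutes=5):
    # bucket tweet timestamps into five-minute intervals in the local timezone
    local = tz.tzlocal()
    buckets = Counter()
    for tweet in tweets:
        dt = parser.parse(tweet["created_at"]).astimezone(local)
        dt = dt.replace(minute=dt.minute - dt.minute % minutes,
                        second=0, microsecond=0)
        buckets[dt.strftime("%Y-%m-%d %H:%M")] += 1
    return sorted(buckets.items())  # each row: a time value and a count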
An animated wordcloud, in which words are added and removed according to changes in frequency over time:
% d3wordcloud.py -t local -i 1H projects/nasa > projects/nasa/nasa-wordcloud.html
The optional "-t" control timezone and "-i" controls interval, as in d3timebar.py
. Start and end
timestamps may be set with "-s" and "-e".
This script calls a fork of Jason Davies' d3-cloud project. The forked version attempts to keep the carried-over words in transitions close to their previous position.
The json and csv outputs can be used to view your data in D3 example visualizations with minimal fuss. There are many, many examples to be explored; Mike Bostock's Gallery is a good place to start. Here's a worked example, using Bostock's Zoomable Timeline Area Chart. It assumes no knowledge of D3.
First, look at the data input. In line 137 this example loads a csv file:
d3.csv("flights-departed.csv", function(data) {
The csv file looks like this:
date,value
1988-01-01,12681
...
We can easily generate a csv file that matches that format:
% ./d3times.py -a -i 1d -o csv projects/nasa
(I.e. aggregate, one-day interval, output csv). We then just need to edit the output to make the column headers match the original csv, i.e. change them to "date,value".
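One quick way to make that edit in place (a sketch; the filename is hypothetical):

import fileinput

for i, line in enumerate(fileinput.input("nasa-times.csv", inplace=True)):
    print("date,value" if i == 0 else line.rstrip("\n"))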
We also need to check the way the example loads scripts and css assets, especially the D3 library. In this case it expects a local copy:
<script type="text/javascript" src="https://github.com/pbinkley/twarc-report/raw/master/d3/d3.js"></script>
<script type="text/javascript" src="https://github.com/pbinkley/twarc-report/raw/master/d3/d3.csv.js"></script>
<script type="text/javascript" src="https://github.com/pbinkley/twarc-report/raw/master/d3/d3.time.js"></script>
<link type="text/css" rel="stylesheet" href="https://github.com/pbinkley/twarc-report/blob/master/style.css"/>
Either change those links to point to the original location, or save a local copy. (Note that if you're going to put your example online you'll want local copies of scripts, since the same-origin policy will prevent them from being loaded from the source).
Once you've matched your data to the example and made sure it can load the D3.js library, the example may work. In this case it doesn't - it shows an empty chart. The title "U.S. Commercial Flights, 1999-2001" and the horizontal scale explain why: it expects dates within a certain (pre-Twitter) range, and the x domain is hard-coded accordingly. The setting is easy to find, in line 146:
x.domain([new Date(1999, 0, 1), new Date(2003, 0, 0)]);
Change those dates to include the date range of your data, and the example should work. Don't worry about matching your dates closely: the chart is zoomable, after all. Alternatively, you could borrow a snippet from the template timebar.html to set the domain to match the earliest and latest dates in your data:
x.domain([
d3.min(values, function(d) {return d.name}),
d3.max(values, function(d) {return d.name})
]);
A typical twarc harvest gets you a few days' worth of tweets, so the day-level display of this example probably isn't very interesting. We're not bound by the time format of the example, however. We can see it in line 63:
parse = d3.time.format("%Y-%m-%d").parse,
We can change that to parse at the minute interval: "%Y-%m-%d %H:%M", and generate our csv at the same interval with "-i 1M". With those changes we can zoom in until bars represent a minute's worth of tweets.
This example doesn't work perfectly: I see some odd artifacts around the bottom of the chart, as if the baseline were slightly above the x axis and small values are presented as negative. And it doesn't render in Chrome at all (Firefox and Safari are fine). The example is from 2011 and uses an older version of the D3 library, and with some tinkering it could probably be updated and made functional. It serves to demonstrate, though, that only small changes and no knowledge of the complexities of D3 are needed to fit your data into an existing D3 example.
The heart of twarc-report is the Profiler class in profiler.py. The scripts pass json records from the twarc harvests to this class, and it tabulates some basic properties: number of tweets and authors, earliest and latest timestamp, etc. Each script creates its own profiler, which inherits from this class and processes the extra fields etc. needed by the script. To add a new script, start by working out its profiler class: collect the data it needs from each tweet in the process() method, and organize the output in the report() method.
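For example, a new script's profiler might look roughly like this (hypothetical; only the Profiler base class and the process()/report() pattern come from profiler.py):

from collections import Counter
from profiler import Profiler

class HashtagProfiler(Profiler):
    def __init__(self, *args, **kwargs):
        # pass through whatever arguments the base class expects
        Profiler.__init__(self, *args, **kwargs)
        self.hashtags = Counter()

    def process(self, tweet):
        # let the base class tabulate counts, timestamps, etc.
        Profiler.process(self, tweet)
        for tag in tweet["entities"]["hashtags"]:
            self.hashtags[tag["text"].lower()] += 1

    def report(self):
        data = Profiler.report(self)
        data["hashtags"] = self.hashtags.most_common(10)
        return data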
The various output formats are generated by functions in d3output.py.
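As a rough illustration of the kind of helper involved (not d3output.py's actual API), a csv writer for the aggregated rows shown earlier might be:

import csv, sys

def write_timebars_csv(rows, out=sys.stdout):
    # rows: (date, value) pairs, as in the "date,value" csv above
    writer = csv.writer(out)
    writer.writerow(["date", "value"])
    writer.writerows(rows)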