A library and a collection of scripts used to retrieve data from the Github API
and extract metadata into an SQL database, in a modular and scalable manner. The
scripts are distributed as a Gem (ghtorrent), but they can also be run by
checking out this repository.
GHTorrent can be used for a variety of purposes, such as:
GHTorrent's components (which can be used individually) are:
The Persister and GHTorrent components have configurable back ends:
Persister: MongoDB (mongo driver) or no persister (noop driver)
For distributed mirroring you also need RabbitMQ >= 3.3.
GHTorrent is written in Ruby (tested with Ruby > 2.0). To install it as a Gem do:
sudo gem install ghtorrent
Depending on which SQL database you want to use, install the appropriate dependency gem.
sudo gem install mysql2 # or sqlite3
Copy config.yaml.tmpl to a file in your home directory.
All provided scripts accept the -c option, which takes the location of the
configuration file as a parameter.
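As an illustration of the -c convention, here is a minimal sketch of how a script could read that option and load the YAML file it points to. It is not GHTorrent's actual option handling: the `load_config` helper and the `--config` long form are assumptions made for this example, and only Ruby's standard library is used.

```ruby
require 'optparse'
require 'yaml'

# Hypothetical helper (not part of GHTorrent): reads the -c option from the
# given argument list and returns the parsed YAML configuration as a Hash.
def load_config(argv)
  path = nil
  parser = OptionParser.new do |opts|
    # The --config long form is an assumption; the README only documents -c.
    opts.on('-c', '--config FILE', 'Location of the configuration file') do |file|
      path = file
    end
  end
  parser.parse(argv) # non-destructive: argv itself is left untouched
  raise ArgumentError, 'missing -c option' if path.nil?

  # expand_path resolves a leading ~ to the home directory
  YAML.load_file(File.expand_path(path))
end
```

A call such as `load_config(%w[-c ~/config.yaml])` would return the settings from a copy of config.yaml.tmpl saved under that name.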
You can find more information on how to set up a mirroring cluster of machines to retrieve data in parallel on the Wiki.
To mirror the event stream and capture all data:
ght-mirror-events.rb periodically polls Github's event queue
(https://api.github.com/events), stores all new events in the configured
persister, and posts them to the github exchange in RabbitMQ.
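The polling step can be sketched roughly as follows. This is an illustrative sketch, not the code of ght-mirror-events.rb: the `EventMirror` class and `fetch_events` helper are hypothetical names, and two in-memory arrays stand in for the persister and the RabbitMQ exchange. It relies only on the `id` and `type` fields that Github's events API returns for each event.

```ruby
require 'json'
require 'net/http'
require 'set'

EVENTS_URL = URI('https://api.github.com/events')

# Fetches one page of public events and returns it as an array of hashes.
# Returns an empty array on any non-2xx response.
def fetch_events(url = EVENTS_URL)
  response = Net::HTTP.get_response(url)
  return [] unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end

# Hypothetical mirror: stores and publishes only events it has not seen yet.
class EventMirror
  attr_reader :stored, :published

  def initialize(fetcher = method(:fetch_events))
    @fetcher = fetcher
    @seen = Set.new
    @stored = []     # stand-in for the configured persister
    @published = []  # stand-in for the github exchange in RabbitMQ
  end

  # Runs one polling pass and returns the number of new events.
  def poll_once
    # Set#add? returns nil for ids already seen, so repeats are filtered out.
    fresh = @fetcher.call.select { |event| @seen.add?(event['id']) }
    fresh.each do |event|
      @stored << event
      # Publish with the event type as the routing key (an assumption).
      @published << [event['type'], JSON.generate(event)]
    end
    fresh.size
  end
end
```

Calling `poll_once` in a loop with a pause between passes gives the periodic behaviour; consecutive polls of the event queue overlap, which is why already seen ids are skipped.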
ght-data-retrieval.rb creates queues that route posted events to processor
functions. The functions use the appropriate Github API call to retrieve the
linked contents, extract metadata (for database storage), and store the
retrieved data in the appropriate collection in the persister, to avoid
duplicate API calls.
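The routing and caching idea described above can be sketched as follows. All names here (`MemoryPersister`, `Retriever`, `PROCESSORS`, `route`) are hypothetical and chosen for illustration; the real script routes events through RabbitMQ queues and stores documents in the configured persister, while this sketch uses a plain hash for both.

```ruby
# Stand-in persister: collection name => { key => document }.
class MemoryPersister
  def initialize
    @collections = Hash.new { |hash, name| hash[name] = {} }
  end

  def find(collection, key)
    @collections[collection][key]
  end

  def store(collection, key, doc)
    @collections[collection][key] = doc
  end
end

# Looks in the persister first and calls the API only on a miss.
class Retriever
  attr_reader :api_calls

  # api is any callable taking (collection, key) and returning a document.
  def initialize(persister, api)
    @persister = persister
    @api = api
    @api_calls = 0
  end

  def retrieve(collection, key)
    cached = @persister.find(collection, key)
    return cached if cached

    @api_calls += 1
    @persister.store(collection, key, @api.call(collection, key))
  end
end

# Processor functions keyed by event type (two examples only).
PROCESSORS = {
  'PushEvent' => ->(retriever, event) { retriever.retrieve(:repos, event.dig('repo', 'name')) },
  'ForkEvent' => ->(retriever, event) { retriever.retrieve(:users, event.dig('actor', 'login')) }
}.freeze

# Hands the event to the processor for its type; unknown types return nil.
def route(retriever, event)
  processor = PROCESSORS[event['type']]
  processor&.call(retriever, event)
end
```

Routing two push events for the same repository through one `Retriever` triggers a single API call; the second lookup is answered from the stored copy.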
Data in the SQL database contain pointers (the ext_ref_id field) to the
"raw" data in the persister.
To retrieve data for a repository or user:
ght-retrieve-repo retrieves all data for a specific repository
ght-retrieve-user retrieves all data for a specific user
To perform maintenance:
ght-load loads selected events from the persister to the queue in order for the ght-data-retrieval script to reprocess them
The code in this repository is used to power the data collection process of the GHTorrent.org project. You can find all data collected by the project on the Downloads page.
There are two sets of data:
The ght-data-retrieval crawler starts from an event and goes deep into the rabbit hole.
Please tell us about features you'd like or bugs you've discovered on our Issue Tracker.
Patches, bug fixes, etc. are welcome. Please fork the repository and create a pull request once your fix or new feature is ready.
If you find GHTorrent and the accompanying datasets useful in your research, please consider citing the following paper:
Georgios Gousios and Diomidis Spinellis, "GHTorrent: GitHub’s data from a firehose," in MSR '12: Proceedings of the 9th Working Conference on Mining Software Repositories, June 2–3, 2012. Zurich, Switzerland.