gousiosg / github-mirror

Scripts to mirror Github in a cloudy fashion
BSD 2-Clause "Simplified" License

ghtorrent: Mirror and index data from the GitHub API

A library and a collection of scripts that retrieve data from the GitHub API and extract metadata into an SQL database in a modular and scalable manner. The scripts are distributed as a gem (ghtorrent), but they can also be run from a checkout of this repository.

GHTorrent can be used for a variety of purposes, such as:

Components

GHTorrent's components (which can be used individually) are:

Component Configuration

The Persister and GHTorrent components have configurable back ends:

For distributed mirroring you also need RabbitMQ >= 3.3.

Installation

1. Install GHTorrent

GHTorrent is written in Ruby (tested with Ruby > 2.0). To install it as a gem, run:

sudo gem install ghtorrent

2. Install Your Preferred Database

Depending on which SQL database you want to use, install the appropriate dependency gem.

sudo gem install mysql2 # or sqlite3
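
After installing a driver gem, it can be useful to confirm that RubyGems actually sees it. The following is a minimal sketch and not part of GHTorrent: the helper name installed_drivers is hypothetical, and the candidate list simply mirrors the two gems named above.

```ruby
# Hypothetical helper (not part of GHTorrent): list which of the candidate
# SQL driver gems are installed for the current Ruby.
def installed_drivers(candidates = %w[mysql2 sqlite3])
  candidates.select do |name|
    # find_all_by_name returns an empty array when no version is installed.
    !Gem::Specification.find_all_by_name(name).empty?
  end
end

found = installed_drivers
if found.empty?
  puts 'No SQL driver gem found; install mysql2 or sqlite3 first.'
else
  puts "Installed SQL driver gems: #{found.join(', ')}"
end
```

If the list comes back empty even though the gem was installed, the gem was probably installed for a different Ruby than the one running the check.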

Configuration

Copy config.yaml.tmpl to a file in your home directory.

All provided scripts accept the -c option, which takes the location of the configuration file as its argument.
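
To illustrate the general pattern of a script that takes its configuration file through -c, here is a minimal sketch using Ruby's standard OptionParser and YAML libraries. It is not GHTorrent's own option handling: the helper name load_config and the long form --config are assumptions made for the example, and the configuration keys are whatever your copy of config.yaml.tmpl contains.

```ruby
require 'optparse'
require 'yaml'

# Hypothetical helper (not GHTorrent's implementation): read the path given
# with -c and return the expanded path together with the parsed settings.
def load_config(argv)
  path = nil
  parser = OptionParser.new do |opts|
    opts.on('-c', '--config FILE', 'Path to the configuration file') do |file|
      path = File.expand_path(file)
    end
  end
  # Parse a copy so the caller's argument list is left untouched.
  parser.parse!(argv.dup)
  raise ArgumentError, 'missing -c FILE' if path.nil?

  [path, YAML.load_file(path)]
end
```

A script would typically call load_config(ARGV) once at startup and pass the returned settings to the rest of its code, so that every script reads the same file in the same way.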

You can find more information on how to set up a cluster of machines that retrieves data in parallel on the Wiki.

Using GHTorrent

To mirror the event stream and capture all data:

To retrieve data for a repository or user:

To perform maintenance:

Data

The code in this repository powers the data collection process of the GHTorrent.org project. You can find all data collected by the project on the Downloads page.

There are two sets of data:

Bugs & Feature Requests

Please tell us about features you'd like or bugs you've discovered on our Issue Tracker.

Patches, bug fixes, etc. are welcome. Please fork the repository and create a pull request once your fix or new feature is ready.

Citing GHTorrent in your Research

If you find GHTorrent and the accompanying datasets useful in your research, please consider citing the following paper:

Georgios Gousios and Diomidis Spinellis, "GHTorrent: GitHub’s data from a firehose," in MSR '12: Proceedings of the 9th Working Conference on Mining Software Repositories, June 2–3, 2012, Zurich, Switzerland.

Authors

License

2-clause BSD