The International Working Conference on Mining Software Repositories (MSR) has hosted a mining challenge since 2006. With this challenge we call upon everyone interested to apply their tools to bring research and industry closer together by analyzing a common data set. The challenge is for researchers and practitioners who bravely put their mining tools and approaches on a dare.
In 2014, the challenge was on the GitHub data. We provide the data for the GitHub repository and you should use your brain, tools, computational power, and magic to uncover interesting findings related to it.
Download this for the openscience.us

The dataset contains 90 GitHub projects and their folks that are not randomly selected and thus not representative of GitHub.
Versions
After the initial release of the dataset, the users found errors and missing features. The list of versions along with the fixes is presented in the table below. Only the latest version is offered for download.
You are advised to always run queries against the newest version. If you have already downloaded an older version and the described fix does not affect your experiment, you could skip the update.
The MSR 2014 challenge dataset is a (very) trimmed down version of the original GHTorrent dataset. It includes data from the top-10 starred software projects for the top programming languages on Github, which gives 90 projects and their forks. For each project, we retrieved all data including issues, pull requests organizations, followers, stars and labels (milestones and events not included). The dataset was constructed from scratch to ensure the latest information is in it.
Similarly to GHTorrent itself, the MSR challenge dataset comes in two flavours:
A MongoDB database dump containing the results of querying the Github API. See format here.
A MySQL database dump containing a queriable version of important fields extracted from the raw data. See schema here.
The included projects are the following:
Importing and using
The following instructions assume an OSX or Linux based host.
$ wget http://ghtorrent.org/downloads/msr14-mysql.gz
$ mysql -u root -p
mysql > create user 'msr14'@'localhost' identified by 'msr14';
mysql> create database msr14;
mysql> GRANT ALL PRIVILEGES ON msr14.* to msr14@'localhost';
mysql> flush privileges;
# Exit MySQL prompt
$ zcat msr14-mysql.gz |mysql -u msr14 -p msr14
$ mysql -u msr14 -p msr14
mysql> select language,count(*) from projects where forked_from is null group by language;
+------------+----------+
| language | count(*) |
+------------+----------+
| C | 10 |
| C# | 8 |
| C++ | 8 |
| CSS | 3 |
| Go | 1 |
| Java | 8 |
| JavaScript | 9 |
| PHP | 9 |
| Python | 10 |
| R | 4 |
| Ruby | 10 |
| Scala | 9 |
| TypeScript | 1 |
+------------+----------+
13 rows in set (0.01 sec)
FAQ
Answers to frequently asked questions
Why a new dataset?
For practical reasons. The dataset is small enough to be used on a laptop, yet rich enough to do really interesting research with it.
The MSR folks report that they have succesfully used these dumps into a 2011 MacBookAir with 4GB of RAM. Mileage may vary with the MySQL data dump, the hardware requirements are low.
Challenge Data
This year, the focus of the challenge is the GitHub data. GitHub is a web-based service providing a collaborative software development environment and a social network for developers. We provide you with the dataset extracted from GHTorrent by Georgios Gousios.
When you use the data provided by the MSR 2014 Challenge, we ask you to cite it as following:
@inproceedings{Gousi13,
author = {Gousios, Georgios},
title = {The GHTorrent dataset and tool suite},
booktitle = {Proceedings of the 10th Working Conference on Mining Software Repositories},
series = {MSR'13},
year = {2013},
isbn = {978-1-4673-2936-1},
location = {San Francisco, CA, USA},
pages = {233--236},
numpages = {4},
url = {http://dl.acm.org/citation.cfm?id=2487085.2487132}
}
Acknowledgments
We would like to thank Georgios Gousios from the Delft University of Technology for providing GHTorrent data.
A user has a unique user name or email. May contain artificially
generated user names, see commits below.
There are two types of users, USERs and ORGanizations. The
second type defines a collection of users.
organization_members
Users that are members of an organization.
The created_at field is only filled in accurately for memberships
for which GHTorrent has recorded a corresponding event. Otherwise,
it is filled in with the latest date that the corresponding user or
organization has been created.
projects
Information about repositories. A repository is always owned by a user.
The forked_from field is empty unless the project is a fork in
which case it contains the id of the project the project is forked
from.
The deleted field means that the project has been deleted from
Github.
project_members
Users that have commit access to the repository.
The created_at field is only filled in accurately for memberships for
which GHTorrent has recorded a corresponding event. Otherwise, it is
filled in with the latest date that the corresponding user or project
has been created.
commits
Unique commits.
Each commit is identified globally through its sha field. If the
author or the committer has not configured his Github email
address,
no resolution to a User enty is possible. In that case, GHTorrent
generates artificial users using the provided email in the Git
commit author or committer fields. If the user then configures his
Github account, GHTorrent will update the artificial user
accordingly.
The project_id field contains a link to the project that this
commit has been first associated with. This might not be the project
this commit was initially pushed to, e.g. in case the fork is
processed before the parent. See
project_commits.
The project_id field may be null when the repository has been
deleted at the time the commit is processed. This situation might
happen when retrospectively processing pull requests for a
repository and the repository which the pull request originates from
has been deleted.
commit_parents
The parent commit(s) for each commit, as specified by Git.
project_commits
The commits belonging to the history of a project.
More than one projects can share the same commits if one is a fork of
the other.
commit_comments
Code review comments on commits.
These are comments on individual commits. If a commit is associated with
a pull request, then its comments are in the
pull_request_comments table.
followers
A follower to a user.
The created_at field is only filled in accurately for followships for
which GHTorrent has recorded a corresponding event. Otherwise, it is
filled in with the latest date that the corresponding user or follower
has been created.
The created_at field is only filled in accurately for starrings for
which GHTorrent has recorded a corresponding event. Otherwise, it is
filled in with the latest date that the corresponding user or project
has been created.
pull_requests
A pull request initiated from head_repo_id:head_commit_id to
base_repo_id:base_commit_id
Pull requests can be in various states. The states and their
transitions are recorded in the
pull_request_history
table.
The pullreq_id field is Github's pull request unique identifier
The intra_branch field signifies that the head and base
repositories are the same
If the head repository is NULL, this means that the corresponding
project had been deleted when GHTorrent processed the pull request.
pull_request_history
An event in the pull request lifetime
The action field can take the following values
opened: When the pull request has been opened
closed: When the pull request has been closed
merged: When Github detected that the pull request has been
merged. No merges outside Github (i.e. Git based) are reported
reoponed: When a pull request is opened after being closed
syncrhonize: When new commits are added/removed to the head
repository
pull_request_commits
A commit associated with a pull request
The list is additive. This means if a rebase with commit squashing takes
place after the commits of a pull request have been processed, the old
commits will not be deleted.
pull_request_comments
A code review comment on a commit associated with a pull request
The list is additive. If commits are squashed on the head repo, the
comments remain intact.
issues
An issue associated with a repository
The assignee field is filed in with the user to which the issue
was assigned at the time the issue was processed.
Issues have history recorded in the
issue_events table.
For every pull request, GHTorrent creates a corresponding issue. The
pull_request_id field points to the associated pull request
The issue_id field is the unique identifier given to the issue by
Github.
issue_events
An event on an issue
The action field can have the following values:
subscribed: When a user subscribes to receive notifications
about the issue.
mentioned: When a user is mentioned by another user (@user
notation)
closed: When the issue has been closed
referenced: The issue was referenced in a commit (using the
fixes:
conventions)
assigned: When the issue has been assigned to an actor.
reopened: When a closed issue is reopened
unsubscribed: When a user unsubscribed from issue.
merged: When the pull request pointed by the issue has been
merged.
head_ref_cleaned: (Not documented) ?
head_ref_deleted: (Not documented) When the branch of the head
repository has been deleted
head_ref_restored: (Not documented) When the head repository
of a pull request has been restored (using the restore branch
functionality).
The action_specific field gets filled in with the commit\_id of
the last commit when a pull request has been closed, merged or
referenced.
issue_comments
An entry to the issue discussion. This table is always filled in with
pull request (or issue) discussion comments, irrespective of whether the
repo has issues enabled or not.
repo_labels
A label to be assigned to an issue affecting this repository.
issue_labels
A label that has been assigned to an issue
Example queries
List commits for a repository
select c.*
from commits c, project_commits pc, projects p, users u
where u.login = 'rails'
and p.name = 'rails'
and p.id = pc.project_id
and c.id = pc.commit_id
order by c.created_at desc
Get all actions for a pull request
select user, action, created_at from
(
select prh.action as action, prh.created_at as created_at, u.login as user
from pull_request_history prh, users u
where prh.pull_request_id = ?
and prh.actor_id = u.id
union
select ie.action as action, ie.created_at as created_at, u.login as user
from issues i, issue_events ie, users u
where ie.issue_id = i.id
and i.pull_request_id = ?
and ie.actor_id = u.id
union
select 'discussed' as action, ic.created_at as created_at, u.login as user
from issues i, issue_comments ic, users u
where ic.issue_id = i.id
and u.id = ic.user_id
and i.pull_request_id = ?
union
select 'reviewed' as action, prc.created_at as created_at, u.login as user
from pull_request_comments prc, users u
where prc.user_id = u.id
and prc.pull_request_id = ?
) as actions
order by created_at;
Get participants in an issue or pull request
select distinct(user_id) from
(
select user_id
from pull_request_comments
where pull_request_id = ?
union
select user_id
from issue_comments ic, issues i
where i.id = ic.issue_id and i.pull_request_id = ?
) as participants
data set: 106MB
MSR'14 Mining Challenge
The International Working Conference on Mining Software Repositories (MSR) has hosted a mining challenge since 2006. With this challenge we call upon everyone interested to apply their tools to bring research and industry closer together by analyzing a common data set. The challenge is for researchers and practitioners who bravely put their mining tools and approaches on a dare.
In 2014, the challenge was on the GitHub data. We provide the data for the GitHub repository and you should use your brain, tools, computational power, and magic to uncover interesting findings related to it.
URL
Download the data (last updated on October 10, 2013 to include commit comments), available in two forms - MySQL dump (106 MB) from http://ghtorrent.org/downloads/msr14-mysql.gz.
Download this for the openscience.us  The dataset contains 90 GitHub projects and their folks that are not randomly selected and thus not representative of GitHub.
Versions After the initial release of the dataset, the users found errors and missing features. The list of versions along with the fixes is presented in the table below. Only the latest version is offered for download.
You are advised to always run queries against the newest version. If you have already downloaded an older version and the described fix does not affect your experiment, you could skip the update.
The MSR 2014 challenge dataset is a (very) trimmed down version of the original GHTorrent dataset. It includes data from the top-10 starred software projects for the top programming languages on Github, which gives 90 projects and their forks. For each project, we retrieved all data including issues, pull requests organizations, followers, stars and labels (milestones and events not included). The dataset was constructed from scratch to ensure the latest information is in it.
Similarly to GHTorrent itself, the MSR challenge dataset comes in two flavours:
A MongoDB database dump containing the results of querying the Github API. See format here. A MySQL database dump containing a queriable version of important fields extracted from the raw data. See schema here. The included projects are the following:
Importing and using The following instructions assume an OSX or Linux based host.
FAQ Answers to frequently asked questions
Why a new dataset?
For practical reasons. The dataset is small enough to be used on a laptop, yet rich enough to do really interesting research with it.
The MSR folks report that they have succesfully used these dumps into a 2011 MacBookAir with 4GB of RAM. Mileage may vary with the MySQL data dump, the hardware requirements are low.
Challenge Data
This year, the focus of the challenge is the GitHub data. GitHub is a web-based service providing a collaborative software development environment and a social network for developers. We provide you with the dataset extracted from GHTorrent by Georgios Gousios.
When you use the data provided by the MSR 2014 Challenge, we ask you to cite it as following:
Acknowledgments We would like to thank Georgios Gousios from the Delft University of Technology for providing GHTorrent data.
The relational DB schema
Download Download PDF
Entities and their relationships
users
Github users.
type
s of users,USER
s andORG
anizations. The second type defines a collection of users.organization_members
Users that are members of an organization.
created_at
field is only filled in accurately for memberships for which GHTorrent has recorded a corresponding event. Otherwise, it is filled in with the latest date that the corresponding user or organization has been created.projects
Information about repositories. A repository is always owned by a user.
forked_from
field is empty unless the project is a fork in which case it contains theid
of the project the project is forked from.deleted
field means that the project has been deleted from Github.project_members
Users that have commit access to the repository.
The
created_at
field is only filled in accurately for memberships for which GHTorrent has recorded a corresponding event. Otherwise, it is filled in with the latest date that the corresponding user or project has been created.commits
Unique commits.
sha
field. If the author or the committer has not configured his Github email address, no resolution to a User enty is possible. In that case, GHTorrent generates artificial users using the provided email in the Git commit author or committer fields. If the user then configures his Github account, GHTorrent will update the artificial user accordingly.project_id
field contains a link to the project that this commit has been first associated with. This might not be the project this commit was initially pushed to, e.g. in case the fork is processed before the parent. See project_commits.project_id
field may be null when the repository has been deleted at the time the commit is processed. This situation might happen when retrospectively processing pull requests for a repository and the repository which the pull request originates from has been deleted.commit_parents
The parent commit(s) for each commit, as specified by Git.
project_commits
The commits belonging to the history of a project.
More than one projects can share the same commits if one is a fork of the other.
commit_comments
Code review comments on commits.
These are comments on individual commits. If a commit is associated with a pull request, then its comments are in the pull_request_comments table.
followers
A follower to a user.
The
created_at
field is only filled in accurately for followships for which GHTorrent has recorded a corresponding event. Otherwise, it is filled in with the latest date that the corresponding user or follower has been created.watchers
Users that have starred (was watched) a project
The
created_at
field is only filled in accurately for starrings for which GHTorrent has recorded a corresponding event. Otherwise, it is filled in with the latest date that the corresponding user or project has been created.pull_requests
A pull request initiated from
head_repo_id
:head_commit_id
tobase_repo_id
:base_commit_id
pullreq_id
field is Github's pull request unique identifierintra_branch
field signifies that the head and base repositories are the samepull_request_history
An event in the pull request lifetime
The
action
field can take the following valuesopened
: When the pull request has been openedclosed
: When the pull request has been closedmerged
: When Github detected that the pull request has been merged. No merges outside Github (i.e. Git based) are reportedreoponed
: When a pull request is opened after being closedsyncrhonize
: When new commits are added/removed to the head repositorypull_request_commits
A commit associated with a pull request
The list is additive. This means if a rebase with commit squashing takes place after the commits of a pull request have been processed, the old commits will not be deleted.
pull_request_comments
A code review comment on a commit associated with a pull request
The list is additive. If commits are squashed on the head repo, the comments remain intact.
issues
An issue associated with a repository
assignee
field is filed in with the user to which the issue was assigned at the time the issue was processed.pull_request_id
field points to the associated pull requestissue_id
field is the unique identifier given to the issue by Github.issue_events
An event on an issue
action
field can have the following values:subscribed
: When a user subscribes to receive notifications about the issue.mentioned
: When a user is mentioned by another user (@user notation)closed
: When the issue has been closedreferenced
: The issue was referenced in a commit (using the fixes: conventions)assigned
: When the issue has been assigned to an actor.reopened
: When a closed issue is reopenedunsubscribed
: When a user unsubscribed from issue.merged
: When the pull request pointed by the issue has been merged.head_ref_cleaned
: (Not documented) ?head_ref_deleted
: (Not documented) When the branch of the head repository has been deletedhead_ref_restored
: (Not documented) When the head repository of a pull request has been restored (using the restore branch functionality).action_specific
field gets filled in with thecommit\_id
of the last commit when a pull request has been closed, merged or referenced.issue_comments
An entry to the issue discussion. This table is always filled in with pull request (or issue) discussion comments, irrespective of whether the repo has issues enabled or not.
repo_labels
A label to be assigned to an issue affecting this repository.
issue_labels
A label that has been assigned to an issue
Example queries
List commits for a repository
Get all actions for a pull request
Get participants in an issue or pull request