opensciences / opensciences.github.io

Website for OpenScience -
http://openscience.us

MSR'14 data challenge #99

Closed timm closed 9 years ago

timm commented 9 years ago

data set: 106MB

MSR'14 Mining Challenge

The International Working Conference on Mining Software Repositories (MSR) has hosted a mining challenge since 2006. With this challenge we call upon everyone interested to apply their tools to bring research and industry closer together by analyzing a common data set. The challenge is for researchers and practitioners who bravely put their mining tools and approaches on a dare.

In 2014, the challenge was on the GitHub data. We provide the data for the GitHub repository and you should use your brain, tools, computational power, and magic to uncover interesting findings related to it.

URL

Download the data (last updated on October 10, 2013 to include commit comments). It is available in two forms; the MySQL dump (106 MB) is at http://ghtorrent.org/downloads/msr14-mysql.gz.

Download this for the openscience.us schema.

The dataset contains 90 GitHub projects and their forks. The projects were not randomly selected, so the dataset is not representative of GitHub.

Versions

After the initial release of the dataset, users found errors and missing features. The list of versions along with the fixes is presented in the table below. Only the latest version is offered for download.

You are advised to always run queries against the newest version. If you have already downloaded an older version and the described fix does not affect your experiment, you could skip the update.

The MSR 2014 challenge dataset is a (very) trimmed down version of the original GHTorrent dataset. It includes data from the top-10 starred software projects for the top programming languages on Github, which gives 90 projects and their forks. For each project, we retrieved all data including issues, pull requests, organizations, followers, stars and labels (milestones and events not included). The dataset was constructed from scratch to ensure the latest information is in it.

Similarly to GHTorrent itself, the MSR challenge dataset comes in two flavours:

A MongoDB database dump containing the results of querying the Github API. See format here.

A MySQL database dump containing a queryable version of important fields extracted from the raw data. See schema here.

The included projects are the following:

akka/akka devtools/hadley ProjectTemplate/johnmyleswhite stat-cookbook/mavam 
hiphop-php/facebook knitr/yihui shiny/rstudio folly/facebook 
mongo/mongodb doom3.gpl/TTimo phantomjs/ariya TrinityCore/TrinityCore 
MaNGOS/mangos bitcoin/bitcoin mosh/keithw xbmc/xbmc http-parser/joyent beanstalkd/kr 
redis/antirez ccv/liuliu memcached/memcached openFrameworks/openframeworks libgit2/libgit2 
redcarpet/vmg libuv/joyent SignalR/SignalR SparkleShare/hbons plupload/moxiecode 
mono/mono Nancy/NancyFx ServiceStack/ServiceStack AutoMapper/AutoMapper 
RestSharp/restsharp ravendb/ravendb MiniProfiler/SamSaffron storm/nathanmarz 
elasticsearch/elasticsearch 
ActionBarSherlock/JakeWharton facebook-android-sdk/facebook clojure/clojure 
CraftBukkit/Bukkit netty/netty android/github node/joyent jquery/jquery html5-boilerplate/h5bp impress.js/bartaz d3/mbostock chosen/harvesthq 
Font-Awesome/FortAwesome three.js/mrdoob foundation/zurb symfony/symfony
 CodeIgniter/EllisLab php-sdk/facebook zf2/zendframework cakephp/cakephp ThinkUp/ginatrapani phpunit/sebastianbergmann Slim/codeguy django/django tornado/facebook 
httpie/jkbr flask/mitsuhiko requests/kennethreitz symfony/xphere-forks 
reddit/reddit boto/boto django-debug-toolbar/django-debug-toolbar 
Sick-Beard/midgetspy django-cms/divio rails/rails homebrew/mxcl jekyll/mojombo gitlabhq/gitlabhq 
diaspora/diaspora devise/plataformatec blueprint-css/joshuaclayton 
octopress/imathis vinc.cc/vinc paperclip/thoughtbot compass/chriseppstein 
finagle/twitter kestrel/robey flockdb/twitter gizzard/twitter sbt/sbt scala/scala 
scalatra/scalatra zipkin/twitter

Importing and using

The following instructions assume an OSX or Linux based host.

$ wget http://ghtorrent.org/downloads/msr14-mysql.gz
$ mysql -u root -p
mysql> create user 'msr14'@'localhost' identified by 'msr14';
mysql> create database msr14;
mysql> GRANT ALL PRIVILEGES ON msr14.* to msr14@'localhost';
mysql> flush privileges;
# Exit MySQL prompt
$ gunzip -c msr14-mysql.gz | mysql -u msr14 -p msr14
$ mysql -u msr14 -p msr14
mysql> select language,count(*) from projects where forked_from is null group by language;
+------------+----------+
| language   | count(*) |
+------------+----------+
| C          |       10 |
| C#         |        8 |
| C++        |        8 |
| CSS        |        3 |
| Go         |        1 |
| Java       |        8 |
| JavaScript |        9 |
| PHP        |        9 |
| Python     |       10 |
| R          |        4 |
| Ruby       |       10 |
| Scala      |        9 |
| TypeScript |        1 |
+------------+----------+
13 rows in set (0.01 sec)
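
The same per-language count can be scripted. The following is a minimal sketch, not part of the original instructions: it assumes Python's standard-library sqlite3 module as a stand-in for MySQL and uses a few made-up rows, so only the table name projects and the columns language and forked_from come from the query above. The function name is hypothetical.

```python
import sqlite3


def count_root_projects_by_language(conn):
    """Count non-fork projects per language, mirroring the query above."""
    rows = conn.execute(
        "select language, count(*) from projects "
        "where forked_from is null group by language order by language"
    ).fetchall()
    return dict(rows)


# Toy stand-in for the real dump (assumption): three root projects and one fork.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table projects "
    "(id integer primary key, name text, language text, forked_from integer)"
)
conn.executemany(
    "insert into projects values (?, ?, ?, ?)",
    [
        (1, "rails", "Ruby", None),
        (2, "jekyll", "Ruby", None),
        (3, "django", "Python", None),
        (4, "rails", "Ruby", 1),  # a fork: excluded by "forked_from is null"
    ],
)

language_counts = count_root_projects_by_language(conn)
print(language_counts)
```

Against the real dump, the same SQL string would be sent through a MySQL client library instead of sqlite3; the filter on forked_from is what keeps the total at the 90 root projects.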

FAQ

Answers to frequently asked questions

Why a new dataset?

For practical reasons. The dataset is small enough to be used on a laptop, yet rich enough to do really interesting research with it.

The MSR folks report that they have successfully loaded these dumps onto a 2011 MacBook Air with 4 GB of RAM. Mileage may vary; with the MySQL data dump, the hardware requirements are low.

Challenge Data

This year, the focus of the challenge is the GitHub data. GitHub is a web-based service providing a collaborative software development environment and a social network for developers. We provide you with the dataset extracted from GHTorrent by Georgios Gousios.

When you use the data provided by the MSR 2014 Challenge, we ask you to cite it as follows:

@inproceedings{Gousi13,
  author = {Gousios, Georgios},
  title = {The GHTorrent dataset and tool suite},
  booktitle = {Proceedings of the 10th Working Conference on Mining Software Repositories},
  series = {MSR'13},
  year = {2013},
  isbn = {978-1-4673-2936-1},
  location = {San Francisco, CA, USA},
  pages = {233--236},
  numpages = {4},
  url = {http://dl.acm.org/citation.cfm?id=2487085.2487132}
} 

Acknowledgments

We would like to thank Georgios Gousios from the Delft University of Technology for providing GHTorrent data.

The relational DB schema


Entities and their relationships

users

Github users.

organization_members

Users that are members of an organization.

projects

Information about repositories. A repository is always owned by a user.

project_members

Users that have commit access to the repository.

The created_at field is only filled in accurately for memberships for which GHTorrent has recorded a corresponding event. Otherwise, it is filled in with the latest date that the corresponding user or project has been created.

commits

Unique commits.

commit_parents

The parent commit(s) for each commit, as specified by Git.

project_commits

The commits belonging to the history of a project.

More than one project can share the same commits if one is a fork of the other.
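
To illustrate how a fork and its parent share rows in project_commits, here is a minimal sketch. It assumes Python's sqlite3 module in place of MySQL, a toy schema reduced to the columns used in the queries on this page (project_id, commit_id, forked_from), and made-up ids. The helper name shared_commit_ids is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table projects (id integer primary key, name text, forked_from integer);
create table project_commits (project_id integer, commit_id integer);
insert into projects values (1, 'rails', null), (2, 'rails', 1);
-- commits 10 and 11 are in both histories; commit 12 exists only in the fork
insert into project_commits values (1, 10), (1, 11), (2, 10), (2, 11), (2, 12);
""")


def shared_commit_ids(conn, fork_id):
    """Return the commit ids a fork has in common with the project it was forked from."""
    rows = conn.execute(
        """
        select fc.commit_id
        from projects f
        join project_commits fc on fc.project_id = f.id
        join project_commits pc on pc.project_id = f.forked_from
                               and pc.commit_id = fc.commit_id
        where f.id = ?
        order by fc.commit_id
        """,
        (fork_id,),
    ).fetchall()
    return [commit_id for (commit_id,) in rows]


shared = shared_commit_ids(conn, 2)
print(shared)
```

One practical consequence: counting commits across a project and all its forks without a distinct on commit_id would count the shared history once per fork.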

commit_comments

Code review comments on commits.

These are comments on individual commits. If a commit is associated with a pull request, then its comments are in the pull_request_comments table.

followers

A follower to a user.

The created_at field is only filled in accurately for followships for which GHTorrent has recorded a corresponding event. Otherwise, it is filled in with the latest date that the corresponding user or follower has been created.

watchers

Users that have starred (formerly: watched) a project.

The created_at field is only filled in accurately for starrings for which GHTorrent has recorded a corresponding event. Otherwise, it is filled in with the latest date that the corresponding user or project has been created.

pull_requests

A pull request initiated from head_repo_id:head_commit_id to base_repo_id:base_commit_id

pull_request_history

An event in the pull request lifetime

The action field can take the following values

pull_request_commits

A commit associated with a pull request

The list is additive. This means if a rebase with commit squashing takes place after the commits of a pull request have been processed, the old commits will not be deleted.

pull_request_comments

A code review comment on a commit associated with a pull request

The list is additive. If commits are squashed on the head repo, the comments remain intact.

issues

An issue associated with a repository

issue_events

An event on an issue

issue_comments

An entry to the issue discussion. This table is always filled in with pull request (or issue) discussion comments, irrespective of whether the repo has issues enabled or not.

repo_labels

A label to be assigned to an issue affecting this repository.

issue_labels

A label that has been assigned to an issue

Example queries

List commits for a repository

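select c.*
from commits c, project_commits pc, projects p, users u
where u.login = 'rails'
  and p.name = 'rails'
  and p.owner_id = u.id  -- owner_id as in the GHTorrent projects table; restricts to the repo owned by that user, not every fork named 'rails'
  and p.id = pc.project_id
  and c.id = pc.commit_id
order by c.created_at desc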

Get all actions for a pull request

select user, action, created_at from
(
  select prh.action as action, prh.created_at as created_at, u.login as user
  from pull_request_history prh, users u
  where prh.pull_request_id = ?
    and prh.actor_id = u.id
  union
  select ie.action as action, ie.created_at as created_at, u.login as user
  from issues i, issue_events ie, users u
  where ie.issue_id = i.id
    and i.pull_request_id = ?
    and ie.actor_id = u.id
  union
  select 'discussed' as action, ic.created_at as created_at, u.login as user
  from issues i, issue_comments ic, users u
  where ic.issue_id = i.id
    and u.id = ic.user_id
    and i.pull_request_id = ?
  union
  select 'reviewed' as action, prc.created_at as created_at, u.login as user
  from pull_request_comments prc, users u
  where prc.user_id = u.id
    and prc.pull_request_id = ?
) as actions
order by created_at;
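
The query above contains four ? placeholders that must all receive the same pull request id. A minimal sketch of one way to run it from a script follows. It assumes Python's sqlite3 module instead of MySQL, swaps the positional ? for a named :pr parameter so the id is bound once, and uses toy tables holding only the columns the query touches. The logins, ids, timestamps and the function name are made up for illustration.

```python
import sqlite3

SCHEMA = """
create table users (id integer primary key, login text);
create table issues (id integer primary key, pull_request_id integer);
create table pull_request_history (pull_request_id integer, actor_id integer, action text, created_at text);
create table issue_events (issue_id integer, actor_id integer, action text, created_at text);
create table issue_comments (issue_id integer, user_id integer, created_at text);
create table pull_request_comments (pull_request_id integer, user_id integer, created_at text);
"""

# Same four branches as the query above; :pr replaces each "?".
ACTIONS_SQL = """
select user, action, created_at from (
  select u.login as user, prh.action as action, prh.created_at as created_at
  from pull_request_history prh, users u
  where prh.pull_request_id = :pr and prh.actor_id = u.id
  union
  select u.login, ie.action, ie.created_at
  from issues i, issue_events ie, users u
  where ie.issue_id = i.id and i.pull_request_id = :pr and ie.actor_id = u.id
  union
  select u.login, 'discussed', ic.created_at
  from issues i, issue_comments ic, users u
  where ic.issue_id = i.id and u.id = ic.user_id and i.pull_request_id = :pr
  union
  select u.login, 'reviewed', prc.created_at
  from pull_request_comments prc, users u
  where prc.user_id = u.id and prc.pull_request_id = :pr
) as actions
order by created_at
"""


def pull_request_timeline(conn, pull_request_id):
    """Return (login, action, created_at) rows for one pull request, oldest first."""
    return conn.execute(ACTIONS_SQL, {"pr": pull_request_id}).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("insert into users values (?, ?)", [(1, "alice"), (2, "bob")])
conn.execute("insert into issues values (100, 7)")  # issue 100 tracks pull request 7
conn.executemany(
    "insert into pull_request_history values (?, ?, ?, ?)",
    [
        (7, 1, "opened", "2013-01-01 10:00:00"),
        (7, 2, "merged", "2013-01-03 09:00:00"),
        (8, 2, "opened", "2013-01-02 00:00:00"),  # different pull request, filtered out
    ],
)
conn.execute("insert into issue_comments values (100, 2, '2013-01-01 12:00:00')")
conn.execute("insert into pull_request_comments values (7, 2, '2013-01-02 08:00:00')")
conn.execute("insert into issue_events values (100, 1, 'referenced', '2013-01-02 15:00:00')")

timeline = pull_request_timeline(conn, 7)
for login, action, created_at in timeline:
    print(created_at, login, action)
```

With a driver that only supports positional placeholders, the equivalent is to pass the same id four times, once per branch of the union.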

Get participants in an issue or pull request

select distinct(user_id) from
(
  select user_id
  from pull_request_comments
  where pull_request_id = ?
  union
  select user_id
  from issue_comments ic, issues i
  where i.id = ic.issue_id and i.pull_request_id = ?
) as participants
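
As a small illustration of the participants query, the sketch below shows that a user who both reviewed code and joined the discussion is reported once. It assumes Python's sqlite3 module in place of MySQL, a named :pr parameter instead of the two ? placeholders, and toy rows with made-up ids. The function name is hypothetical.

```python
import sqlite3

PARTICIPANTS_SQL = """
select distinct user_id from (
  select user_id
  from pull_request_comments
  where pull_request_id = :pr
  union
  select ic.user_id
  from issue_comments ic, issues i
  where i.id = ic.issue_id and i.pull_request_id = :pr
) as participants
order by user_id
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table issues (id integer primary key, pull_request_id integer);
create table issue_comments (issue_id integer, user_id integer);
create table pull_request_comments (pull_request_id integer, user_id integer);
insert into issues values (100, 7), (101, 8);
-- user 2 both reviews pull request 7 and comments on its issue
insert into pull_request_comments values (7, 1), (7, 2), (8, 3);
insert into issue_comments values (100, 2), (100, 4), (101, 5);
""")


def participants(conn, pull_request_id):
    """Return the ids of users who commented on a pull request or its issue thread."""
    rows = conn.execute(PARTICIPANTS_SQL, {"pr": pull_request_id}).fetchall()
    return [user_id for (user_id,) in rows]


pr7_participants = participants(conn, 7)
print(pr7_participants)
```

The union already removes duplicate rows, so the outer distinct is kept only to mirror the original query.
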

CarterPape commented 9 years ago

Is there a link to the paper?

CarterPape commented 9 years ago

Owner (chairperson) email: obaysal@uwaterloo.ca

reesjones commented 9 years ago

I don't think this has a paper attached to it. It looks like a competition to me.

reesjones commented 9 years ago

Hasn't this already been added as msr14?

reesjones commented 9 years ago

Added as msr14.