opensciences / opensciences.github.io

Website for OpenScience -
http://openscience.us
MIT License

Intensive metrics for the study of the evolution of open source projects: Case studies from Apache Software Foundation projects #371

Open reesjones opened 9 years ago

reesjones commented 9 years ago

Paper link: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6624023

BenProvince commented 9 years ago

The raw data that the authors used is available. The authors describe processing this data into CSV files and say they made a replication package available, but the link to that package is broken. We should contact the authors.

Original (Raw) e-mail Data

This paper discusses the use of metrics which can be calculated from e-mail records about a project. The original Apache e-mails are available at the following link: http://mail-archives.apache.org/mod_mbox/

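This is simply an archive of thousands of e-mails. There is a web interface for browsing them, but the actual .mbox files can also be downloaded. Note that mbox is a plain-text file format for storing mail messages, not a program; the files can be opened by most mail clients or read with a library such as Python's standard mailbox module. The mbox man page, which documents the format, is available at the following link: http://www.qmail.org/man/man5/mbox.html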
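As a rough illustration of how a downloaded archive could be read, here is a minimal sketch using Python's standard `mailbox` module. The function name `read_mbox_headers` and the choice of headers to extract are assumptions made for this example; they are not taken from the paper or its replication package.

```python
import mailbox
from email.utils import parsedate_to_datetime


def read_mbox_headers(path):
    """Return a list of (sender, date, subject) tuples from an mbox file.

    Hypothetical helper for illustration only. `date` is a datetime,
    or None when the Date header is missing or cannot be parsed.
    """
    rows = []
    # create=False: fail instead of silently creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            raw_date = msg.get("Date")
            try:
                when = parsedate_to_datetime(raw_date) if raw_date else None
            except (TypeError, ValueError):
                # Old archives often contain malformed Date headers.
                when = None
            rows.append((msg.get("From", ""), when, msg.get("Subject", "")))
    finally:
        box.close()
    return rows
```

Calling `read_mbox_headers("201301.mbox")` on a locally saved file (the file name here is only a placeholder) would give one tuple per message, which is enough to start counting traffic per month.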

Cleaned Data

The authors discuss the cleaning of the data as follows and give a URL for a replication package with all of the scripts mentioned below.

A Shell script has been used for processing the mbox email files. The result is a file containing lists of comma separated value (CSV) tables per project. A number of regular expressions are then used to select, or filter out, automatically generated messages or messages belonging to the various types of traffic. Then, a Python script has been created to break the results into a single CSV file per project. The output is used with the R statistical environment: the data is read, transformed into data frames. Later on, it is processed so that it takes the form of a time series (ts) with the leading and trailing months without activity stripped out, and a series of derived and smoothed ratios are computed and added. For smoothing we have used loess, a polynomial estimate.
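Until the replication package is recovered, the filtering and time-series steps described above might look roughly like the sketch below. It is written in Python rather than the authors' Shell and R, and it is not their code: the regular expressions in `AUTOMATED_PATTERNS`, the function names, and the input shape (pairs of date and subject) are all assumptions for illustration. The loess smoothing step is omitted.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical patterns: the paper's actual regular expressions
# are not given in this thread.
AUTOMATED_PATTERNS = [
    re.compile(r"^\[jira\]", re.IGNORECASE),
    re.compile(r"^svn commit:", re.IGNORECASE),
    re.compile(r"^build (failed|fixed)", re.IGNORECASE),
]


def is_automated(subject):
    """True if the subject matches one of the assumed bot patterns."""
    text = subject.strip()
    return any(p.search(text) for p in AUTOMATED_PATTERNS)


def month_range(first, last):
    """Yield (year, month) pairs from first to last, inclusive."""
    year, month = first
    while (year, month) <= last:
        yield (year, month)
        month += 1
        if month == 13:
            year, month = year + 1, 1


def monthly_series(messages, start, end):
    """Count human-written messages per month over [start, end].

    messages: iterable of (date, subject) pairs.
    Leading and trailing months with no activity are stripped;
    inactive months in the middle are kept as zeros.
    """
    counts = Counter(
        (d.year, d.month) for d, subject in messages if not is_automated(subject)
    )
    series = [(ym, counts.get(ym, 0)) for ym in month_range(start, end)]
    active = [i for i, (_, n) in enumerate(series) if n > 0]
    if not active:
        return []
    return series[active[0]:active[-1] + 1]


sample = [
    (date(2013, 2, 3), "hello"),
    (date(2013, 2, 10), "[jira] Created: X"),
    (date(2013, 4, 1), "release plan"),
]
print(monthly_series(sample, (2013, 1), (2013, 6)))
```

The trimming keeps zero-activity months inside the active window, since dropping them would distort the spacing of a monthly series.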

The URL for the replication package is a broken link, so we should contact the authors: http://gsyc.urjc.es/~grex/repro/2013-apache-intensive

Authors

Santiago Gala-Perez (sgala@apache.org)
Gregorio Robles (grex@gsyc.urjc.es)
Jesus M. Gonzalez-Barahona (jgb@gsyc.urjc.es)
Israel Herraiz (israel.herraiz@upm.es)