Hello,
Some comments: the repos_dataset_all.csv link in https://github.com/researchart/rose6icse/blob/master/submissions/available/malavolta/INSTALL.md seems to be broken.
Actually, this is true for most links in that file, although I can find everything anyway in the main repo link.
What is the difference between online_questionnaire_responses.csv and online_questionnaire_responses_raw.csv?
What is the golden set in the dataset? Files like that I would either explain or remove.
In the full replication section "Configure GHTorrent (instructions) as a MySQL database instance, run all the queries in ghtorrent_queries.sql, and save the final result in dataset/repos_mining_dataintermediateResults/2_ghtorrent_github.json" Does this not have to be done three times for the 3 different types of repositories (also bitbucket, gitlab)? Is this step just for the github data? Is there a '/' missing in the file name?
Also, for that step and the bullet above, should these intermediate files actually be in the repo under dataset? Or are they left out deliberately, to be rebuilt by others? I see they are in the Archive zip, but the file names there don't quite match the description above. Quite a few files are found in the archive under intermediate results. Are these all created automatically? If not, what are the differences and naming conventions?
Will the .py scripts automatically know which of these files to use as input? Or must the user indicate which files are input? Or must they be run for all intermediate files? It's not clear if/how to modify the file names at the top of these files, or if they can be run without modification.
Overall, the classified data provided is useful, but it may be tricky to re-create the data without some clarifications.
I have no major issues with the dataset itself, and a badge for Reusable seems fine to me. The README is clear, the file structure is accessible, and I had minimal work rebuilding the data plots. There are a few minor R things that were easy to fix. I could not see where the DOI/permanent link was for an Available badge.
I'm concerned that you are sharing email addresses in the package. GHTorrent famously had the problem of angry OSS devs startled to find their emails were publicly shared. I'm also a bit surprised your ethical approval and/or GDPR allows for this sharing of PII. From what I can see of your questionnaire you never seek permission; indeed, it says "Optional, we will use it only once for sending the results of our study".
Perhaps another thing to add to this dataset is the HSR ethics certificate.
Personally, I don't see the replication value in adding them, because you risk burning some very useful contacts in the ROS community.
I would consider the "here" package instead of setwd() - setwd('.') does nothing useful. See https://malco.io/2018/11/05/why-should-i-use-the-here-package-when-i-m-already-using-projects/
When running the script I get Error in grDevices::pdf(file = filename, ..., version = version) : cannot open file './output/SystemType.pdf', I think because the folder is expected to exist.
@iivanoo over to you.
Thanks @minkull and the anonymous reviewers. Below I will report how we addressed the provided comments. The changes to the original submission of the artifact are available in PR https://github.com/researchart/rose6icse/pull/178.
the repos_dataset_all.csv link in https://github.com/researchart/rose6icse/blob/master/submissions/available/malavolta/INSTALL.md seems to be broken. Actually, this is true for most links in that file, although I can find everything anyway in the main repo link.
We corrected all the broken links, thanks for spotting them.
What is the difference between online_questionnaire_responses.csv and online_questionnaire_responses_raw.csv?
The online_questionnaire_responses_raw.csv file contains the answers provided by the participants of the questionnaire as they have been exported from the Google Drive form (email addresses have been anonymized). In contrast, the online_questionnaire_responses.csv file contains a superset of the previous file and also includes the codes we added when analysing the participants' answers. We expanded the description of these files in order to clarify their differences.
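For illustration only (this snippet is not part of the replication package, and the paths are placeholders), the relationship between the two files can be checked with a quick pandas script:

```python
import pandas as pd

# Placeholder paths; adjust to where the CSV files live in the replication package.
raw = pd.read_csv("dataset/online_questionnaire_responses_raw.csv")
coded = pd.read_csv("dataset/online_questionnaire_responses.csv")

# The coded file is expected to contain every column of the raw export,
# plus extra columns holding the codes added during analysis.
extra_columns = sorted(set(coded.columns) - set(raw.columns))
print("Columns added during coding:", extra_columns)
```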
What is the golden set in the dataset? Files like that I would either explain or remove.
The golden set is the list of repositories that we knew a priori were good candidates for our study. We used this set to (i) double-check whether our repository filtering steps were too strict and (ii) pilot the manual analysis of the contents of the repositories. We added this description to the new version of the replication package.
In the full replication section "Configure GHTorrent (instructions) as a MySQL database instance, run all the queries in ghtorrent_queries.sql, and save the final result in dataset/repos_mining_dataintermediateResults/2_ghtorrent_github.json" Does this not have to be done three times for the 3 different types of repositories (also bitbucket, gitlab)? Is this step just for the github data? Is there a '/' missing in the file name?
The mentioned step is performed only for GitHub, since GHTorrent is a mirror of the GitHub platform only. The other platforms (i.e., Bitbucket and GitLab) are mined by using rosmap.
We fixed the typo in the mentioned file name, thanks for spotting it.
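For illustration only, and not the exact commands of our pipeline, the GHTorrent step could be scripted roughly as follows, assuming a local MySQL instance loaded with the GHTorrent dump and the mysql-connector-python package (host, credentials, and database name are placeholders):

```python
import json
import mysql.connector

# Placeholder connection details for a local GHTorrent MySQL instance.
connection = mysql.connector.connect(
    host="localhost", user="ghtorrent", password="ghtorrent", database="ghtorrent"
)
cursor = connection.cursor(dictionary=True)

# Naively split the provided SQL file into individual queries;
# the queries in ghtorrent_queries.sql may also be run manually in a MySQL client.
with open("ghtorrent_queries.sql") as sql_file:
    queries = [q.strip() for q in sql_file.read().split(";") if q.strip()]

results = []
for query in queries:
    cursor.execute(query)
    results.extend(cursor.fetchall())

# Save the combined result where the INSTALL instructions expect it.
with open("dataset/repos_mining_data/intermediateResults/2_ghtorrent_github.json", "w") as out:
    json.dump(results, out, indent=2, default=str)

cursor.close()
connection.close()
```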
Also, for that step and the bullet above, should these intermediate files actually be in the repo under dataset? Or are they left out deliberately, to be rebuilt by others? I see they are in the Archive zip, but the file names there don't quite match the description above. Quite a few files are found in the archive under intermediate results. Are these all created automatically? If not, what are the differences and naming conventions?
The Archive.zip file is included in the replication package for completeness. All the intermediate files are recreated automatically when re-running the mining scripts.
Will the .py scripts automatically know which of these files to use as input? Or must the user indicate which files are input? Or must they be run for all intermediate files? It's not clear if/how to modify the file names at the top of these files, or if they can be run without modification.
The Python scripts always have a preamble containing the paths of both the files expected as input and the files produced as output. The rationale behind this choice is that the interested researcher can execute the Python scripts as they are, without modifying them. If needed, the preamble provides an easy-to-identify place where all the relevant file system paths can be customized according to the needs of the researcher reconstructing the dataset.
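To give an idea of what such a preamble looks like (this is a hypothetical sketch, not the literal header of any script in the package; the variable names and the output path are illustrative):

```python
import json

# --- Preamble: all file system paths used by this script ---
# Input file, produced by an earlier step of the mining pipeline.
INPUT_GHTORRENT_REPOS = "dataset/repos_mining_data/intermediateResults/2_ghtorrent_github.json"
# Output file, consumed by the next script in the pipeline (illustrative name).
OUTPUT_FILTERED_REPOS = "dataset/repos_mining_data/intermediateResults/filtered_repos.json"

# --- Main logic: only the paths above need editing, if at all ---
with open(INPUT_GHTORRENT_REPOS) as f:
    repos = json.load(f)

# ... filtering / analysis steps of the specific script would go here ...

with open(OUTPUT_FILTERED_REPOS, "w") as f:
    json.dump(repos, f, indent=2)
```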
Overall, the classified data provided is useful, but it may be tricky to re-create the data without some clarifications.
Thanks, we hope that the revised README and our clarifications will help.
I have no major issues with the dataset itself, and a badge for Reusable seems fine to me. The README is clear, the file structure is accessible, and I had minimal work rebuilding the data plots. There are a few minor R things that were easy to fix. I could not see where the DOI/permanent link was for an Available badge.
Thanks. We integrated the GitHub repository containing our replication package with Zenodo. Here is the DOI: 10.5281/zenodo.3672050
I'm concerned that you are sharing email addresses in the package. GHTorrent famously had the problem of angry OSS devs startled to find their emails were publicly shared. I'm also a bit surprised your ethical approval and/or GDPR allows for this sharing of PII. From what I can see of your questionnaire you never seek permission; indeed, it says "Optional, we will use it only once for sending the results of our study". Perhaps another thing to add to this dataset is the HSR ethics certificate. Personally, I don't see the replication value in adding them, because you risk burning some very useful contacts in the ROS community.
The reviewer is right. We removed the email addresses from our replication package.
I would consider the "here" package instead of setwd() - setwd('.') does nothing useful. See https://malco.io/2018/11/05/why-should-i-use-the-here-package-when-i-m-already-using-projects/
Good suggestion; we are now using it in our R script.
When running the script I get Error in grDevices::pdf(file = filename, ..., version = version) : cannot open file './output/SystemType.pdf', I think because the folder is expected to exist.
Yes, the output folder is expected to already exist. We fixed the error by adding the folder to the GitHub repository of our replication package.
Thanks. I'm satisfied with the answers. I would award Available and Reusable.
https://github.com/researchart/rose6icse/tree/master/submissions/available/malavolta
https://github.com/researchart/rose6icse/tree/master/submissions/reusable/malavolta
Ivano Malavolta (corresponding author)
Grace A. Lewis
Bradley Schmerl
Patricia Lago
David Garlan
Note to reviewers: these authors want multiple badges