softwaresaved / rse-repo-analysis

Study of research software in repositories. Contact: @karacolada
BSD 3-Clause "New" or "Revised" License

RSE Repository Analysis

About this project

We want to investigate how research software projects evolve over time and how they differ between disciplines by analysing relevant software repositories. This will help us better understand ongoing processes in the research software community and how they can be supported. It will also supply evidence about which practices help build and maintain software with wider community engagement.

We consider multiple approaches to finding relevant software repositories:

Hypothesis

The following is not meant to be a fixed set of things we want to find, but rather some ideas that may help guide what data is worth collecting. We hypothesise that

  1. research software repositories evolve in four stages
    1. no engagement: sparse commits, no issues, few authors, no license, no DOI citation
    2. publication: DOI, license, usage guidelines, some watchers/stars, some issues created and resolved by repository maintainers
    3. low engagement: external users create issues, maintainers resolve issues, forks
    4. community engagement: external users create and resolve issues, merge requests
  2. research software repositories that employ good practices reach higher stages (earlier)
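To make the four hypothesised stages concrete, here is a minimal sketch of how a repository snapshot might be mapped to a stage. It is not part of this project's code: the `RepoIndicators` fields, the `estimate_stage` function and the thresholds are all hypothetical and only illustrate the idea that a repository is assigned the highest stage whose signals it shows.

```python
from dataclasses import dataclass


@dataclass
class RepoIndicators:
    """Hypothetical summary of one repository at one point in time."""
    has_license: bool = False
    has_doi: bool = False
    forks: int = 0
    # Users outside the maintainer team who opened issues.
    external_issue_authors: int = 0
    # Users outside the maintainer team who resolved issues
    # or contributed merge requests.
    external_contributors: int = 0


def estimate_stage(repo: RepoIndicators) -> int:
    """Return the highest hypothesised stage (1 to 4) whose signals are present.

    The rules below are illustrative assumptions, not measured thresholds.
    """
    if repo.external_contributors > 0:
        return 4  # community engagement
    if repo.external_issue_authors > 0 or repo.forks > 0:
        return 3  # low engagement
    if repo.has_license or repo.has_doi:
        return 2  # publication
    return 1  # no engagement
```

Applying such a function to snapshots taken at different dates would give a per-repository stage timeline, which is one possible way to test whether good practices are associated with reaching higher stages earlier.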

Potential indicators

This list of indicators is meant for brainstorming. Not all data listed here will be collected in the end.

Contextual Metadata

To contextualise the result, we record information about the initial publication (title, author, ...). This could later be used to find the publication on CrossRef etc. and collect further information such as:

Getting started

Clone this repository, then fetch the submodules:

cd rse-repo-analysis
git submodule init
git submodule update

Usage

Most folders contain their own README file summarising their contents and how the scripts should be used. More details can be found in the wiki.

Requirements

As this project is developed using Python, it is recommended to set up a new virtual environment, e.g. using Conda. Inside the environment, install the following packages or use environment.yml:

GitHub API

We recommend creating an access token to authenticate with the GitHub API; otherwise you will quickly hit the API's rate limit. You can create a token here. Scripts in this project that use the GitHub API will usually look for a file called config.cfg and expect your access token to be in it. As access tokens should be kept secret, files named config.cfg are not tracked by Git.

To provide the code with your access token, simply create a copy of config_example.cfg, fill in your data and rename the copy to config.cfg. This file should be located in the root directory of this repository, but you might also need a copy of it in the software-mentions directory if you want to work with the code in the submodule.
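As an illustration of how a script might pick up the token, the following sketch reads it from config.cfg and builds request headers for the GitHub REST API. It assumes that config.cfg is an INI-style file; the section name `ACCESS` and the option name `token` are placeholders, so check config_example.cfg for the names actually used. The function names are hypothetical as well.

```python
import configparser
from pathlib import Path


def load_github_token(config_path="config.cfg", section="ACCESS", option="token"):
    """Read an access token from an INI-style config file.

    The default section and option names are assumptions; use the ones
    found in config_example.cfg.
    """
    path = Path(config_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; copy config_example.cfg and fill in your data"
        )
    # Disable interpolation so special characters in the token are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    return parser.get(section, option)


def auth_headers(token):
    """Build headers that authenticate a GitHub REST API request."""
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

The returned dictionary can be passed as the headers of any HTTP request to the GitHub API. Raising an error when the file is missing makes a forgotten config.cfg visible immediately, rather than surfacing later as a rate-limit failure.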

Datasets

Depending on what you are trying to do, you will need to download datasets and place them in the expected location within the repository. The paths might become configurable in the future, but for now that is out of scope.

CZI Dataset

The code in software-mentions expects the CZI dataset in a data directory at the root of the submodule, i.e. software-mentions/data. You can download the dataset here and extract it into that location.
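Because the dataset path is fixed, it can help to check that the data is in place before running anything in the submodule. This is a small sketch of such a check; `czi_dataset_ready` is a hypothetical helper, not a function provided by this repository, and it only tests that the directory exists and is not empty, not that its contents are complete.

```python
from pathlib import Path


def czi_dataset_ready(repo_root="."):
    """Return True if software-mentions/data exists and contains at least one entry."""
    data_dir = Path(repo_root) / "software-mentions" / "data"
    return data_dir.is_dir() and any(data_dir.iterdir())


if __name__ == "__main__":
    # Run from the root directory of this repository.
    if czi_dataset_ready():
        print("CZI dataset directory found.")
    else:
        print("software-mentions/data is missing or empty.")
```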

References

Here, we list some works that we make use of.

CZI Software Mentions

Istrate, A. M., Li, D., Taraborelli, D., Torkar, M., Veytsman, B., & Williams, I. (2022). A large dataset of software mentions in the biomedical literature. arXiv preprint arXiv:2209.00693.