Study of research software in repositories. Contact: @karacolada
We want to investigate how research software projects have changed over time, how they evolve and how they differ between disciplines by analysing relevant software repositories. This will help us gain a better understanding of ongoing processes in the research software community and of how they can be supported. It will also supply evidence about which practices help to build and maintain software with wider community engagement.
We consider multiple approaches to finding relevant software repositories:
The following is not meant to be a fixed set of things we want to find, but rather some ideas that may help guide what data is worth collecting. We hypothesise that
This list of indicators is meant for brainstorming. Not all data listed here will be collected in the end.
To contextualise the result, we record information about the initial publication (title, author, ...). This could later be used to find the publication on CrossRef etc. and collect further information such as:
Clone this repository, then fetch the submodules:
cd rse-repo-analysis
git submodule init
git submodule update
Most folders contain their own README file summarising their contents and how the scripts should be used. More details can be found in the wiki.
src
: any source code for the main bulk of this work (collecting and analysing RSE GitHub repositories)
data
: data used for the main bulk of this work
tex
: preliminary report on the main bulk of this work
software-mentions
: submodule containing a fork of the Chan Zuckerberg Initiative's Software Mentions Repository
SSI-notebooks
: our own scripts handling the CZI Software Mentions dataset
As this project is developed using Python, it is recommended to set up a new virtual environment, e.g. using Conda.
Inside the environment, install the following packages or use environment.yml:
pandas
Jupyter
matplotlib
PyGithub for access to the GitHub API
lxml for parsing XML
pdfminer.six for parsing PDFs (rather than pdfminer, which is no longer actively maintained)
pySpark
pyarrow
emoji
Levenshtein
unidecode
pydriller
wordcloud
seaborn
tol_colors
We advise creating an access token to authenticate with the GitHub API; otherwise, you will quickly run into the rate limit.
You can create a token here.
Scripts making use of the GitHub API in this project will usually check for a file called config.cfg and expect your access token to be in there.
As access tokens should be kept secret, files named config.cfg will not be tracked by Git.
To provide the code with your access token, simply create a copy of config_example.cfg, fill in your data and rename the copy to config.cfg.
This file should be located in the root directory of this repository, but you might also need a copy of it in the software-mentions directory if you want to work with the code in the submodule.
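As a rough illustration of how a script might pick up the token, here is a minimal sketch using Python's standard configparser module. The section name ACCESS and the key token are assumptions made for this example and may not match the layout of config_example.cfg; the helper load_token is likewise hypothetical and not part of this repository.

```python
import configparser
import tempfile
from pathlib import Path


def load_token(config_path):
    """Return the access token stored in config_path, or None if it is absent."""
    parser = configparser.ConfigParser()
    # ConfigParser.read silently skips files that do not exist.
    parser.read(config_path)
    # "ACCESS" and "token" are assumed names; check config_example.cfg for the real ones.
    return parser.get("ACCESS", "token", fallback=None)


# Demonstration with a throwaway file standing in for config.cfg.
with tempfile.TemporaryDirectory() as tmp:
    demo_cfg = Path(tmp) / "config.cfg"
    demo_cfg.write_text("[ACCESS]\ntoken = my-secret-token\n")
    demo_token = load_token(demo_cfg)
    missing_token = load_token(Path(tmp) / "does_not_exist.cfg")
```

Returning None for a missing file or key lets a caller decide whether to fall back to unauthenticated requests or stop with a clear message. The returned string can then be handed to PyGithub when constructing its client.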
Depending on what you are trying to do, you will need to download datasets and place them in the correct spot of the repository. This might be changed to configurable paths in the future, but for now, that is out of scope.
The code in software-mentions expects the CZI dataset in its root directory, i.e. software-mentions/data.
You can download the dataset here and extract it into the correct location.
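Because the paths are currently fixed, it can be useful to check that the dataset is where the code expects it before starting a long run. The following is a minimal sketch of such a check; the helper dataset_in_place is hypothetical and not part of this repository, and it only tests that software-mentions/data exists and is not empty, not that its contents are complete.

```python
import tempfile
from pathlib import Path


def dataset_in_place(repo_root):
    """Return True if software-mentions/data exists under repo_root and is not empty."""
    data_dir = Path(repo_root) / "software-mentions" / "data"
    return data_dir.is_dir() and any(data_dir.iterdir())


# Demonstration in a temporary directory standing in for the repository root.
with tempfile.TemporaryDirectory() as tmp:
    before = dataset_in_place(tmp)  # nothing extracted yet
    demo_dir = Path(tmp) / "software-mentions" / "data"
    demo_dir.mkdir(parents=True)
    # placeholder.tsv is an invented file name used only for this demonstration.
    (demo_dir / "placeholder.tsv").write_text("demo\n")
    after = dataset_in_place(tmp)
```

In practice, you would call dataset_in_place(".") from the root directory of this repository.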
Here, we list some works that we make use of.
Istrate, A. M., Li, D., Taraborelli, D., Torkar, M., Veytsman, B., & Williams, I. (2022). A large dataset of software mentions in the biomedical literature. arXiv preprint arXiv:2209.00693.