srdataset

SRdataset is an unlabeled dataset of GitHub repositories containing SRs (sample repositories).

Motivation. While working on models for the samples-filter project, we discovered the need to automate the dataset-building process on remote servers, since we needed to collect a large number of GitHub repositories automatically in order to be productive in our research. To do this, we integrated ghminer with a few scripts and packaged all of that as a Docker container.

How to use

To build a new version of the dataset, run this:

docker run --detach --name=srdataset --rm --volume "$(pwd):/srdataset" \
  -e "CSV=repos" \
  -e "SEARCH_QUERY=<query>" \
  -e "START_DATE=2019-01-01" \
  -e "END_DATE=2024-05-01" \
  -e "HF_TOKEN=xxx" \
  -e "INFERENCE_CHECKPOINT=sentence-transformers/all-MiniLM-L6-v2" \
  -e "PATS=pats.txt" \
  --oom-kill-disable \
  abialiauski/srdataset:0.0.1

Here, <query> is the search query for the GitHub API, 2019-01-01 is the start date and 2024-05-01 is the end date of the period in which the repositories were created, xxx is a HuggingFace token, required for accessing the inference endpoint that generates textual embeddings, and pats.txt is a file containing a number of GitHub PATs (personal access tokens).
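
For illustration, pats.txt is expected to simply list the tokens, presumably one per line (the ghp_... values below are placeholders, not real tokens):

ghp_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ghp_YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY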

The building process can take a while. After it completes, you should have these files:
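
Since the container runs detached, one way to watch the progress while it is building is to follow its logs, using the container name from the command above:

docker logs --follow srdataset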

If you run the container with -e PUSH_TO_HF=true, then after preprocessing the output CSV files are pushed to the HuggingFace profile passed in -e HF_PROFILE, using the provided HF_TOKEN. All outputs are pushed into datasets with the sr- prefix.
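
For example, to publish the outputs, extend the docker run command above with these two variables (where <profile> is your HuggingFace profile name):

  -e "PUSH_TO_HF=true" \
  -e "HF_PROFILE=<profile>" \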

If you run the container with -e "CLUSTER=true", you should get one ZIP file named like clusters-2024-06-21-18:22.zip, containing these files:

agglomerative
  /mix
    /members
    ...
  /numerical
    /members
    ...
  /textual
    /members
    ...
dbscan/... (the same structure)
gmm/...
kmeans/...
source/...

All experiments are grouped by model name: kmeans, dbscan, agglomerative, etc. In each model's directory you should have a members directory and a set of plots. members contains a set of text files tagged with the output cluster label, e.g. 0.txt. In source you should find all CSV files that were used to generate the clusters.
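
For instance, to see which repositories landed in the first cluster of the agglomerative experiment on textual features, you could unpack the archive and read the corresponding members file (the archive name and the 0.txt label are just examples taken from above):

unzip clusters-2024-06-21-18:22.zip -d clusters
cat clusters/agglomerative/textual/members/0.txt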

How to contribute

Fork the repository, make changes, and send us a pull request. We will review your changes and apply them to the master branch shortly, provided they don't violate our quality standards. To avoid frustration, before sending us your pull request please run the full make build:

make env test