The goal of this study is to create a model that, given a repository's README file and meta-information, can identify GitHub "sample repositories" (SRs): repositories that mostly contain educational or demonstration materials meant to be copied rather than reused as a dependency.
Motivation. While working on the CaM project, we needed to filter out repositories with samples. No readily available technique or tool could perform that function, so we researched the subject ourselves.
The repository is structured as follows:
Our research is based on the following hypotheses:

- SRs usually don't have a release pipeline inside `.github/workflows`
- SRs usually have a less strict build pipeline inside `.github/workflows`
- SRs usually don't have releases
- SRs have fewer pull requests
- SRs don't have a section on how to use them
- SRs have more disconnected directories/files

First, prepare the datasets:
```bash
docker run --rm -v "$(pwd)/output:/collection" -e START="<start date>" \
  -e END="<end date>" \
  -e COLLECT_TOKEN="<GitHub PAT to collect repositories and fetch metadata>" \
  -e HF_TOKEN="<Huggingface PAT>" -e COHERE_TOKEN="<Cohere API token>" \
  -e OUT="sr-data" h1alexbel/sr-detection
```
In the output directory you should have these datasets:

- `d1-scores.csv`
- `d2-sbert.csv`
- `d3-e5.csv`
- `d4-embedv3.csv`
- `d5-scores+sbert.csv`
- `d6-scores+e5.csv`
- `d7-scores+embedv3.csv`
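As a quick sanity check (a sketch, not part of the pipeline — the column names and values below are illustrative stand-ins for a scores dataset such as `d1-scores.csv`), you can inspect a generated CSV with the standard library:

```python
import csv
import io

# A tiny synthetic stand-in for a scores dataset; the real
# column layout produced by the pipeline may differ.
raw = io.StringIO(
    "repo,releases,pull_requests\n"
    "octocat/hello-world,0,1\n"
    "some/library,12,87\n"
)

rows = list(csv.DictReader(raw))
for row in rows:
    # Zero releases matches one of the SR hypotheses above.
    flagged = int(row["releases"]) == 0
    print(row["repo"], "candidate SR" if flagged else "likely library")
```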
Alternatively, you can download existing datasets from the `gh-pages` branch.
Then, run the models against the collected datasets:

```bash
just cluster
```
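`just cluster` runs the project's own models. To illustrate the underlying idea (this is not the project's actual code), here is a minimal k-means over two-dimensional feature vectors, using only the standard library:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means: returns a cluster label for each point."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels

# Two well-separated blobs standing in for repository feature
# vectors (e.g. scaled release and pull-request counts).
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = kmeans(data, k=2)
print(labels)
```

The real pipeline clusters much higher-dimensional inputs (README embeddings plus scores), but the mechanics are the same: assign, re-center, repeat.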
TBD..
Make sure that you have Python 3.10+, `just`, and `npm` installed on your system. Then fork this repository, make your changes, and send us a pull request. We will review your changes and apply them to the `master` branch shortly, provided they don't violate our quality standards. To avoid frustration, please run the full build before sending us your pull request:

```bash
just full
```