h1alexbel / samples-filter

Command-line filter for GitHub repositories that contain "samples" rather than a real project, framework, or library
MIT License

samples-filter


Samples-filter is a command-line filter that identifies sample repositories (SR) among GitHub repositories. An SR mostly contains educational or demonstration material that is meant to be copied rather than reused as a dependency, the way a framework or library is. Examples: leeowenowen/rxjava-examples, streaming-with-flink/examples-java, redisson/redisson-examples.

Motivation. While working on the CaM project, where we build datasets of open source Java programs, we discovered the need to filter out repositories that contain samples, tutorials, or examples. This repository provides a portable command-line tool that filters out those repositories.

How to use

First, install it from PyPI:

pip install samples-filter

Then run:

samples-filter filter --repositories=repos.csv --out=filtered.csv

For --repositories, provide the name of an existing CSV dataset of GitHub repositories; for --out, provide the name of the output file (it will be created automatically). If you get lost, try --help and the tool will explain what to do.

Optionally, you can choose which model to use for filtering via --model. You can pass either transformer (the default) or ml.
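When the filter is one step in a larger dataset pipeline, it may be convenient to call it from Python. The sketch below only assembles the command line from the flags described above (--repositories, --out, --model) and hands it to the shell. The helper names build_command and run_filter are hypothetical and not part of samples-filter, and the sketch assumes the samples-filter executable is available on PATH.

```python
import subprocess

# Model names as listed in this README: "transformer" (default) or "ml".
KNOWN_MODELS = ("transformer", "ml")


def build_command(repositories, out, model=None):
    """Assemble the samples-filter command line as a list of arguments.

    Hypothetical helper for illustration; not part of samples-filter.
    """
    cmd = [
        "samples-filter",
        "filter",
        f"--repositories={repositories}",
        f"--out={out}",
    ]
    if model is not None:
        # Reject anything other than the two documented model names.
        if model not in KNOWN_MODELS:
            raise ValueError(f"unknown model: {model!r}")
        cmd.append(f"--model={model}")
    return cmd


def run_filter(repositories, out, model=None):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(repositories, out, model), check=True)
```

For example, run_filter("repos.csv", "filtered.csv", model="ml") would execute the same command as shown above with --model=ml appended. Leaving model unset omits the flag, so the tool falls back to its default.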

Warning! Versions <=0.5.1 used models based on supervised learning algorithms, such as Random Forest and a fine-tuned transformer model based on DistilBERT. Those models could handle binary classification only. In contrast, the latest versions use models based on unsupervised learning and can output a rating of how similar the input repository is to an SR.

How to contribute

Fork the repository, make changes, and send us a pull request. We will review your changes and apply them to the master branch shortly, provided they don't violate our quality standards. To avoid frustration, please run the full build before sending us your pull request:

make install cov check

To set up a virtual environment, use these commands:

python3 -m venv venv
source $(pwd)/venv/bin/activate

You will need Python 3.11+ installed.
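To confirm that the interpreter inside the virtual environment meets this requirement, a quick check along these lines may help. It is a minimal sketch; the function name is_supported is hypothetical and not part of the project.

```python
import sys

# Minimum interpreter version stated in this README.
MIN_VERSION = (3, 11)


def is_supported(version_info=None):
    """Return True when the given (or current) Python is 3.11 or newer."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the (major, minor) pair.
    return tuple(version_info[:2]) >= MIN_VERSION
```

Calling is_supported() with no arguments checks the running interpreter; passing a tuple such as (3, 10) checks an arbitrary version.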