rora00 / toy-dataset-ranking

Estimates dataset usage for common toy datasets in Python and R using the GitHub Search API
MIT License

toy-dataset-ranking

This repository estimates the popularity of "toy" datasets, i.e. synthetic or real datasets that are typically used to quickly test visualisations or models (e.g. Iris).

Setup

Clone the repository, navigate to it on the command line, and set up a Python virtual environment (e.g. using conda). Activate the environment and run:

pip install -r requirements.txt

Create a .env file in the root of the repository and set the variable GITHUB_TOKEN to your personal access token. GitHub's documentation explains how to create a fine-grained personal access token.
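As a rough illustration of how a script could pick up the token, the sketch below reads GITHUB_TOKEN from the environment and falls back to a .env file. It uses only the standard library; the helper names parse_env and load_github_token are hypothetical, and query_api.py may well load the token differently (for example through a package listed in requirements.txt).

```python
import os
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_github_token(env_path: str = ".env") -> str:
    """Return GITHUB_TOKEN from the environment, else from the .env file."""
    token = os.environ.get("GITHUB_TOKEN")
    if not token and Path(env_path).is_file():
        token = parse_env(Path(env_path).read_text()).get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set; add it to .env")
    return token
```

A .env file containing the single line GITHUB_TOKEN=<your token> is enough for this sketch.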

Once the dependencies are installed and the token is set, run the script with:

python query_api.py

The output will be generated as a .csv file.

Methodology

The popularity of a dataset is estimated by counting the repositories in which the dataset is loaded. This is currently implemented for datasets in Python's scikit-learn (sklearn) and in base R. The exact query can be found in query_api.py. Note that the search uses the GitHub Search API, so results may differ from GitHub's Blackbird UI search.
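The general shape of such a lookup can be sketched as follows. This is a minimal illustration and not the code in query_api.py: the query strings, the EXAMPLE_QUERIES mapping and the function names are assumptions made for the example. It calls GitHub's code search endpoint, whose total_count field counts matching files, so the number is only a proxy for a repository count.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/code"

# Hypothetical queries; the repository's real queries live in query_api.py.
EXAMPLE_QUERIES = {
    "iris": "load_iris language:Python",
    "wine": "load_wine language:Python",
}


def build_request(query: str, token: str) -> urllib.request.Request:
    """Build an authenticated code-search request for one query string."""
    # per_page=1 keeps the response small; only total_count is needed.
    params = urllib.parse.urlencode({"q": query, "per_page": 1})
    return urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def total_count(response_body: str) -> int:
    """Extract the number of matches from a search response body."""
    return int(json.loads(response_body)["total_count"])


def count_matches(query: str, token: str) -> int:
    """Run the search over the network and return the match count."""
    with urllib.request.urlopen(build_request(query, token)) as resp:
        return total_count(resp.read().decode("utf-8"))
```

Looping over EXAMPLE_QUERIES and calling count_matches for each entry would give one count per dataset. The Search API is rate limited, so a real script would probably need to pause between requests.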

Results

scikit-learn Dataset Usage in GitHub Repositories