paser-group / MLForensics

Placeholder for source code and other relevant artifacts for research project titled `Forensic Anti-patterns for Machine learning`

Task#4: Mining Datasets from GitHub and GitLab #4

Open akondrahman opened 3 years ago

akondrahman commented 3 years ago

Task 4.1: Mine GitHub Datasets

  1. Write a Python script that goes through the list in
    INITIAL_PYTHON_REPOS_GITHUB.xlsx

    and downloads each repo. For help, see this: https://github.com/paser-group/Microservice-Security/blob/master/repo_name_downloader.py

  2. After downloading, check whether any of the following appears in at least one Python or IPython file of the repo: sklearn, h5py, gym, rl, tensorflow, keras, tf, stable_baselines, tensorforce, rl_coach, pyqlearning, MAMEToolkit, chainer, torch, chainerrl. If any of them appears, keep the repo; otherwise delete it. For help, see this: https://github.com/paser-group/Microservice-Security/blob/master/repo_name_downloader.py
  3. For the remaining repos, write a script that automatically calculates commits per month, developer count, count of Python files, count of total files, and count of commits. For help, see this: https://github.com/paser-group/Microservice-Security/blob/master/eureka_checker.py
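
A minimal sketch of the check in step 2, not the referenced script. It assumes that "appears" means "is imported", so it looks at `import` / `from ... import` lines instead of doing a raw substring search (short names such as `tf` and `rl` would otherwise match almost anything). It also assumes notebooks are in the nbformat 4 JSON layout with a top-level `cells` list. The function names `uses_ml_library`, `notebook_code`, and `repo_uses_ml` are hypothetical.

```python
import json
import re
from pathlib import Path

# Library names copied from step 2 of the task description.
ML_LIBRARIES = {
    "sklearn", "h5py", "gym", "rl", "tensorflow", "keras", "tf",
    "stable_baselines", "tensorforce", "rl_coach", "pyqlearning",
    "MAMEToolkit", "chainer", "torch", "chainerrl",
}

# Captures the top-level module of "import x.y" and "from x.y import z".
# Limitation: for "import a, b" only the first module is captured.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)


def uses_ml_library(source_text):
    """Return True if the text imports at least one library from ML_LIBRARIES."""
    return any(name in ML_LIBRARIES for name in IMPORT_RE.findall(source_text))


def notebook_code(raw_json):
    """Join the code cells of a notebook (assumed nbformat 4) into one string."""
    parts = []
    try:
        for cell in json.loads(raw_json).get("cells", []):
            if cell.get("cell_type") == "code":
                src = cell.get("source", "")
                parts.append("".join(src) if isinstance(src, list) else src)
    except (json.JSONDecodeError, AttributeError):
        return ""  # unreadable or unexpected notebook layout: treat as empty
    return "\n".join(parts)


def repo_uses_ml(repo_dir):
    """Return True if any .py or .ipynb file in repo_dir imports an ML library."""
    for path in Path(repo_dir).rglob("*"):
        if not path.is_file() or path.suffix not in {".py", ".ipynb"}:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        if path.suffix == ".ipynb":
            text = notebook_code(text)
        if uses_ml_library(text):
            return True
    return False
```

A repo for which `repo_uses_ml(...)` returns False could then be deleted with `shutil.rmtree`; that step is left out here so the sketch never removes anything by itself.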

Once you are done, send me the final CSV from step 3.
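
A rough sketch of how the metrics for the final CSV could be computed, not the referenced `eureka_checker.py`. The definitions are assumptions: a developer is a distinct author email, and commits per month is the commit count divided by the number of calendar months between the first and last commit (inclusive). All function names and the CSV column names are hypothetical. It requires `git` on the PATH.

```python
import csv
import subprocess
from pathlib import Path


def git_log_lines(repo_dir):
    """Return one 'author_email|YYYY-MM' line per commit, using git log."""
    out = subprocess.run(
        ["git", "-C", str(repo_dir), "log",
         "--pretty=format:%ae|%ad", "--date=format:%Y-%m"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def commit_metrics(log_lines):
    """Compute commit count, developer count, and commits per month."""
    emails, months = set(), []
    for line in log_lines:
        email, month = line.rsplit("|", 1)
        emails.add(email.lower())
        months.append(month)
    if not months:
        return {"commits": 0, "developers": 0, "commits_per_month": 0.0}
    first, last = min(months), max(months)
    # Inclusive number of calendar months between first and last commit.
    span = (int(last[:4]) - int(first[:4])) * 12 \
        + int(last[5:7]) - int(first[5:7]) + 1
    return {
        "commits": len(months),
        "developers": len(emails),
        "commits_per_month": round(len(months) / span, 2),
    }


def file_counts(repo_dir):
    """Count Python files and all files, ignoring the .git directory."""
    root = Path(repo_dir)
    files = [p for p in root.rglob("*")
             if p.is_file() and ".git" not in p.relative_to(root).parts]
    return {
        "python_files": sum(p.suffix == ".py" for p in files),
        "total_files": len(files),
    }


def write_metrics_csv(rows, csv_path):
    """Write one row per repo; each row is a dict with the keys below."""
    fields = ["repo", "commits", "developers", "commits_per_month",
              "python_files", "total_files"]
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

For a single repo, a row could be built as `{"repo": name, **commit_metrics(git_log_lines(path)), **file_counts(path)}` and collected into the list passed to `write_metrics_csv`.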

akondrahman commented 3 years ago

Task 4.2: Mine GitLab Datasets

Use the following list to repeat the same steps as in Task 4.1:

INITIAL_PYTHON_REPOS_GITLAB.txt
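
A possible starting point for the download step, not the referenced downloader script. It assumes the text file holds one clone URL per line, which is a guess about the file's layout; the function names and the `owner__repo` directory naming are hypothetical. Cloning shells out to `git clone`, so `git` must be installed.

```python
import subprocess
from pathlib import Path
from urllib.parse import urlparse


def clone_targets(lines, dest_root):
    """Map each non-empty URL line to a (url, destination directory) pair."""
    targets = []
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comment lines
        path = urlparse(url).path.strip("/")
        if path.endswith(".git"):
            path = path[:-4]
        # Flatten "group/project" into one directory name to avoid clashes.
        targets.append((url, Path(dest_root) / path.replace("/", "__")))
    return targets


def clone_all(list_file, dest_root):
    """Clone every repo in list_file; return the URLs that failed to clone."""
    with open(list_file, encoding="utf-8") as fh:
        targets = clone_targets(fh, dest_root)
    failed = []
    for url, dest in targets:
        if dest.exists():
            continue  # already downloaded in an earlier run
        result = subprocess.run(["git", "clone", "--quiet", url, str(dest)])
        if result.returncode != 0:
            failed.append(url)
    return failed
```

Calling `clone_all("INITIAL_PYTHON_REPOS_GITLAB.txt", "gitlab_repos")` would then place each repo under `gitlab_repos/`; the destination directory name is only an example.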

akondrahman commented 3 years ago

@fbhuiyan42

As a gentle reminder: once you are done, send me the CSV files with the metric values and the repos with only Python files and notebooks.