tna-hub / mining_ml_repos


Mining machine learning repositories

For instructions on how to run the extraction, please read the help.txt file.

This project aims to mine GitHub machine learning repositories in order to understand how they deal with datasets. It is part of a bigger project with three research questions (RQ1, RQ2 and RQ3, described below).

Dataset

We analysed repositories from the Papers With Code project, which links machine learning research papers with their corresponding GitHub repositories. We then filtered this list to keep only repositories written in Python.

a) Downloaded the Papers With Code files (JSON).

b) Extracted, through the GitHub API, each repository's main language and number of commits, keeping the repositories with Python as the main language.
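
A minimal sketch of step b), assuming the requests library and the GitHub REST API; the token placeholder and the way owner/name are obtained are illustrative. The commit count is read from the `rel="last"` page number returned in the Link header when asking for one commit per page:

```python
import re
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": "token <YOUR_GITHUB_TOKEN>"}  # placeholder; unauthenticated calls work but are rate-limited

def repo_metadata(owner, name):
    """Return (main_language, nb_commits) for one repository."""
    repo = requests.get(f"{API}/repos/{owner}/{name}", headers=HEADERS).json()
    language = repo.get("language")  # main language as reported by GitHub

    # Ask for one commit per page; the page number tagged rel="last" in the
    # Link header then equals the total number of commits.
    resp = requests.get(f"{API}/repos/{owner}/{name}/commits",
                        params={"per_page": 1}, headers=HEADERS)
    match = re.search(r'page=(\d+)>; rel="last"', resp.headers.get("Link", ""))
    nb_commits = int(match.group(1)) if match else len(resp.json())
    return language, nb_commits
```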

c) We filtered the list by removing duplicates and projects that did not have Python 3 as the main language (we focused only on Python 3 because the Python 3 AST is not compatible with the Python 2 AST). Duplicates are projects that share the same GitHub link; these are usually different versions of the same paper that still point to the same repository.

d) The result is a CSV file with the headers id (unique id of the repository), link (GitHub link of the repository) and nb_commits (total number of commits in the repository).
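
A minimal sketch of steps c) and d) with pandas; the column names and example rows are illustrative, and the Python 2/3 distinction is handled separately since the GitHub API language field only reports "Python":

```python
import pandas as pd

# Illustrative input: one row per paper/repository pair collected in steps a) and b).
df = pd.DataFrame([
    {"link": "https://github.com/org/repo-a", "language": "Python", "nb_commits": 120},
    {"link": "https://github.com/org/repo-a", "language": "Python", "nb_commits": 120},  # duplicate paper version
    {"link": "https://github.com/org/repo-b", "language": "C++", "nb_commits": 45},
])

# Keep Python repositories and drop duplicates sharing the same GitHub link.
filtered = (df[df["language"] == "Python"]
            .drop_duplicates(subset="link")
            .reset_index(drop=True))

# Step d): write the CSV with an explicit id column.
filtered.insert(0, "id", range(1, len(filtered) + 1))
filtered[["id", "link", "nb_commits"]].to_csv("repos.csv", index=False)
```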

e) For each commit, we extracted information we consider relevant for the analysis and stored it in a PostgreSQL database, using the tables described below (a small extraction sketch follows the table definitions).


Table commit_modifications

| Idx | Field Name | Data Type | Description |
| --- | --- | --- | --- |
| * | id | integer | GENERATED BY DEFAULT AS IDENTITY |
| * | file_id | integer | |
|   | change_type | varchar | |
|   | commit_id | integer | |

Table commits

| Idx | Field Name | Data Type | Description |
| --- | --- | --- | --- |
| * | id | integer | GENERATED BY DEFAULT AS IDENTITY |
| * | repo_id | integer | |
| * | sha | varchar | |
|   | commit_date | timestamp | |
|   | author_name | varchar | |
|   | author_email | varchar | |
|   | total_modifs | integer | |

Table datasets

| Idx | Field Name | Data Type | Description |
| --- | --- | --- | --- |
| * | id | integer | GENERATED BY DEFAULT AS IDENTITY |
| * | element_id | integer | |
| * | heuristic | varchar(2) | The heuristic used to identify the element as a dataset |
| * | file_mention | integer | The file where the dataset is loaded |
| * | repo_id | integer | |

Table element

Information about files and folders in the repositories

| Idx | Field Name | Data Type | Description |
| --- | --- | --- | --- |
| * | id | integer | GENERATED BY DEFAULT AS IDENTITY |
|   | name | varchar(500) | |
|   | is_code_file | bool | True if the file contains code, False otherwise |
|   | ast | json | JSON AST of the file's code |
|   | repo_id | integer | |
| * | is_folder | bool | True if the element is a folder, False otherwise |
|   | extension | varchar | |
|   | imports | text | List of libraries imported in the element (file) |

Table repos

Repositories to analyse

| Idx | Field Name | Data Type | Description |
| --- | --- | --- | --- |
| * | id | integer | |
| * | link | varchar(5000) | GitHub link |
|   | nb_commits | integer | Total number of commits in the repository |
|   | name | varchar(500) | |
|   | folder_name | varchar(500) | |

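A minimal sketch of step e) for the commits table, assuming PyDriller (recent versions of its API) to traverse a locally cloned repository and psycopg2 for the inserts; the connection settings, repository path and repo_id are placeholders:

```python
import psycopg2
from pydriller import Repository

conn = psycopg2.connect(dbname="mining_ml", user="postgres", password="***")  # placeholder credentials
cur = conn.cursor()

repo_id = 1  # id of this repository in the repos table
for commit in Repository("path/to/cloned/repo").traverse_commits():
    # The id column is GENERATED BY DEFAULT AS IDENTITY, so it is omitted here.
    cur.execute(
        """INSERT INTO commits (repo_id, sha, commit_date, author_name, author_email, total_modifs)
           VALUES (%s, %s, %s, %s, %s, %s)""",
        (repo_id, commit.hash, commit.committer_date,
         commit.author.name, commit.author.email, len(commit.modified_files)),
    )
conn.commit()
```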

f) For RQ3, it may be interesting to use the number of commits as another filtering criterion: remove projects with fewer than x commits to keep only projects with enough history. For confidentiality reasons, researchers sometimes develop their code outside GitHub or in a private repository, and once the paper is published they push all the code to GitHub in one or a few commits. This removes all the metadata associated with the project's evolution and makes the project impossible to mine.
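
A minimal sketch of this extra filter, assuming the repos.csv produced in step d) and an illustrative threshold (the actual value of x still has to be chosen):

```python
import pandas as pd

MIN_COMMITS = 10  # illustrative threshold x

repos = pd.read_csv("repos.csv")
kept = repos[repos["nb_commits"] >= MIN_COMMITS]
kept.to_csv("repos_filtered.csv", index=False)
print(f"kept {len(kept)} of {len(repos)} repositories")
```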

Method

Given that there are several ways to store data and that GitHub tracks changes to a project through its files, we study how data files are stored and how they can impact the evolution of the repository.

RQ1

Using the Python ast module, we extracted the imported libraries from each repository, then analysed how these libraries are used across all the repositories using association rule mining.
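
A minimal sketch of the import extraction for a single file with the standard ast module; keeping only the top-level package name is a choice of this sketch:

```python
import ast

def imported_libraries(path):
    """Return the set of top-level packages imported in a Python 3 source file."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())

    libs = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            libs.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            libs.add(node.module.split(".")[0])
    return libs

# Example: imported_libraries("model.py") might return {"numpy", "torch", "pandas"}
```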

RQ2

We developed heuristics to identify datasets:

1) h1: file and folder names: any non-code file or folder whose name contains the string *data* is considered to store data files.

2) h2: non-code file or folder name loaded in the code: using the Python ast module, we find all mentions of non-code file and folder names in code files, ignoring standard files such as README.md, setup.py and requirements.txt, as well as files with no extension or ending with ".md", ".yml", ".sh" or ".h".
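
A minimal sketch of both heuristics; the ignored names, the extension list and the way the non-code element names are passed in are illustrative:

```python
import ast
import os

IGNORED_NAMES = {"README.md", "setup.py", "requirements.txt"}
IGNORED_EXTENSIONS = {"", ".md", ".yml", ".sh", ".h"}

def h1_is_data_element(name):
    """h1: the name of a non-code file or folder contains the string 'data'."""
    return "data" in name.lower()

def h2_mentions(code_path, non_code_names):
    """h2: non-code file/folder names that appear in string literals of a code file."""
    with open(code_path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    literals = {node.value for node in ast.walk(tree)
                if isinstance(node, ast.Constant) and isinstance(node.value, str)}
    candidates = {n for n in non_code_names
                  if n not in IGNORED_NAMES
                  and os.path.splitext(n)[1] not in IGNORED_EXTENSIONS}
    return {n for n in candidates if any(n in s for s in literals)}
```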

It may be interesting to find out whether the data is used as an input or an output dataset by looking at the function that references it. However, due to the disparity and the large number of ways to use data (custom functions, the built-in open(), library functions), this is a hard task that will take time before it covers enough cases to avoid biasing the results. We could use the libraries from RQ1 and check the functions each library provides to load datasets, along with the built-in open('filename', 'mode'), to determine whether the dataset is loaded as input or output.
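
As an illustration of the open() part only, a hedged sketch that inspects direct calls to the built-in open() with literal arguments; custom and library loading functions are not covered:

```python
import ast

def open_calls(code_path):
    """Yield (filename, mode) for direct calls to the built-in open() with literal arguments."""
    with open(code_path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and node.func.id == "open":
            args = node.args
            filename = args[0].value if args and isinstance(args[0], ast.Constant) else None
            mode = args[1].value if len(args) > 1 and isinstance(args[1], ast.Constant) else "r"
            yield filename, mode  # 'r' suggests an input dataset, 'w'/'a' an output
```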

RQ3

The third RQ assesses how data files and code evolve together. For each identified data file (a file from the repository identified as a data file according to the RQ2 heuristics), we collected all the commits modifying that file (C). We then looked at how many times each other file appears in those commits. Say the file "data.csv" has been modified n=10 times (there are ten commits where that file appears). For each commit C[i] in C, we listed all the files modified by that commit; let C[i][j] denote the jth file of commit C[i]. We then checked the presence of C[i][j] in all n=10 commits of C and counted how many times that file appears.
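
A minimal sketch of this counting, assuming the commit data has already been loaded into a hypothetical commit_files dict mapping each commit sha to the set of files it modifies:

```python
from collections import Counter

def co_modification_counts(datafile, commit_files):
    """For each file, count in how many of the commits touching `datafile` it also appears.

    commit_files: dict mapping a commit sha to the set of files modified by that commit.
    """
    touching = [files for files in commit_files.values() if datafile in files]  # the commits C
    counts = Counter()
    for files in touching:
        for f in files:
            if f != datafile:
                counts[f] += 1
    return len(touching), counts  # n and, for each file, its occurrences within C

# Example: n, counts = co_modification_counts("data.csv", commit_files)
```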

(Not yet done) We need a mathematical method combining the number of commits and the number of occurrences of a file in those commits to assess the threshold from which a file is considered correlated to another file in its evolution. It may also be interesting to check whether the reverse holds (whether a correlated to b implies b is correlated to a).
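
One possible way to formalise this, shown only as an illustration and not as the method finally chosen: the fraction of commits modifying a file a that also modify a file b. This measure is asymmetric by construction, which is why the reverse direction has to be checked separately:

```python
def co_change_ratio(a, b, commit_files):
    """Fraction of commits modifying `a` that also modify `b` (an asymmetric measure)."""
    commits_a = [files for files in commit_files.values() if a in files]
    if not commits_a:
        return 0.0
    both = sum(1 for files in commits_a if b in files)
    return both / len(commits_a)

# co_change_ratio("data.csv", "train.py") is in general different from
# co_change_ratio("train.py", "data.csv").
```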