For instructions on how to run the extraction, please read the help.txt file.
This project aims to mine GitHub machine learning repositories in order to understand how they deal with datasets. It is part of a bigger project with three research questions (RQ1-RQ3), discussed below.
We analysed repositories from the Paper of Code project, which pairs machine learning research papers with their corresponding GitHub repository. We then filtered them to keep only repos written in Python.
a) Downloaded the Paper of Code files (JSON).
b) Extracted, through the GitHub API, each repository's main language and number of commits, in order to keep the repositories with Python as main language (a sketch of the API calls is shown after step e below).
c) Filtered by removing duplicates and projects that didn't have Python 3 as their main language (we focused on Python 3 only because the Python 3 AST is not compatible with the Python 2 AST). Duplicates are projects that share the same GitHub link; these are usually different versions of the paper still linked to the same GitHub repository.
d) The result is a CSV file with the headers id (unique id of the repository), link (GitHub link of the repo), and nb_commits (total commits in the repository).
e) For each commit, we extracted information we consider relevant for the analysis and stored it in a PostgreSQL database. The tables below describe the schema; a sketch of the commit extraction itself follows the tables.
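The metadata lookup in step b) relies on the GitHub REST API. The sketch below is a minimal illustration rather than the actual extraction script (see help.txt for that): it reads the repository's primary language from the repos endpoint and derives the commit count from the pagination Link header.

```python
# Hedged sketch of step b): query the GitHub REST API for a repository's
# main language and total number of commits. Endpoints are the public
# api.github.com ones; the commit count is inferred from the last page
# number of the paginated commits listing.
import re
import requests

API = "https://api.github.com"

def repo_metadata(owner, repo, token=None):
    headers = {"Authorization": f"token {token}"} if token else {}

    # Primary language reported by GitHub for the repository.
    info = requests.get(f"{API}/repos/{owner}/{repo}", headers=headers).json()
    language = info.get("language")

    # Ask for one commit per page and read the last page number from the
    # Link header: with per_page=1 it equals the total number of commits.
    resp = requests.get(f"{API}/repos/{owner}/{repo}/commits",
                        params={"per_page": 1}, headers=headers)
    match = re.search(r'page=(\d+)>; rel="last"', resp.headers.get("Link", ""))
    nb_commits = int(match.group(1)) if match else len(resp.json())

    return language, nb_commits

# Example call (hypothetical repository):
# print(repo_metadata("someuser", "some-ml-repo"))
```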
Files changed by each commit

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer GENERATED BY DEFAULT AS IDENTITY | |
| * | file_id | integer | |
|   | change_type | varchar | |
|   | commit_id | integer | |
Commits of the analysed repositories

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer GENERATED BY DEFAULT AS IDENTITY | |
| * | repo_id | integer | |
| * | sha | varchar | |
|   | commit_date | timestamp | |
|   | author_name | varchar | |
|   | author_email | varchar | |
|   | total_modifs | integer | |
Elements identified as datasets by the heuristics (RQ2)

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer GENERATED BY DEFAULT AS IDENTITY | |
| * | element_id | integer | |
| * | heuristic | varchar(2) | The heuristic used to identify the element as a dataset |
| * | file_mention | integer | The file where the dataset is loaded |
| * | repo_id | integer | |
Information about files and folders in the repositories

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer GENERATED BY DEFAULT AS IDENTITY | |
|   | name | varchar(500) | |
|   | is_code_file | bool | True if the file contains code, False otherwise |
|   | ast | json | JSON AST of the file's code |
|   | repo_id | integer | |
| * | is_folder | bool | True if the element is a folder, False if it is a file |
|   | extension | varchar | |
|   | imports | text | List of libraries imported in the element (file) |
Repositories to analyse

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer | |
| * | link | varchar(5000) | GitHub link |
|   | nb_commits | integer | Total commits in the repository |
|   | name | varchar(500) | |
|   | folder_name | varchar(500) | |
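To illustrate step e), here is a hedged sketch of how the commit information could be collected with GitPython and written to the commits table described above. The table name, the column mapping, and the connection string are assumptions; the real extraction script may differ.

```python
# Hedged sketch of step e): walk a cloned repository with GitPython and
# insert one row per commit into a PostgreSQL table shaped like the
# commits schema above. Table name and DSN are placeholders.
from git import Repo
import psycopg2

def store_commits(repo_path, repo_id, dsn="dbname=mining user=postgres"):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()

    for commit in Repo(repo_path).iter_commits():
        cur.execute(
            """INSERT INTO commits
               (repo_id, sha, commit_date, author_name, author_email, total_modifs)
               VALUES (%s, %s, %s, %s, %s, %s)""",
            (repo_id,
             commit.hexsha,
             commit.committed_datetime,
             commit.author.name,
             commit.author.email,
             commit.stats.total["files"]),  # assumed meaning of total_modifs: files touched
        )

    conn.commit()
    cur.close()
    conn.close()
```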
f) For RQ3, it may be interesting to use the number of commits as an additional filtering criterion: remove projects with fewer than x commits so as to keep only projects with a sufficient amount of history (a sketch follows the explanation below).
For confidentiality reasons, researchers sometimes develop their code outside GitHub or in a private repository; once the paper is published, they push all the code to GitHub in one or a few commits.
This removes all the metadata associated with the project's evolution and makes the project impossible to mine.
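A minimal sketch of the filtering in f), assuming the CSV produced in step d) with columns id, link and nb_commits; the file names and the threshold x are placeholders.

```python
# Hedged sketch of step f): keep only repositories with at least x commits.
import pandas as pd

MIN_COMMITS = 10  # placeholder for the threshold x, still to be chosen

repos = pd.read_csv("repositories.csv")  # columns: id, link, nb_commits
filtered = repos[repos["nb_commits"] >= MIN_COMMITS]
filtered.to_csv("repositories_filtered.csv", index=False)
```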
Given that there are several ways to store data and that GitHub keeps track of changes to a project through its files, we study how data files are stored and how they can impact the evolution of the repository.
Using the Python ast module, we extracted the imported libraries from each repository, then analysed how these libraries are used across all the repositories using association rule mining.
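A minimal sketch of the import extraction, using only the standard ast module; the function name is illustrative and the real script may differ.

```python
# Parse a Python 3 source file and collect the top-level names of all
# import / from ... import statements.
import ast

def extract_imports(path):
    with open(path, encoding="utf-8") as fh:
        tree = ast.parse(fh.read())

    libraries = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            libraries.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            libraries.add(node.module.split(".")[0])
    return sorted(libraries)

# Example: extract_imports("train.py") might return ["numpy", "pandas", "torch"]
```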
We developed heuristics to identify datasets:
1) h1: file and folder names: any non-code file or folder whose name contains the string *data* is considered to store data files.
2) h2: a non-code file or folder name is loaded in the code: using the Python ast module, we find all mentions of non-code file and folder names in code files, ignoring standard files such as README.md, setup.py, and requirements.txt, as well as files with the extensions "", ".md", ".yml", ".sh", ".h" (a sketch of both heuristics follows this list).
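A hedged sketch of the two heuristics; the helper names are illustrative, and the ignore lists mirror the ones given above.

```python
# h1 checks the element name itself; h2 looks for the element name inside
# string constants of a parsed code file.
import ast
import os

IGNORED_FILES = {"README.md", "setup.py", "requirements.txt"}
IGNORED_EXTENSIONS = {"", ".md", ".yml", ".sh", ".h"}

def h1_name_contains_data(name):
    """h1: a non-code file or folder whose name contains 'data' stores data."""
    return "data" in name.lower()

def h2_mentioned_in_code(element_name, code_source):
    """h2: the non-code element name appears in a string literal of the code."""
    if element_name in IGNORED_FILES:
        return False
    if os.path.splitext(element_name)[1] in IGNORED_EXTENSIONS:
        return False
    return any(
        isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and element_name in node.value
        for node in ast.walk(ast.parse(code_source))
    )
```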
It may be interesting to find out whether the data is mentioned as an input or an output dataset by looking at the function that loads it. However, given the disparity and the huge number of ways to use data (custom functions, the built-in open(), library functions), this is a hard task that will need time to cover enough cases without biasing the results. We can use the libraries from RQ1 and check the functions each library provides to load datasets, along with the built-in open('filename', 'mode'), to determine whether the dataset is loaded as input or output.
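As a starting point for that idea, the sketch below inspects literal calls to the built-in open() and classifies the mentioned file from its mode argument. This is a deliberate simplification: modes such as 'r+' allow both reading and writing, and library loaders such as pandas.read_csv would need their own rules.

```python
# Map each literal filename passed to open() to a rough input/output label.
import ast

def open_call_roles(code_source):
    roles = {}
    for node in ast.walk(ast.parse(code_source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "open"
                and node.args
                and isinstance(node.args[0], ast.Constant)):
            filename = node.args[0].value
            mode = "r"  # open() defaults to read mode
            if len(node.args) > 1 and isinstance(node.args[1], ast.Constant):
                mode = node.args[1].value
            # Simplified: any mode containing 'r' is treated as input.
            roles[filename] = "input" if "r" in mode else "output"
    return roles
```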
The third RQ is to assess how the data files and the code evolve together. For each identified data file (a file from the repo identified as a data file according to the heuristics of RQ2), we collected all the commits modifying that file, C. We then looked at how many times each file appears in those commits. Say the file "data.csv" has been modified n=10 times (there are ten commits in which that file appears). For each commit C[i] in C, we checked all the files modified by that commit; let C[i][j] be the j-th file of commit C[i]. We then check for the presence of C[i][j] in all n commits and count how many times that file appears (see the sketch below).
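A minimal sketch of that counting step, assuming each commit in C is represented as the set of file paths it modifies.

```python
# For every file seen in the commits that touch a given data file, count
# in how many of those n commits it appears.
from collections import Counter

def co_change_counts(commits_touching_datafile):
    """commits_touching_datafile: list of sets of file paths, one set per commit."""
    counts = Counter()
    for files_in_commit in commits_touching_datafile:
        counts.update(files_in_commit)  # each file counted once per commit
    return counts

# Example: if "data.csv" was modified in n=10 commits and "train.py" appears
# in 7 of them, then co_change_counts(...)["train.py"] == 7.
```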
(Not yet done) We need a mathematical method combining the number of commits and the occurrence of the file in those commits to assess the threshold from which a file is considered correlated to another file in its evolution. It may also be interesting to check whether the reverse holds (whether a correlated to b implies b correlated to a).
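One possible measure, not yet adopted by the project: the co-change confidence conf(a -> b) = |commits touching both a and b| / |commits touching a|. It is asymmetric by construction, which makes the second question directly testable.

```python
# Hedged sketch: co-change confidence between two files over a repository's
# commits, each commit given as the set of file paths it modifies.
def confidence(file_a, file_b, commits):
    touching_a = [c for c in commits if file_a in c]
    if not touching_a:
        return 0.0
    both = sum(1 for c in touching_a if file_b in c)
    return both / len(touching_a)

# confidence("data.csv", "train.py", commits) and the reverse direction
# generally differ, e.g. when train.py changes far more often than data.csv.
```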