For instructions on how to run the extraction, please read the help.txt file.
This project aims to mine GitHub machine learning repositories in order to understand how they deal with datasets. It is part of a bigger project with three research questions (RQ1-RQ3), discussed below.
We analysed repositories from the Paper of Code project, which pairs machine learning research papers with their corresponding GitHub repository. We then filtered them to keep only repos written in Python.
a) Downloaded the Paper of Code files (JSON).
b) Extracted, through the GitHub API, each repository's main language and number of commits, in order to keep the repositories with Python as main language (a sketch of the API calls is shown after step e below).
c) Filtered by removing duplicates and projects that didn't have Python 3 as their main language (we focused on Python 3 only because the Python 3 AST is not compatible with the Python 2 AST). Duplicates are projects that share the same GitHub link; these are usually different versions of the paper still linked to the same GitHub repository.
d) The result is a CSV file with the headers id (unique id of the repository), link (GitHub link of the repo), and nb_commits (total commits in the repository).
e) For each commit, we extracted information we consider relevant for the analysis and stored it in a PostgreSQL database. The tables below describe the schema; a sketch of the commit extraction itself follows the tables.
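The metadata lookup in step b) relies on the GitHub REST API. The sketch below is a minimal illustration rather than the actual extraction script (see help.txt for that): it reads the repository's primary language from the repos endpoint and derives the commit count from the pagination Link header.

```python
# Hedged sketch of step b): query the GitHub REST API for a repository's
# main language and total number of commits. Endpoints are the public
# api.github.com ones; the commit count is inferred from the last page
# number of the paginated commits listing.
import re
import requests

API = "https://api.github.com"

def repo_metadata(owner, repo, token=None):
    headers = {"Authorization": f"token {token}"} if token else {}

    # Primary language reported by GitHub for the repository.
    info = requests.get(f"{API}/repos/{owner}/{repo}", headers=headers).json()
    language = info.get("language")

    # Ask for one commit per page and read the last page number from the
    # Link header: with per_page=1 it equals the total number of commits.
    resp = requests.get(f"{API}/repos/{owner}/{repo}/commits",
                        params={"per_page": 1}, headers=headers)
    match = re.search(r'page=(\d+)>; rel="last"', resp.headers.get("Link", ""))
    nb_commits = int(match.group(1)) if match else len(resp.json())

    return language, nb_commits

# Example call (hypothetical repository):
# print(repo_metadata("someuser", "some-ml-repo"))
```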
Files changed by each commit

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer GENERATED BY DEFAULT AS IDENTITY | |
| * | file_id | integer | |
|   | change_type | varchar | |
|   | commit_id | integer | |
Commits of the analysed repositories

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer GENERATED BY DEFAULT AS IDENTITY | |
| * | repo_id | integer | |
| * | sha | varchar | |
|   | commit_date | timestamp | |
|   | author_name | varchar | |
|   | author_email | varchar | |
|   | total_modifs | integer | |
Elements identified as datasets by the heuristics (RQ2)

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer GENERATED BY DEFAULT AS IDENTITY | |
| * | element_id | integer | |
| * | heuristic | varchar(2) | The heuristic used to identify the element as a dataset |
| * | file_mention | integer | The file where the dataset is loaded |
| * | repo_id | integer | |
Information about files and folders in the repositories

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer GENERATED BY DEFAULT AS IDENTITY | |
|   | name | varchar(500) | |
|   | is_code_file | bool | True if the file contains code, False otherwise |
|   | ast | json | JSON AST of the file's code |
|   | repo_id | integer | |
| * | is_folder | bool | True if the element is a folder, False if it is a file |
|   | extension | varchar | |
|   | imports | text | List of libraries imported in the element (file) |
Repositories to analyse

| Idx | Field Name | Data Type | Description |
|---|---|---|---|
| * | id | integer | |
| * | link | varchar(5000) | GitHub link |
|   | nb_commits | integer | Total commits in the repository |
|   | name | varchar(500) | |
|   | folder_name | varchar(500) | |
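To illustrate step e), here is a hedged sketch of how the commit information could be collected with GitPython and written to the commits table described above. The table name, the column mapping, and the connection string are assumptions; the real extraction script may differ.

```python
# Hedged sketch of step e): walk a cloned repository with GitPython and
# insert one row per commit into a PostgreSQL table shaped like the
# commits schema above. Table name and DSN are placeholders.
from git import Repo
import psycopg2

def store_commits(repo_path, repo_id, dsn="dbname=mining user=postgres"):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()

    for commit in Repo(repo_path).iter_commits():
        cur.execute(
            """INSERT INTO commits
               (repo_id, sha, commit_date, author_name, author_email, total_modifs)
               VALUES (%s, %s, %s, %s, %s, %s)""",
            (repo_id,
             commit.hexsha,
             commit.committed_datetime,
             commit.author.name,
             commit.author.email,
             commit.stats.total["files"]),  # assumed meaning of total_modifs: files touched
        )

    conn.commit()
    cur.close()
    conn.close()
```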
f) For RQ3, it may be interesting to use the number of commits as an additional filtering criterion: remove projects with fewer than x commits so as to keep only projects with a sufficient amount of history (a sketch follows the explanation below).
For confidentiality reasons, researchers sometimes develop their code outside GitHub or in a private repository; once the paper is published, they push all the code to GitHub in one or a few commits.
This removes all the metadata associated with the project's evolution and makes the project impossible to mine.
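A minimal sketch of the filtering in f), assuming the CSV produced in step d) with columns id, link and nb_commits; the file names and the threshold x are placeholders.

```python
# Hedged sketch of step f): keep only repositories with at least x commits.
import pandas as pd

MIN_COMMITS = 10  # placeholder for the threshold x, still to be chosen

repos = pd.read_csv("repositories.csv")  # columns: id, link, nb_commits
filtered = repos[repos["nb_commits"] >= MIN_COMMITS]
filtered.to_csv("repositories_filtered.csv", index=False)
```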
Given that there are several ways to store data and that GitHub keeps track of changes to a project through its files, we study how data files are stored and how they can impact the evolution of the repository.
Using the Python ast module, we extracted the imported libraries from each repository, then analysed how these libraries are used across all the repositories using association rule mining.
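A minimal sketch of the import extraction, using only the standard ast module; the function name is illustrative and the real script may differ.

```python
# Parse a Python 3 source file and collect the top-level names of all
# import / from ... import statements.
import ast

def extract_imports(path):
    with open(path, encoding="utf-8") as fh:
        tree = ast.parse(fh.read())

    libraries = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            libraries.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            libraries.add(node.module.split(".")[0])
    return sorted(libraries)

# Example: extract_imports("train.py") might return ["numpy", "pandas", "torch"]
```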
We developed heuristics to identify datasets:
1) h1: file and folder names: any non-code file or folder whose name contains the string *data* is considered to store data files.
2) h2: a non-code file or folder name is loaded in the code: using the Python ast module, we find all mentions of non-code file and folder names in code files, ignoring standard files such as README.md, setup.py, and requirements.txt, as well as files with the extensions "", ".md", ".yml", ".sh", ".h" (a sketch of both heuristics follows this list).
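A hedged sketch of the two heuristics; the helper names are illustrative, and the ignore lists mirror the ones given above.

```python
# h1 checks the element name itself; h2 looks for the element name inside
# string constants of a parsed code file.
import ast
import os

IGNORED_FILES = {"README.md", "setup.py", "requirements.txt"}
IGNORED_EXTENSIONS = {"", ".md", ".yml", ".sh", ".h"}

def h1_name_contains_data(name):
    """h1: a non-code file or folder whose name contains 'data' stores data."""
    return "data" in name.lower()

def h2_mentioned_in_code(element_name, code_source):
    """h2: the non-code element name appears in a string literal of the code."""
    if element_name in IGNORED_FILES:
        return False
    if os.path.splitext(element_name)[1] in IGNORED_EXTENSIONS:
        return False
    return any(
        isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and element_name in node.value
        for node in ast.walk(ast.parse(code_source))
    )
```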
It may be interesting to find out whether the data is mentioned as an input or an output dataset by looking at the function that loads it. However, given the disparity and the huge number of ways to use data (custom functions, the built-in open(), library functions), this is a hard task that will need time to cover enough cases without biasing the results. We can use the libraries from RQ1 and check the functions each library provides to load datasets, along with the built-in open('filename', 'mode'), to determine whether the dataset is loaded as input or output.
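As a starting point for that idea, the sketch below inspects literal calls to the built-in open() and classifies the mentioned file from its mode argument. This is a deliberate simplification: modes such as 'r+' allow both reading and writing, and library loaders such as pandas.read_csv would need their own rules.

```python
# Map each literal filename passed to open() to a rough input/output label.
import ast

def open_call_roles(code_source):
    roles = {}
    for node in ast.walk(ast.parse(code_source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "open"
                and node.args
                and isinstance(node.args[0], ast.Constant)):
            filename = node.args[0].value
            mode = "r"  # open() defaults to read mode
            if len(node.args) > 1 and isinstance(node.args[1], ast.Constant):
                mode = node.args[1].value
            # Simplified: any mode containing 'r' is treated as input.
            roles[filename] = "input" if "r" in mode else "output"
    return roles
```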
The third RQ is to assess how the data files and the code evolve together. For each identified data file (a file from the repo identified as a data file according to the heuristics of RQ2), we collected all the commits modifying that file, C. We then looked at how many times each file appears in those commits. Say the file "data.csv" has been modified n=10 times (there are ten commits in which that file appears). For each commit C[i] in C, we checked all the files modified by that commit; let C[i][j] be the j-th file of commit C[i]. We then check for the presence of C[i][j] in all n commits and count how many times that file appears (see the sketch below).
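A minimal sketch of that counting step, assuming each commit in C is represented as the set of file paths it modifies.

```python
# For every file seen in the commits that touch a given data file, count
# in how many of those n commits it appears.
from collections import Counter

def co_change_counts(commits_touching_datafile):
    """commits_touching_datafile: list of sets of file paths, one set per commit."""
    counts = Counter()
    for files_in_commit in commits_touching_datafile:
        counts.update(files_in_commit)  # each file counted once per commit
    return counts

# Example: if "data.csv" was modified in n=10 commits and "train.py" appears
# in 7 of them, then co_change_counts(...)["train.py"] == 7.
```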
(Not yet done) We need a mathematical method combining the number of commits and the occurrence of the file in those commits to assess the threshold from which a file is considered correlated to another file in its evolution. It may also be interesting to check whether the reverse holds (whether a correlated to b implies b correlated to a).
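One possible measure, not yet adopted by the project: the co-change confidence conf(a -> b) = |commits touching both a and b| / |commits touching a|. It is asymmetric by construction, which makes the second question directly testable.

```python
# Hedged sketch: co-change confidence between two files over a repository's
# commits, each commit given as the set of file paths it modifies.
def confidence(file_a, file_b, commits):
    touching_a = [c for c in commits if file_a in c]
    if not touching_a:
        return 0.0
    both = sum(1 for c in touching_a if file_b in c)
    return both / len(touching_a)

# confidence("data.csv", "train.py", commits) and the reverse direction
# generally differ, e.g. when train.py changes far more often than data.csv.
```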