refactoring-ai / Data-Collection

Collect refactorings with metrics from java source code.
MIT License
6 stars 1 forks source link
appendix database dataset docker java machine-learning metrics mysql refactoring software-engineering

Machine Learning for Software refactoring

codecov

This repository contains the data-collection part on the use of machine learning methods to recommend software refactoring that collects refactoring and non-refactoring instances from java source code that are later used to train the ML algorithms with a large variety of metrics.

Quickstart

  1. Prepare a MariaDB instance and create a database with the name refactoring_ai (if you have docker a quick way is: docker run -p 127.0.0.1:3306:3306 --name some-mariadb -e MYSQL_ROOT_PASSWORD=root -d mariadb these are also the default credentials)
  2. Build jar with dependencies: ./gradlew quarkusBuild
  3. Define the projects to mine refactorings in input.csv
  4. Start mining: java -jar java -jar build/datacollection-0.1.0-runner.jar

Configuration

Configuration can be done with environment variables. You can also create an .env file with the variables. See .env.example for each variable and its explanation

Paper and appendix

The data collection tool

Dependencies

Database Clean-up

The enormous variety refactoring types, projects and programming styles in the mined repositories can lead to various issues with the data. Therefore, we explain two common problems and potential solutions here.

!MAKE A COPY OF YOUR DATA BEFORE ALTERING IT!

Remove unfinished projects

The data-collection tool can fail due to various reasons, e.g. OutOfMemoryErrors, unhandled Exceptions in the mining phase, in will thus not finish all projects. If you want to rerun the data-collection or remove unfinished projects for other reasons, you can do this with the following commands:

/*
Remove refactoring and stable instances, before the projects.
DELETE
    RefactoringCommit
FROM
    RefactoringCommit
    INNER JOIN 
        CommitMetaData ON RefactoringCommit.commitmetadata_id = CommitMetaData.id
    INNER JOIN 
        ClassMetric ON refactoringcommit.classMetrics_id = ClassMetric.id
    LEFT JOIN 
        ProcessMetrics ON refactoringcommit.processMetrics_id = ProcessMetrics.id
    LEFT JOIN 
        FieldMetric ON refactoringcommit.fieldMetrics_id = FieldMetric.id
    LEFT JOIN 
        MethodMetric ON refactoringcommit.methodMetrics_id = MethodMetric.id
    LEFT JOIN 
        VariableMetric ON refactoringcommit.variableMetrics_id = VariableMetric.id
WHERE
    RefactoringCommit.project_id IN 
    (SELECT id FROM project WHERE finishedDate IS NULL);

DELETE
    StableCommit
FROM
    StableCommit
    INNER JOIN 
        CommitMetaData ON StableCommit.commitmetadata_id = CommitMetaData.id
    INNER JOIN 
        ClassMetric ON StableCommit.classMetrics_id = ClassMetric.id
    LEFT JOIN 
        ProcessMetrics ON StableCommit.processMetrics_id = ProcessMetrics.id
    LEFT JOIN 
        FieldMetric ON StableCommit.fieldMetrics_id = FieldMetric.id
    LEFT JOIN 
        MethodMetric ON StableCommit.methodMetrics_id = MethodMetric.id
    LEFT JOIN 
        VariableMetric ON StableCommit.variableMetrics_id = VariableMetric.id
WHERE
    StableCommit.project_id IN 
    (SELECT id FROM project WHERE finishedDate IS NULL);

DELETE
    project
FROM 
    project 
WHERE 
    finishedDate IS NULL;
*/

Invalid Refactorings

Refactoring-miner is a great tool and quite stable, but not perfect. Thus, in some occasions it detects enormous numbers of refactorings for single class files, these seem to be in-correct and should be marked in-valid.

/*
Mark all in-valid refactorings in the RefactoringCommit table. This allows you manually inspect potentially in-valid refactorings and decide how to handle them.
In the last line you specify the threshold of refactorings on the same commit and the same class file to be considered in-valid.
*/

UPDATE 
    RefactoringCommit
SET 
    isValid = FALSE
WHERE
    commitMetaData_id IN
    (SELECT Distinct
        commitMetaData_id
    FROM
        RefactoringCommit
    GROUP BY
        commitMetaData_id, className
    HAVING
        COUNT(className) >= 50);

Authors

This project was initially envisioned by Maurício Aniche, Erick Galante Maziero, Rafael Durelli, and Vinicius Durelli.

License

This project is licensed under the MIT license.