
Machine Learning for Software Refactoring


This repository contains all our explorations of the use of machine learning methods to recommend software refactoring.

It currently contains two projects, described below: the data collection tool and the machine learning pipeline.

Paper and appendix

The data collection tool

Compiling the tool

Use Maven: mvn clean compile. Or just import it via IntelliJ; it will know what to do.

If you want to export a jar file and run it somewhere else, just do mvn clean package. A .jar file will be created under the target/ folder. You can use this jar to run the tool manually.

To run the tests, start a local MariaDB database instance; see src/main/test/java/integration/DataBaseInfor for details.
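
If you do not have MariaDB installed, one option is a throwaway Docker container. This is only a sketch: the container name, database name, and password below are placeholders and must match whatever DataBaseInfor expects:

docker run -d --name refactoring-test-db -p 3306:3306 \
  -e MYSQL_ROOT_PASSWORD=<root-password> \
  -e MYSQL_DATABASE=<database-name> \
  mariadb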

Running the tool manually

You can run the data collection by simply running the RunSingleProject.java class. It requires the following parameters, in this order:

  1. The dataset name: A hard-coded string with the name of the dataset (e.g., "apache", "fdroid"). This information appears in the generated data later on, so that you can use it as a filter.

  2. The git URL: The git URL of the project to be analyzed. Your local machine must have all the permissions to clone it (i.e., git clone <url> should work). Cloning will happen in a temporary directory.

  3. Storage path: The directory where the tool stores the source code before and after each refactoring. This step is important if you plan to do later analysis on the refactored files. The directory structure basically contains the hash of the refactoring, as well as the file before and after the change. The file name also encodes the refactoring it underwent, to facilitate parsing. For more details on the file naming, see our implementation.

  4. Database URL: The JDBC URL pointing to your MySQL database. The database must exist and be empty. The tool will create the required tables.

  5. Database user: Database user.

  6. Database password: Database password.

  7. Store full source code?: True if you want to store the source code before and after in the storage path.

These parameters can be passed via the command line if you exported a JAR file. Example:

java -jar refactoring.jar <dataset> <git-url> <storage-path> <database-url> <database-user> <database-password> <store-full-source-code>
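
For instance, a concrete invocation might look like this (all values are illustrative; substitute your own repository, paths, and credentials):

java -jar refactoring.jar apache https://github.com/apache/commons-lang.git /tmp/refactoring-storage jdbc:mysql://localhost:3306/refactoringdb refactoring-user refactoring-pwd true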

Running via Docker

The data collection tool can be executed via Docker containers. It should be as easy as:

git clone https://github.com/mauricioaniche/predicting-refactoring-ml.git
cd predicting-refactoring-ml
sudo ./run-data-collection.sh projects-final.csv 4

The script can be configured with the following arguments:

  1. [FILE_TO_IMPORT] - CSV file with all projects
  2. [Worker_Count] - Number of concurrent workers for the data collection, each running the RunQueue class
    • Optional (see the example after this list):
      1. [DB_URL] - Fully qualified URL of a custom database, e.g., jdbc:mysql://db:3306/refactoringdb
      2. [DB_USER] - User name for the custom database
      3. [DB_PWD] - Password for the custom database
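
For example, pointing the workers at a custom database could look like this (argument order follows the list above; the credentials are illustrative):

sudo ./run-data-collection.sh projects-final.csv 4 jdbc:mysql://db:3306/refactoringdb refactoring-user refactoring-pwd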

Tip: If you are restarting everything, make sure not to import the projects again; otherwise, you will end up with duplicate entries. Simply leave the file name blank in import -> environment -> FILE_TO_IMPORT.

Cleaning up the final database

When running at scale, e.g., on thousands of projects, some projects might fail due to, e.g., an AST parser failure. Although the app tries to remove problematic rows, if the process dies completely (e.g., out of memory on your machine), partially processed projects will remain in the database.

We use the following queries to remove half-baked projects completely from our database:

delete from RefactoringCommit where project_id in (select id from project where finishedDate is null);
delete from StableCommit where project_id in (select id from project where finishedDate is null);
delete from project where finishedDate is null;
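
A related sanity check, e.g., to see how many unfinished projects are pending removal, could be the following (assuming the same schema as the queries above):

select count(*) from project where finishedDate is null;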

The machine learning pipeline

This project contains all the Python scripts that are responsible for the ML pipeline.

Installing dependencies and configuring the database

First, install all the dependencies:

pip3 install --user -r requirements.txt

Then, create a config.ini file, following the example structure in config-example.ini. In this file, you configure your database connection.
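
As a rough sketch, such a file might look like the following; note that the section and key names here are hypothetical, so copy the real ones from config-example.ini:

[database]
# hypothetical keys -- use config-example.ini as the authoritative template
host = localhost
port = 3306
user = refactoring-user
password = refactoring-pwd
database = refactoringdb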

Training and testing models

The main Python script that generates all the models and results is main_generate_models.py. You run it by simply calling:

python3 main_generate_models.py

The script follows the configurations in configs.py. There, you can define which datasets to analyze, which models to build, which undersampling algorithms to use, and so on. Please read the comments in this file.

For this script to run, you need to create a results/ folder inside the machine-learning folder. The results will be stored there.

The generated output is a loosely structured text file. A quick way to get results out of it is by grepping:
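
For instance, something along these lines (the pattern is illustrative; adapt it to whatever metric you are looking for):

grep -n "Precision" results/*.txt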

Before running the pipeline, we suggest warming up the cache. Warming up basically executes all the required queries and caches the results as CSV files. These queries can take a long time to run... and if you are like us, you will most likely re-execute your experiments many times! :) Thus, having them cached helps:

python3 warm_cache.py

If you need to clean up the cache, simply delete the _cache directory that is created under the machine-learning folder.
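
That is, assuming you are at the repository root:

rm -rf machine-learning/_cache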

Cross-domain validation

The main_cross_models_evaluation.py script performs pairwise comparisons between two different datasets. In other words, it tests all models that were trained on a given dataset A against the data of a given dataset B. For example, it tests how Apache models behave on F-Droid data.
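
Like the main script, it is run directly (we assume here that it takes no extra arguments and is also driven by configs.py):

python3 main_cross_models_evaluation.py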

For this script to run, you need to create a results/ folder inside the machine-learning folder. The final CSV will be stored there; its header is self-explanatory.

Other scripts

The main_parse_* scripts help you parse the resulting text file and extract the best hyperparameters, the feature importances, and the individual k-fold results. (These are always under maintenance, as they are highly sensitive to the log format the main tool generates.)

Authors

This project was initially envisioned by Maurício Aniche, Erick Galante Maziero, Rafael Durelli, and Vinicius Durelli.

License

This project is licensed under the MIT license.