Software Code Bug-Fixing using Unsupervised Learning and Error Detection

Aayush121202 commented 2 months ago

Title

Generative AI based Software Metadata Classification

Team Name

Data Pirates

Email

202101452@daiict.ac.in

Team Member 1 Name

Aayush Patel

Team Member 1 Id

202101452

Team Member 2 Name

Pranav Patel

Team Member 2 Id

202103040

Team Member 3 Name

Vatsal Shah

Team Member 3 Id

202103022

Team Member 4 Name

Kalp Shah

Team Member 4 Id

202103003

Problem Statement

A binary code comment quality classification model needs to be augmented with generated code and comment pairs that can improve the accuracy of the model.

Evaluation Strategy

Evaluated based on the % increase in F1 score from baseline and the quality of data generated.

Dataset

Seed Dataset: A dataset of code and comment pairs: https://drive.google.com/file/d/17caOWv0F_0W7q9IHnMP_uGEuzSs1am0h/view

Resources

Paper Title: Software Metadata Classification based on Generative Artificial Intelligence
Paper Link: https://arxiv.org/pdf/2310.13006

parth126 commented 2 months ago

The problem might be over-dependent on LLMs and infeasible because of a lack of compute.
One possibility is fine-tuning a small Llama model (something that runs on a laptop) for a specific task.
Todos: Look for existing datasets and more papers in this area, and contact the IRSE team for the Track 2 data.

parth126 commented 1 month ago

The problem has been changed to fixing errors in code using a seq2seq model. Reference papers:

Break-It-Fix-It: Unsupervised Learning for Program Repair
An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation

The team is supposed to add the following details:

Availability of the original code used in the paper.
Availability of the perturbed code used in the paper, or of an existing implementation for introducing the perturbations.
Evaluation metrics.
Compute requirements.

Aayush121202 commented 1 month ago

Sir, we have changed our topic as per your suggestion:

New Problem Statement: Software Code Bug-Fixing through Unsupervised Learning and Error Detection

Project Explanation: The project centers on the Break-It-Fix-It (BIFI) algorithm, which automates the cycle of breaking and fixing code. The algorithm has two main parts: the fixer and the critic. The fixer is a model that is first trained to repair synthetically corrupted code and then improves gradually through feedback from the critic, which is a compiler or code analyzer. The critic evaluates the fixer's output and determines whether the repaired code is error-free. Over time, this process lets the fixer learn from real-world errors and become more accurate at repairing code without needing labeled data. The method supports automated bug fixing and has practical applications in software development and education.
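To make the fixer/critic cycle concrete, here is a minimal sketch of one round, not the paper's implementation. It assumes Python snippets, uses `ast.parse` as a stand-in critic, and uses a hypothetical `toy_fixer` in place of a trained seq2seq model. The names `critic`, `toy_fixer`, and `collect_verified_pairs` are ours, chosen only for illustration.

```python
import ast
from typing import Callable, List, Tuple


def critic(code: str) -> bool:
    """Stand-in critic: True if the snippet parses as valid Python."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def toy_fixer(code: str) -> str:
    """Hypothetical placeholder for the trained fixer model.

    It only appends missing closing parentheses, so it can repair
    just one narrow kind of error.
    """
    missing = code.count("(") - code.count(")")
    return code + ")" * max(missing, 0)


def collect_verified_pairs(
    bad_snippets: List[str], fixer: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Run the fixer on real bad code and keep only outputs the critic accepts.

    The returned (bad, fixed) pairs could then be used as extra training
    data for the next round of fixer training.
    """
    pairs = []
    for bad in bad_snippets:
        candidate = fixer(bad)
        if critic(candidate):
            pairs.append((bad, candidate))
    return pairs


bad_code = ["print((1 + 2)", "x = (1,", "def f(:\n  pass"]
verified = collect_verified_pairs(bad_code, toy_fixer)
print(len(verified), "of", len(bad_code), "repairs accepted by the critic")
```

The third snippet stays broken after the toy fixer runs, so the critic rejects it and it contributes no training pair. Filtering through the critic in this way is what removes the need for hand-labeled fixes.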

Evaluation Strategy: We will evaluate the model on repair accuracy, calculated as the proportion of bad code snippets for which the fixer's output compiles without errors. We will also assess the model's performance across specific error types (e.g., syntax errors, indentation errors) using per-class F1 scores.
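A rough sketch of how these two metrics could be computed is given below. It assumes that each repaired snippet has a boolean "compiles" flag from the critic, and that per-class F1 is computed over true versus predicted error-type labels. The labelling setup is our assumption, and the function names are illustrative.

```python
from collections import Counter
from typing import Dict, Sequence


def repair_accuracy(compiles: Sequence[bool]) -> float:
    """Fraction of repaired snippets that compile without errors."""
    return sum(compiles) / len(compiles) if compiles else 0.0


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 score for each error type, from true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Made-up values, only to show the calling convention.
acc = repair_accuracy([True, True, False, True])
f1 = per_class_f1(
    ["syntax", "syntax", "indent", "indent"],
    ["syntax", "indent", "indent", "indent"],
)
print(acc, f1)
```

The F1 here uses the identity F1 = 2TP / (2TP + FP + FN), which is equivalent to the harmonic mean of precision and recall.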

Dataset: The project will use the two main datasets referenced in the paper:
GitHub-Python: A dataset of 3 million Python code snippets, with 38,000 examples of bad code.
DeepFix: A dataset of C code submitted by students, consisting of 7,000 bad examples and 37,000 good examples.
The dataset of good and bad code is available here: https://worksheets.codalab.org/bundles/0x5eb0135755464c66bf3c398f43f634e0
The full dataset is too large for our compute, so we will extract a limited subset of code snippets from it to train the model.
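One possible way to draw a reproducible subset without loading the whole dataset into memory is reservoir sampling, sketched below. The on-disk format of the dataset is not assumed here: the function takes any iterable of snippets, and `reservoir_sample` and the seed value are illustrative choices of ours.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace an existing entry
    return sample


# Toy stream standing in for the real snippets.
stream = (f"snippet_{n}" for n in range(10_000))
subset = reservoir_sample(stream, k=100, seed=42)
print(len(subset))
```

Because the stream is read once and only k items are held at a time, the same approach should work when iterating over the dataset file by file.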

Resources: Research Paper: Break-It-Fix-It: Unsupervised Learning for Program Repair, https://arxiv.org/pdf/2106.06600

We look forward to your feedback.

parth126 commented 1 month ago

Looks good. Marking this as approved.

parth126 / IT550