
Fine-tuning Code Llama for multi-file code generation on a private repository #96

Open animeshj9 opened 1 year ago

animeshj9 commented 1 year ago

Hello. I'm trying to fine-tune Code Llama for a multi-file code-generation task on my private repository. The goal is to have the LLM generate fixes for some common bugs/issues that span multiple files in the repository.

Based on what I have understood so far, doing this will require multiple stages of training/fine-tuning. I have read the Code Llama paper and am trying to build my own "specialization pipeline" for my repository and tasks.

1) The first fine-tuning pass gives the model some comprehension of the repository structure: file paths, a summary of what each file does, and the code itself. This pass would cover 100% of the code, and the goal would be to deliberately overfit; here we would track only training loss, with no evaluation or test sets. (A rough sketch of how such a dataset might be assembled follows this list.)

2) Once the model has some comprehension of the repository structure, a second, task-specific fine-tuning pass can be done on a much smaller dataset. For example, each record could contain the issue description, the old code, and the refactored code as fields. Here we would track training loss, evaluation loss, and test results to measure the model's performance. (A sketch of this record format appears further below.)
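
Concretely, for stage 1 I am imagining something like the following: one JSONL record per source file, with the relative path embedded in the training text. This is a rough sketch, not a final pipeline; the repository root, the extension filter, and the record format are all placeholders to adapt:

```python
import json
from pathlib import Path

REPO_ROOT = Path("path/to/private/repo")  # placeholder: repo checkout location
EXTENSIONS = {".py", ".java", ".ts"}      # placeholder: languages in the repo

def build_stage1_records(root: Path):
    """Yield one record per source file, with the relative path embedded."""
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in EXTENSIONS:
            continue
        code = path.read_text(encoding="utf-8", errors="ignore")
        # Embedding the path is what should let the model associate
        # code (and later, fixes) with locations in the repository.
        yield {"text": f"# File: {path.relative_to(root)}\n{code}"}

with open("stage1_repo.jsonl", "w", encoding="utf-8") as f:
    for record in build_stage1_records(REPO_ROOT):
        f.write(json.dumps(record) + "\n")
```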

The reason I want to do it this way is that while the fixes (the corrected code) are common, the files in which the code has to change can differ. So the model needs some understanding of which files are present in the repository.
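
For stage 2, the records might be rendered into prompt/completion pairs along these lines. The field names and the template are purely illustrative; the key idea is that the target includes the file path, so the model has to learn where the fix belongs as well as what it is:

```python
import json

# Placeholder template and field names; adapt to the real issue-tracker data.
PROMPT_TEMPLATE = (
    "Issue:\n{issue}\n\n"
    "Buggy code:\n{old_code}\n\n"
    "Rewrite the code to fix the issue and name the file to change.\n"
)

def to_training_pair(example: dict) -> dict:
    prompt = PROMPT_TEMPLATE.format(**example)
    # The completion names the file first, so predicting *where* the fix
    # goes is part of the task, not just producing the fixed code.
    completion = f"File: {example['file_path']}\n{example['new_code']}"
    return {"prompt": prompt, "completion": completion}

# Illustrative record, not real data from the repository.
example = {
    "issue": "Missing default when reading the timeout from config",
    "file_path": "src/utils/config_loader.py",
    "old_code": "timeout = config['timeout']",
    "new_code": "timeout = config.get('timeout', DEFAULT_TIMEOUT)",
}

with open("stage2_task.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(to_training_pair(example)) + "\n")
```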

Does this approach sound reasonable and feasible? Are there alternative ways of doing this? If so, could you point me to some resources I can read and learn from?

Thanks.

shatealaboxiaowang commented 12 months ago

Same issue.