minosvasilias / godot-dodo

Finetuning large language models for GDScript generation.
MIT License

Other models to fine-tune #12

Open viktor-ferenczi opened 1 year ago

viktor-ferenczi commented 1 year ago

This is not really an issue, more a bunch of suggestions and observations.

While experimenting with openly available LLMs I found the following models useful for coding:

They were quite competent at coding (compared to other openly available LLMs) even when loaded as 8-bit GGML.

I think fine-tuning the new Airoboros-l2 70B may be worth a try. However, its license is ambiguous, so be careful.

Alternatively, fine-tuning WizardCoder for Godot may yield a smaller (more usable) model of surprisingly good quality.

The author of Airoboros shares some very good insights on the HF page of his Airoboros-65B model. It is also worth reading the paper behind the WizardCoder model: they came up with completely synthetic training data without having to use another LLM to produce it, which is good given licensing limitations.

viktor-ferenczi commented 1 year ago

Further observations:

A good prompt for a non-trivial coding test is the following:

You are an expert Python developer. 
Your task is to write a Python 3 function to identify duplicate files in a folder and return a summary of them.

Requirements:
- At any depth in the subdirectory structure.
- Two files are duplicates if they have the same size and contents.
- Optimization: Files with a unique size can be skipped, because they must be unique.
- File contents can be checked based on their SHA256 hashes (checksums).
- Do not read whole files into memory, calculate the hash in 32kB chunks.
- The risk of a hash collision is acceptable in this use case.
- Must find all duplicate files.
- Must NOT delete any files.
- The return value of the function must be a dictionary where keys are the file size and checksum in a tuple, values are the list of paths.
- The solution must work on both Windows and UNIX (Linux, MAC).

Further instructions:
- Add only very concise comments into the code wherever it is absolutely necessary.
- Keep the code in each function short and as simple as possible.
- Avoid deep nesting of flow control.
- Avoid assigning variables which are not used afterwards.
- Structure the code to be very easy to read and understand by humans.
- You are an expert developer, you can code this simple task very well.
- Add type hints to all function parameters, return values and variables.
- Provide only the code and nothing else.
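The requirements above fully pin down the function's behavior, so it is worth showing what a passing answer looks like. Below is a minimal sketch satisfying them (the names `find_duplicates` and `sha256_of_file` are my own, not from the thread): group files by size first, skip unique sizes, then confirm duplicates via chunked SHA256 hashing.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 32 * 1024  # hash files in 32 kB chunks, per the requirements


def sha256_of_file(path: str) -> str:
    """Compute the SHA256 hex digest of a file without reading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder: str) -> dict[tuple[int, str], list[str]]:
    """Group duplicate files under folder by (size, SHA256 checksum)."""
    # First pass: bucket every file path by its size.
    by_size: dict[int, list[str]] = defaultdict(list)
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)

    # Second pass: hash only files whose size is shared; a unique size
    # means the file must be unique, so those buckets are skipped.
    duplicates: dict[tuple[int, str], list[str]] = {}
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        by_hash: dict[str, list[str]] = defaultdict(list)
        for path in paths:
            by_hash[sha256_of_file(path)].append(path)
        for checksum, group in by_hash.items():
            if len(group) > 1:
                duplicates[(size, checksum)] = group
    return duplicates
```

Using `os.walk` and `os.path` keeps the code portable across Windows and UNIX without any platform-specific branches, which covers the last requirement for free.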

What would be a good coding test for Godot 4's GDScript?

minosvasilias commented 1 year ago

Hey, thanks for the comprehensive thoughts and feedback, this is good stuff to read. Let me go through these one by one.

Regarding other models, I am definitely interested in WizardCoder in particular. The reason that finetune hasn't happened yet is a mix of me not having had much time recently (due to a job change) and the fact that I think the biggest gains in this project will, at this point, come not from better models but from better data. At least for now.

I still think the general idea of annotating human-created code for instruct datasets is a good one, but the current dataset has some very obvious drawbacks. The two biggest problems I believe are:

So these are issues I wanted to address before doing another training run. I've been working on this a bit and have some approaches I've been testing to fix the scope issue in particular. But yes, WizardCoder would most likely be the next model to test (until something more interesting comes out).

Agree with all your other points. Especially training on shader code; that is something I'm very interested in as well (and not just for Godot). I'm thinking this should probably be a separate model, considering it's another language.

One problem with shader code is that I found it much harder to generate accurate instructions for it. Since the logic (most of the time) serves to create a visual, and real-world instructions for such a model would likely be things like "create a cel-shaded water shader", I think a different approach to instruction generation is needed here. I've been experimenting with image-to-text models, which I think works well, though the best ones tend not to be open-source, unfortunately, limiting the quality of descriptions a bit. But I've done some work on it already and will publish something if I can complete it and get some decent results.

And regarding the coding test: I didn't really want to rely on a single datapoint, which is why I went with the 50 different tests in my evaluation set. It's not perfect by any means, but it already averages out a lot of the bias you would otherwise find. As for the rules present in your prompt: the most common failure cases I found in GPT models that could be reduced with that kind of detailed prompting were: