minosvasilias / godot-dodo

Finetuning large language models for GDScript generation.
MIT License

Practical scope requirements #1

Open · StoneCypher opened this issue 1 year ago

StoneCypher commented 1 year ago

How much code in micro-language Foo do you actually need to train one of these?

minosvasilias commented 1 year ago

The dataset used for the provided weights was 60k rows. Each scraped script is split into individual functions, which is an easy and reliable way to chunk code, so one function = one entry.

In practice, this resulted in 762 repositories being parsed for the training data; see godot_dodo_4x_60k_repos.json.
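
For illustration, here is a minimal sketch of that function-level splitting, assuming each top-level function begins with an unindented `func` (or `static func`) keyword and runs until the next unindented line. The actual scraping scripts in this repo may work differently; the file paths and helper names below are hypothetical.

```python
import json
import re
from pathlib import Path


def split_gdscript_functions(source: str) -> list[str]:
    """Split a GDScript file into top-level functions (one function = one entry)."""
    functions, current = [], None
    for line in source.splitlines():
        if re.match(r"^(static\s+)?func\s+\w+", line):
            # A new unindented function definition starts here.
            if current:
                functions.append("\n".join(current).rstrip())
            current = [line]
        elif current is not None:
            if line.startswith((" ", "\t")) or line.strip() == "":
                # Indented or blank lines belong to the current function body.
                current.append(line)
            else:
                # Any other unindented line ends the current function.
                functions.append("\n".join(current).rstrip())
                current = None
    if current:
        functions.append("\n".join(current).rstrip())
    return functions


def build_dataset(script_dir: str, out_path: str) -> None:
    """Write one JSON row per extracted function."""
    rows = []
    for path in Path(script_dir).rglob("*.gd"):
        for func in split_gdscript_functions(path.read_text(errors="ignore")):
            rows.append({"source_file": str(path), "output": func})
    Path(out_path).write_text(json.dumps(rows, indent=2))


if __name__ == "__main__":
    build_dataset("scraped_repos", "dataset.json")  # hypothetical paths
```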

StoneCypher commented 1 year ago

If you were asked to stick your thumb in the air and guess, what would you expect a lower bound for practical success to be?

My language is nowhere near that common.

minosvasilias commented 1 year ago

I would say the lower bound of dataset sizes I've seen for LLaMA finetunes in general (not code-specific) sits around 15-20k rows.

I initially trained a 7B model on 20k rows to judge whether or not this project was worth pursuing, but I don't have any evaluations for that one. Still, it showed good enough results to continue, so that is the sort of minimum I'd be looking at.