We would like to evaluate model performance for various LLM fine-tuning approaches and compare them against standard benchmarks. An experiment we would like to try is:
Compare the full Cartesian product of fine-tuning approaches for the Granite model (medium model) across the relevant combinations:
{small, medium, large models} x {no pre-training + full supervised training, full supervised fine-tuning, LoRA, RAG, LoRA + RAG, etc.} x {synthetic data, no synthetic data}. We can omit combinations that are not relevant for our use case; a small enumeration sketch is shown below.
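A minimal sketch of how we might enumerate this experiment grid, assuming illustrative configuration names (the model identifiers, method labels, and exclusion list below are hypothetical placeholders, not our actual configuration):

```python
# Hypothetical sketch: enumerate the experiment grid described above.
# All names here are illustrative assumptions, not agreed-upon identifiers.
from itertools import product

model_sizes = ["small", "medium", "large"]
training_methods = [
    "scratch_supervised",  # no pre-training + full supervised training
    "full_sft",            # full supervised fine-tuning
    "lora",
    "rag",
    "lora+rag",
]
data_variants = ["synthetic", "no_synthetic"]

# Placeholder set of combinations we judge irrelevant for our use case.
EXCLUDED = {("large", "scratch_supervised", "synthetic")}

experiments = [
    combo
    for combo in product(model_sizes, training_methods, data_variants)
    if combo not in EXCLUDED
]

for size, method, data in experiments:
    print(f"granite-{size} | {method} | {data}")
```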
Benchmarks we can compare against (numbers obtained from ChatGPT; we should validate them against relevant published papers):
![verylarge_ragft](https://github.com/redhat-et/datascience-wg/assets/7343099/217ea681-124b-4d8d-84e9-af11d0ce1350)