dromeuf opened this issue 3 months ago
My run uses unsloth 2024.8.
Building wheels for collected packages: unsloth
  Building wheel for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2024.8-py3-none-any.whl size=136912 sha256=7c408cab17d207a06e6b22693fac30b498acecb062c06f2b023b28ee42014a27
  Stored in directory: /tmp/pip-ephem-wheel-cache-u3imsks4/wheels/ed/d4/e9/76fb290ee3df0a5fc21ce5c2c788e29e9607a2353d8342fd0d
Successfully built unsloth
Installing collected packages: sentencepiece, xxhash, unsloth, shtab, requests, pyarrow, hf-transfer, fsspec, dill, multiprocess, tyro, transformers, datasets
  Attempting uninstall: sentencepiece
    Found existing installation: sentencepiece 0.1.99
    Uninstalling sentencepiece-0.1.99:
      Successfully uninstalled sentencepiece-0.1.99
  Attempting uninstall: requests
    Found existing installation: requests 2.31.0
    Uninstalling requests-2.31.0:
      Successfully uninstalled requests-2.31.0
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 14.0.2
    Uninstalling pyarrow-14.0.2:
      Successfully uninstalled pyarrow-14.0.2
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.6.1
    Uninstalling fsspec-2024.6.1:
      Successfully uninstalled fsspec-2024.6.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.42.4
    Uninstalling transformers-4.42.4:
      Successfully uninstalled transformers-4.42.4
Well, a couple of things come to mind. First, your rank is pretty low; it's set to 16. So it doesn't have much room to learn new information, and will instead pick up on superficial details of the training data. Lower-rank LoRAs are more for getting your model to follow a certain tone or style of writing, think something like RP. To learn new knowledge via a LoRA, you'd need a larger rank.
The other issue is your model of choice, Llama 3.1 8B. While it's a fantastic small model, it's only 8 billion parameters. So there is a pretty hard cap on its knowledge and general intelligence, especially when loading it in 4-bit. Maybe consider something larger like Mistral Nemo 12B or Gemma 27B.
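For what it's worth, bumping the rank is only a couple of lines with Unsloth; a rough sketch (the values are illustrative, not a recommendation):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 64,           # a larger rank gives the adapter more capacity to absorb new facts
    lora_alpha = 64,  # commonly set equal to (or 2x) the rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)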
@dromeuf Actually, wait - did you use tokenizer.apply_chat_template or our conversational notebooks? Apologies, but just after you posted, I updated all the Llama 3.1 tokenizers - so that's probably why it's not giving you correct answers (maybe).
But @DaddyCodesAlot's analysis is also reasonable if my first point fails to improve results.
Hi Dan, I am not using apply_chat_template. I am using my direct conversational dataset; there is an example in my post (example corpus QA/IO lines...). I will try different tests and get back to you. Best regards.
Wait, when you say direct conversational dataset - did you use our conversational-style notebooks (ShareGPT / Llama-3)?
Hi Dan, @danielhanchen
I've already used your unsloth.chat_templates / formatting_prompts_func(examples) code (from your Phi-3 example) on my dataset in ShareGPT format. That's not a problem, because my dataset preparation code can generate a dataset in the Llama 3.1 model-card chat format, ShareGPT, Alpaca... But for this run, since I have a Llama 3.1 model-card version of my dataset, I don't go through your unsloth.chat_templates / formatting_prompts_func(examples) code; I pass my dataset directly as a parameter, in the following form.
Example of a line from my JSONL dataset in the Llama 3.1 Instruct chat format, based on https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/ :
{"text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow many parts is Gaul divided into according to Caesar?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nthree<|eot_id|>"}
The printed output (another dataset line):
print("---- train_dataset:", train_dataset)
---- train_dataset: Dataset({
features: ['text'],
num_rows: 20837
})
print("Sample entry:\n", train_dataset[0]["text"])
Sample entry:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>user<|end_header_id|>
Who did Cassivellaunus send as an intermediary to negotiate a surrender with Caesar?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Commius the Atrebatian<|eot_id|>
The same first example in the ShareGPT JSON version of my dataset:
[
{
"conversations": [
{
"from": "system",
"value": "You are an expert historian specializing in the ancient world and antiquity."
},
{
"from": "human",
"value": "How many parts is Gaul divided into according to Caesar?"
},
{
"from": "gpt",
"value": "three"
}
]
},
]
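Loading either version is straightforward with the datasets library; a minimal sketch (the file names are placeholders, not my actual paths):

from datasets import load_dataset

# Pre-rendered Llama 3.1 version: one JSON object per line with a single "text" field.
train_dataset = load_dataset("json", data_files = "train_llama31.jsonl", split = "train")

# ShareGPT version: a "conversations" list per example, rendered into "text"
# later with get_chat_template + formatting_prompts_func (see further down the thread).
sharegpt_dataset = load_dataset("json", data_files = "train_sharegpt.json", split = "train")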
Do you think my Llama 3.1 model-card version is bad and that I'd be better off using my ShareGPT dataset version with your conversion function unsloth.chat_templates / formatting_prompts_func(examples)?
And obviously I should reload the latest version of unsloth, since you explained in your reply that you updated unsloth's Llama 3.1 tokenizers just after my question? There are also @DaddyCodesAlot's parameter suggestions, but I'll apply those later to evaluate the improvements...
Thanks in advance, David.
@dromeuf Actually, wait - did you use tokenizer.apply_chat_template or our conversational notebooks? Apologies, but just after you posted, I updated all the Llama 3.1 tokenizers - so that's probably why it's not giving you correct answers (maybe). But @DaddyCodesAlot's analysis is also reasonable if my first point fails to improve results.
@danielhanchen @DaddyCodesAlot I've used the ShareGPT version of my dataset with unsloth's llama-3 chat-template function get_chat_template(), and I've also increased the rank from r=16 to r=64 (r=64, lora_alpha=64, lora_dropout=0.1), but the result is pretty much the same. The answers from inference on the merged 16-bit model are still poor.
IMPORTANT: I've checked my train and validation datasets for the inference test question I'm asking about book 6, chapter 21. In my training dataset I have 41 question-answer or instruction-output entries for that chapter, including the complete chapter text in English, French and Latin (3 entries), so 38 QA or IO entries, including QA about their gods Sun, Moon and Vulcan/Fire:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>user<|end_header_id|>
Recognize the entities mentioned in the text as gods<|eot_id|><|start_header_id|>assistant<|end_header_id|>
The sun, fire, and the moon<|eot_id|>
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>user<|end_header_id|>
Which gods do the Germans believe in?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
The sun, fire, and the moon, as they are the only deities they can behold and by whose instrumentality they are obviously benefited.<|eot_id|>
In the validation part of my dataset, I have 11 QA and IO entries.
I'm still surprised by the quality of the responses to my inference questions. As DaddyCodesAlot suggested, is Llama 3.1 Instruct perhaps not strong enough to learn this type of dataset? Or is there another problem?
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
max_seq_length=max_seq_length,
load_in_4bit=True,
dtype=None,
)
model = FastLanguageModel.get_peft_model(
model,
r=64,
lora_alpha=64,
lora_dropout=0.1,
target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
use_rslora=True,
use_gradient_checkpointing="unsloth"
)
tokenizer = get_chat_template(
tokenizer,
chat_template = "llama-3",
mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)
def formatting_prompts_func(examples):
convos = examples["conversations"]
texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
return { "text" : texts, }
pass
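The rendered "text" column then comes from a batched map and is what SFTTrainer consumes via dataset_text_field="text"; a one-line sketch, assuming the ShareGPT data is already loaded into a variable named dataset:

# Render every conversation into the "text" column used by SFTTrainer.
dataset = dataset.map(formatting_prompts_func, batched = True)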
logs :
Successfully installed datasets-2.21.0 dill-0.3.8 hf-transfer-0.1.8 multiprocess-0.70.16 pyarrow-17.0.0 sentencepiece-0.2.0 shtab-1.7.1 transformers-4.44.0 tyro-0.8.8 unsloth-2024.8 xxhash-3.5.0
Building wheel for unsloth (pyproject.toml) ... done
Created wheel for unsloth: filename=unsloth-2024.8-py3-none-any.whl size=143194 sha256=40f1749bf76faecfd472f3bd627845aee2e575ccc488df82bbdd93122675006f
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))== Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
\\ /| GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \ Pytorch: 2.3.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.8 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
Model parameters prepared for PEFT:
trainable params: 167,772,160 || all params: 8,198,033,408 || trainable%: 2.0465
None
print("Sample entry:\n", train_dataset[0]["text"])
Sample entry:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>user<|end_header_id|>
Who did Cassivellaunus send as an intermediary to negotiate a surrender with Caesar?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Commius the Atrebatian<|eot_id|>
GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
6.457 GB of memory reserved.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
\\ /| Num examples = 2,410 | Num Epochs = 2
O^O/ \_/ \ Batch size per device = 4 | Gradient Accumulation steps = 4
\ / Total batch size = 16 | Total steps = 300
"-____-" Number of trainable parameters = 167,772,160
-- Training completed stats :
TrainOutput(global_step=300, training_loss=0.32471315984924637, metrics={'train_runtime': 4099.8106, 'train_samples_per_second': 1.176, 'train_steps_per_second': 0.073, 'total_flos': 4.523655094264136e+17, 'train_loss': 0.32471315984924637, 'epoch': 1.9900497512437811})
-- Evaluation validation dataset results :
{'eval_loss': 0.20628409087657928, 'eval_runtime': 107.9507, 'eval_samples_per_second': 4.863, 'eval_steps_per_second': 0.611, 'epoch': 1.9900497512437811}
----INFERENCE TEST----
print(model)
PeftModelForCausalLM(
(base_model): LoraModel(
(model): LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): lora.Linear4bit(
(base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=64, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=64, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(k_proj): lora.Linear4bit(
(base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=64, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=64, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(v_proj): lora.Linear4bit(
(base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=64, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=64, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(o_proj): lora.Linear4bit(
(base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=64, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=64, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(rotary_emb): LlamaExtendedRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): lora.Linear4bit(
(base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=64, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=64, out_features=14336, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(up_proj): lora.Linear4bit(
(base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=64, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=64, out_features=14336, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(down_proj): lora.Linear4bit(
(base_layer): Linear4bit(in_features=14336, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=14336, out_features=64, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=64, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((4096,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
)
)
Inference test prompts (poor answers):
[{'from': 'system', 'value': 'You are an expert historian specializing in the ancient world and antiquity.'}, {'from': 'human', 'value': 'What are the customs and gods of the Germans in chapter 21 of book 6 of the Gallic War?'}]
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>user<|end_header_id|>
What are the customs and gods of the Germans in chapter 21 of book 6 of the Gallic War?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
They worship gods they have not seen, and have customs very different from the Gallic people<|eot_id|>
[{'from': 'system', 'value': 'You are an expert historian specializing in the ancient world and antiquity.'}, {'from': 'human', 'value': "Que s'est-il passé à Gergovie en 52 BC ?"}]
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>user<|end_header_id|>
Que s'est-il passé à Gergovie en 52 BC?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Les habitants ont pris les armes, massacrés les citoyens romains et pillé leurs biens.<|eot_id|>
@dromeuf Oh, could you try chat_template = "llama-31", which is for Llama 3.1 - I just added it last week.
On inference, did you enable FastLanguageModel.for_inference(model)
before doing inference?
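For reference, the inference path should look roughly like this (a minimal sketch; the prompt is a placeholder and I'm assuming standard role/content keys rather than the from/value mapping):

from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # enable the faster inference mode

messages = [
    {"role": "system", "content": "You are an expert historian specializing in the ancient world and antiquity."},
    {"role": "user",   "content": "How many parts is Gaul divided into according to Caesar?"},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt"
).to(model.device)

outputs = model.generate(input_ids = inputs, max_new_tokens = 128)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))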
@dromeuf Oh, could you try chat_template = "llama-31", which is for Llama 3.1 - I just added it last week.
I did wonder about that, but I hadn't seen it in the doc notebook yesterday. I have a run in progress with the same dataset but using Mistral Nemo 12B Instruct. I'll relaunch with chat_template="llama-31" tomorrow.
On inference, did you enable
FastLanguageModel.for_inference(model)
before doing inference?
Yes.
Kind regards.
@dromeuf Oh, could you try chat_template = "llama-31", which is for Llama 3.1 - I just added it last week.
Hi Dan @danielhanchen, I changed the chat_template to "llama-31". I took the opportunity to increase the rank to r=256 for this new run, but the result is worse. In fact, a date now appears in the system part of my dataset conversations: "Cutting Knowledge Date: December 2023 / Today Date: 26 July 2024" (see logs below)!!! One of the inference responses also contains the date!
tokenizer = get_chat_template(
tokenizer,
chat_template = "llama-31",
mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)
def formatting_prompts_func(examples):
convos = examples["conversations"]
texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
return { "text" : texts, }
pass
print() sample entry:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 July 2024
You are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>human<|end_header_id|>
Who did Cassivellaunus send as an intermediary to negotiate a surrender with Caesar?<|eot_id|><|start_header_id|>gpt<|end_header_id|>
Commius the Atrebatian<|eot_id|>
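One thing I notice in this sample: the headers still read human and gpt instead of user and assistant, so the llama-31 template seems to receive the raw ShareGPT role names. A rough sketch of normalizing the roles to role/content before applying the chat template (purely illustrative, not my actual code):

role_map = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(examples):
    # Convert ShareGPT {"from", "value"} turns into standard {"role", "content"} turns.
    convos = []
    for convo in examples["conversations"]:
        convos.append([{"role": role_map.get(turn["from"], turn["from"]),
                        "content": turn["value"]} for turn in convo])
    return {"conversations": convos}

dataset = dataset.map(to_role_content, batched = True)
# get_chat_template(...) could then be called without the from/value mapping,
# and tokenizer.apply_chat_template(convo, ...) used as before.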
logs :
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))== Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
\\ /| GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \ Pytorch: 2.3.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
"-____-" Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.8 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
Model parameters prepared for PEFT:
trainable params: 671,088,640 || all params: 8,701,349,888 || trainable%: 7.7125
None
GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
8.332 GB of memory reserved.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
\\ /| Num examples = 2,624 | Num Epochs = 2
O^O/ \_/ \ Batch size per device = 4 | Gradient Accumulation steps = 4
\ / Total batch size = 16 | Total steps = 328
"-____-" Number of trainable parameters = 671,088,640
-- Training completed stats :
TrainOutput(global_step=328, training_loss=5.754552464659621, metrics={'train_runtime': 4624.7583, 'train_samples_per_second': 1.135, 'train_steps_per_second': 0.071, 'total_flos': 5.272500370805883e+17, 'train_loss': 5.754552464659621, 'epoch': 2.0})
-- Evaluation validation dataset results :
{'eval_loss': 2.709500551223755, 'eval_runtime': 121.7538, 'eval_samples_per_second': 4.731, 'eval_steps_per_second': 0.591, 'epoch': 2.0}
INFERENCE TEST (completely wrong answer, with a date appearing in the system part of my prompt ?!):
[{'from': 'system', 'value': 'You are an expert historian specializing in the ancient world and antiquity.'}, {'from': 'human', 'value': 'What are the customs and gods of the Germans in chapter 21 of book 6 of the Gallic War?'}]
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 July 2024
You are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>human<|end_header_id|>
What are the customs and gods of the Germans in chapter 21 of book 6 of the Gallic War?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Commenthuman<|eot_id|>
[{'from': 'system', 'value': 'You are an expert historian specializing in the ancient world and antiquity.'}, {'from': 'human', 'value': "Que s'est-il passé à Gergovie en 52 BC ?"}]
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 July 2024
You are an expert historian specializing in the ancient world and antiquity.<|eot_id|><|start_header_id|>human<|end_header_id|>
Que s'est-il passé à Gergovie en 52 BC?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
<|start_header_id|>human<|end_header_id|>
<|eot_id|>
Decidedly, another problem ;-) :
@danielhanchen Dan, I ran a first training run with Mistral Nemo 12B Instruct, but the inference results are strange too (the output degenerates into repeated tokens: "the the the the the ... ,,,,,,"). I didn't add EOS_TOKEN in the function that converts my ShareGPT dataset with chat_template="mistral". Looking at your Mistral Nemo notebook with the Alpaca example, I've just seen that you need to add EOS_TOKEN. In your blog you mention a fix in progress. Do I still need to add EOS_TOKEN?
Example of a print() of an entry in my dataset, converted from ShareGPT to the Mistral Nemo 12B Instruct format by tokenizer.apply_chat_template(), which does already contain an EOS token (</s>):
<s>[INST] You are an expert historian specializing in the ancient world and antiquity. Who did Cassivellaunus send as an intermediary to negotiate a surrender with Caesar? [/INST]Commius the Atrebatian</s>
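For comparison, the Alpaca-style pattern from your notebook appends the EOS token explicitly in the formatting function; a rough sketch of that pattern (the field names and prompt layout are placeholders, not my actual code):

EOS_TOKEN = tokenizer.eos_token  # "</s>" for Mistral Nemo

def formatting_prompts_func(examples):
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        # Without a trailing EOS token the model may never learn to stop,
        # which can show up as endless repetition at inference time.
        texts.append("[INST] " + instruction + " [/INST]" + output + EOS_TOKEN)
    return {"text": texts}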
Inference test Mistral Nemo 12B Instruct :
[{'from': 'system', 'value': 'You are an expert historian specializing in the ancient world and antiquity.'}, {'from': 'human', 'value': 'What are the customs and gods of the Germans in chapter 21 of book 6 of the Gallic War?'}]
<s>[INST] You are an expert historian specializing in the ancient world and antiquity. What are the customs and gods of the Germans in chapter 21 of book 6 of the Gallic War? [/INST] the the the the the the the the the the the the the the the the the the the the [...] the, the, the, the, the, and, the, [...]
the,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, [... the repetition continues ...]
@dromeuf The date is correct - I asked the HF team to enquire about this, but Llama 3.1 Instruct's system prompt has the date.
Would it be possible for you to share the exact scripts of how you train these models (dataset part can be removed), so I can investigate further on my side - appreciate it a lot :)
Something seems wrong with the Mistral finetune as well
@dromeuf The date is correct - I asked the HF team to enquire about this, but Llama 3.1 Instruct's system prompt has the date.
Well then! ;-) That's an unusual surprise.
Would it be possible for you to share the exact scripts of how you train these models (dataset part can be removed), so I can investigate further on my side - appreciate it a lot :)
Something seems wrong with the Mistral finetune as well
@danielhanchen Hi Dan, I've sent you an email to your Gmail inbox with all the links to my Colab notebook and my Google Drive datasets (ShareGPT and Llama 3 model-card versions).
Regarding this GitHub thread, the problem must stem from a mistake I'm making, despite a training_loss and eval_loss that are not too bad. I'm reading a lot of blog posts that reuse your tutorial notebook code (or that don't use unsloth), but they don't benchmark afterwards; they just test inference on a question whose answer the pre-trained model already knows. I'd like to write a script that re-prompts the fine-tuned model with all the Q-A or I-O pairs in my dataset and then analyzes and scores the responses of the model I've fine-tuned, roughly like the sketch below. I think with this approach I'd get a real evaluation of my fine-tuning. Do you recommend this type of benchmark, or another approach or tool?
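Something like this (the column names and the scoring step are placeholders; a real run would use exact-match rules or an LLM judge depending on the question type):

from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)

results = []
for example in eval_dataset:                    # the held-out QA / IO pairs
    question, reference = example["question"], example["answer"]
    messages = [
        {"role": "system", "content": "You are an expert historian specializing in the ancient world and antiquity."},
        {"role": "user",   "content": question},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt"
    ).to(model.device)
    outputs = model.generate(input_ids = inputs, max_new_tokens = 256)
    # Keep only the newly generated tokens, not the prompt.
    prediction = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens = True)
    results.append({"question": question, "reference": reference, "prediction": prediction})

# results can then be scored (exact match, similarity, LLM-as-judge) and averaged.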
Many thanks for debugging, Dan, as you must be very busy.
Kind regards, David.
@dromeuf Oh saw your email - will definitely take a look
Hi Dan @danielhanchen, were you able to work with my dataset and notebook to solve or identify the problem? Kind regards,
@dromeuf Much apologies I did not have a chance - will try over the weekend sorry :(
@danielhanchen No problem, Dan; the important thing is to understand what's going on (my error, a dataset problem, 8B being too small, or something else detected), so that we can move forward and other unsloth users can benefit from your expertise. I understand perfectly that you're very busy.
@danielhanchen Hi Dan, have you still not had time to look into my fine-tune problem? Kind regards,
@danielhanchen Hello Dan, is your latest blog post about the gradient accumulation fix (https://unsloth.ai/blog/gradient) related to my problem? Is it worth launching a new training run on my dataset to evaluate the new version and function, or should I wait for your complete investigation? Kind regards,
@dromeuf It might be related, but I'm unsure. Also, apologies I haven't gotten to it - unfortunately it totally slipped my mind - I'll ask someone to take over and see if they can help replicate the issue. If you want to rerun it, you'll first have to use the nightly transformers version -
You can continue using unsloth_train
or just use trainer.train()
as in the blog:
!pip uninstall unsloth -y
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip uninstall transformers -y && pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git"
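In the training cell itself, a rough sketch of the two options (assuming unsloth_train is importable from the top-level unsloth package):

from unsloth import unsloth_train

# Either of these works; unsloth_train applies the corrected gradient accumulation path.
trainer_stats = unsloth_train(trainer)
# trainer_stats = trainer.train()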
@danielhanchen Dan, I updated unsloth and transformers and then ran a new training run on my dataset with the code I sent you. The situation is identical: the responses are still nonsensical.
I also wanted to test with your new notebook for Llama 3.2 3B Instruct. I hardly changed any of your code and ran the training on my dataset. The responses are a little less crazy, but not at all convincing. In your example, if I use the first inference path without TextStreamer, nothing comes out at all (blank). If I use your inference code with TextStreamer, something does come out (but it isn't convincing).
I don't know what's going on with my dataset. Maybe it's too demanding for 3.1 8B and 3.2 3B because it contains texts in English, French and Latin? I'm going to try a new training run with only the English texts to see what happens.
@danielhanchen Dan, I've simplified my dataset by keeping only the English texts, removing the French and Latin ones. I re-ran training with both my Llama 3.1 8B Instruct code and your Llama 3.2 3B Instruct notebook code. Unfortunately, it's the same thing: the answers to the inference questions are not good, if not nonsensical.
Example with TextStreamer:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023 Today Date: 26 July 2024
You are an expert historian specializing in the ancient world and antiquitys.<|eot_id|><|start_header_id|>human<|end_header_id|>
What are the customs and gods of the Germans in chapter 21 of book 6 of the Gallic War?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
provide, You<|eot_id|>
Maybe consider open sourcing your dataset so that people can take a crack at it. My gut feeling is that there are 3 issues that are all contributing to this error.
1) You're using quantization of low-parameter models, which increases perplexity.
2) There's a rank issue: LoRAs for small models at low ranks do not learn new information, but rather adapt to the format and overfit.
3) There's an issue with the training process, maybe with certain parameters being off.
@DaddyCodesAlot My dataset isn't completely finished yet; I wanted to complete and optimize it, particularly regarding the Romans' proper names, to make the characters perfectly identifiable and avoid homonyms. That's why I didn't want to publish a version that I think can still be improved and completed. But if someone wants to work on it and, above all, benchmark the training results (which I don't see much of in the many publications), I'm willing to give them access to my Google Drive directory and my notebooks - you can send me an e-mail at my Gmail (simple first.last@gmail.com). I'd love to be able to use unsloth for a larger project, but for the moment there's a problem somewhere. I agree with your 3 points. Kind regards, David Romeuf.
Dubious! I'll start my explanation with this deliberately provocative adjective to make progress on the subject and find my mistake.
On the web there is a craze for fine-tuning (with unsloth or other tools, though I haven't tested the others yet), and you can find many blog articles that often reuse the same code (based on Dan's) with a few variations in the SFTTrainer() and dataset parameters...
I wanted to fine-tune Llama 3.1 8B Instruct bnb-4bit (and the standard version, but the results are identical) on an ancient book I'm working on, "De Bello Gallico", in the hope that my fine-tuned model could provide a relatively accurate answer to the question "What are the customs and gods of the Germans in chapter 21 of book 6 of the Gallic War?".
I chose this question because GPT-4o, Llama 3.0/3.1 (8B and 70B on Groq) and Gemini answer it with approximations and big errors. Only Claude 3.5 Sonnet answers it well enough; at least it gives the best answer.
My idea was therefore to specialize Llama 3.1 (or another model) on this corpus (although RAG may be better suited to my situation). To do this, I created my own dataset, which includes the full corpus text in English, French and Latin, plus a series of more than 20,000 instruction-output and question-answer pairs in English and French, split into a training dataset and an evaluation dataset (only QA and IO pairs, no corpus text, in the eval dataset). The results of train and eval with SFTTrainer() over 2 epochs:
-- Training completed stats : TrainOutput(global_step=300, training_loss=0.37135863726337753, metrics={'train_runtime': 3545.2205, 'train_samples_per_second': 1.36, 'train_steps_per_second': 0.085, 'total_flos': 4.449468983153787e+17, 'train_loss': 0.37135863726337753, 'epoch': 1.9900497512437811})
-- Evaluation validation dataset results : {'eval_loss': 0.22035124897956848, 'eval_runtime': 93.5197, 'eval_samples_per_second': 5.614, 'eval_steps_per_second': 0.706, 'epoch': 1.9900497512437811}
I ran a fine-tuning run for 2 epochs in a Colab notebook with an A100 GPU and obtained a model that I saved in the Hugging Face merged 16-bit format. I was able to perform inference with various methods: transformers.pipeline(), model.generate(), and different parameters such as do_sample=False (temperature=None, top_p=None, top_k=None) or do_sample=True (with temperature=0.7, top_p=0.9, top_k=50). I also converted the merged 16-bit model to the GGUF format and ran the same inference query on Ollama.
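Concretely, the two decoding settings looked roughly like this (a sketch; inputs is assumed to be the prompt already rendered with tokenizer.apply_chat_template):

# Greedy decoding: do_sample=False, sampling parameters left unset.
outputs = model.generate(input_ids = inputs, max_new_tokens = 256, do_sample = False)

# Sampled decoding with the parameters mentioned above.
outputs = model.generate(input_ids = inputs, max_new_tokens = 256,
                         do_sample = True, temperature = 0.7, top_p = 0.9, top_k = 50)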
However, the responses from these various inference runs of my fine-tuned 3.1 8B Instruct bnb-4bit model were incorrect, very poor, or unsatisfactory.
I am seeking to understand the root of the issue and have several, perhaps provocative, questions :
Are my fine-tuning parameters and dataset appropriate (see details below)? While they could be improved, are two epochs insufficient?
Does my dataset contain enough question-answer or instruction-output pairs to fine-tune on this corpus text? It's not easy to generate more QA/IO...
I am disappointed with the responses to a specific question I used as an example, but I would like to benchmark all questions and instructions in my dataset against the entire text. Do you know of a convenient tool to do this relatively easily ? I have started asking Claude to evaluate the responses of my fine-tuned model based on the English and French text corpus provided in the prompt. While the evaluation seems fairly good, benchmarking manually without a script is impractical... but I could write a script.
All articles on the web provide examples of simple inference questions in the code, but I have yet to see a complete benchmark of the results. Are their results ultimately good ? I believe it would be beneficial to conduct a comprehensive benchmark of the results afterwards.
Is QLoRA/LoRA 4-bit fine-tuning too simplistic for this kind of work? Should I use RAG instead? But in that case, what are the practical applications of these quickly quantized fine-tuned models?
Thanks in advance, David.