@ammarali32 You're correct - the issue is in fact my code - I forgot that generation's KV cache handling is different from training....... OOPS
["<s> Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8, in ,. Home1 (,\n ( P R, ( inЉa...2 in , ,. Ch,, Pin1 , ’s P( ( T. Home\n , (. Ever. P, in W...Љi:1 M I\n Chl\n Ch/ (l. A:,l, !s:in'1B’ ::’ . The-,\n / to ( T P \n P ( , /, ,\n Ch ( to. Home a \n R ( , A1 , a , by , lЋ, T' I\n T. P s ' /. Home (\n U ...; toЉ\n F a!' by...; ss . Home (l l ,a- I to\n -\n ( (,,, RS (- . Home...\n Home, \n Home /\n The (,a T ( T- (\n New\n by’. Home. Home inЉ C. Home"]
when it's supposed to be
['<s> Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121417, 196835, 322587, 530487, 864030, 1398269, 2256353, 3622777, 5988889, 9872040, 16099689, 26632467, 43321641, 70211837, 112378173, 183084336, 2942']
@ammarali32 I fixed it!!! It would be awesome if you can try it out! I also updated the Alpaca Colab example https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing
Unsloth 2024.1 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers. ['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n ### Instruction:\n Continue the fibonnaci sequence.\n\n ### Input:\n 1, 1, 2, 3, 5, 8\n\n ### Response:\n 1. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,']
I got the same error as above - do you have any suggestions for how to fix it?
@namtranase Did you upgrade Unsloth to the current latest, or is this on Colab? I resolved this issue, which unfortunately popped up again a few days ago - apologies!
Thank you for your quick response @danielhanchen, I pulled the latest version and followed the installation steps for my T4 device:
Run the file:
test_unsloth_model.py
For this test, I used the TinyLlama-1.1B-Chat-v1.0 model
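(The script does roughly the following - a minimal sketch assuming the standard FastLanguageModel API and the Alpaca prompt visible in the output below; the actual test_unsloth_model.py may differ slightly.)

from unsloth import FastLanguageModel

# Load the model the same way as for finetuning (4-bit on a T4)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_seq_length = 2048,
    dtype = None,          # auto-detect (float16 on a T4)
    load_in_4bit = True,
)

# Alpaca-style prompt, as seen in the output below
prompt = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nContinue the fibonnaci sequence.\n\n"
    "### Input:\n1, 1, 2, 3, 5, 8\n\n"
    "### Response:\n"
)
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128)
print(tokenizer.batch_decode(outputs))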
And the output is:
['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n ### Instruction:\n Continue the fibonnaci sequence.\n\n ### Input:\n 1, 1, 2, 3, 5, 8\n\n ### Response:\n 1, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \n \n \n \n 1, 1, 1, the sequence:\n \n\n\n\n\n\n, 1, 1, 1, 1, 1, 1, 1, 1, the sequence:\n \n 1, 1, 1, 1, the task:\n \n \n \n \n \n \n \n \n \n \n 1,']
Is there something wrong with the test file? Can you check and give me your feedback? Thank you!
@namtranase Hey, sorry - back now! For TinyLlama, you'll have to follow their prompt format exactly:
# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
{
"role": "system",
"content": "You are a friendly chatbot who always responds in the style of a pirate",
},
{"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# ...
@danielhanchen Thanks, everything seems to be fine. Will you support mapping the model across multiple GPUs, even if only as sequential blocks, for training?
The output improved after I changed the prompt to the chat template.
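(The change was roughly the following - a sketch assuming the tokenizer's built-in chat template, with model and tokenizer loaded via FastLanguageModel.from_pretrained as before.)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
# Format with the tokenizer's chat template instead of the Alpaca prompt
prompt = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 256)
print(tokenizer.batch_decode(outputs))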
Using the default (the one you provided above):
<|system|> You are a friendly chatbot who always responds in the style of a pirate</s> <|user|> How many helicopters can a human eat in one sitting?</s> <|assistant|> According to a study published in the Journal of the American Medical Association (JAMA), a human can eat around 500-600 calories per hour. However, the amount of calories consumed by a human depends on factors such as age, gender, physical activity level, and dietary habits. It is not possible to accurately determine how many helicopters a human can eat in one sitting based on this study.
Using my script (after changing to the chat template):
['<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate</s> \n<|user|>\nHow many helicopters can a human eat in one sitting?</s> \n<|assistant|>\nThere is no specific number of helicopes that a human beings that a human beings that a human beings that a human beings of helicers who can a human beings of helicers who can a human beings of helic helic helicers whoever, as there is a human beast a human be a human beast a human beast a human beast a human beast a human beastrobot is a human beverballoveget a human be a human beastrophet is a human:\n|user|user|user|user|user|user']
Some steps I would take to improve my experiments:
- get_peft_model makes the output different from the original
If you think I need to try more approaches, please let me know - I really enjoy playing with the repo!
@ammarali32 Oh it's not yet supported - we're working on making it available in our next release :)
@namtranase Oh wait, did you get gibberish after finetuning or before finetuning? If after finetuning, I suggest you use the chat template directly for finetuning, not Alpaca. That's because the model you are using is already finetuned. I would use the non-chat version if you want the Alpaca style.
Yes, thanks @danielhanchen. I added do_sample=True, temperature=0.7, top_k=50, top_p=0.95 to the test script before finetuning and it now outputs correctly (see the sketch below). I will be updating my model based on the chat template. I think we can close this issue.
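(A rough sketch of the updated generation call, reusing the inputs from the script above:)

outputs = model.generate(
    **inputs,
    max_new_tokens = 256,
    do_sample = True,     # sample instead of greedy decoding
    temperature = 0.7,
    top_k = 50,
    top_p = 0.95,
)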
@namtranase Oh not yet :) I actually am working on making inference 2-4x faster, which I might push in an hour :) If you can, it would be wonderful if you could test it out :))
@namtranase Ok not in an hour whoops - probably in the next few days!! :(
It's ok man, your speed is still god 💯
@namtranase @ammarali32 Just pushed a 2x faster inference on the main branch!! :) Hope you can try it out :)) It natively makes inference faster without any tricks - i.e. num_beams, batched generation, etc. are all faster :)
Call FastLanguageModel.for_inference(model) before doing inference to make it faster :)
Call FastLanguageModel.for_training(model) to revert it back for finetuning.
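(A rough usage sketch, assuming a model already loaded with FastLanguageModel.from_pretrained:)

from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)    # enable the faster inference path
outputs = model.generate(**inputs, max_new_tokens = 256)
print(tokenizer.batch_decode(outputs))

FastLanguageModel.for_training(model)     # revert before resuming finetuning
# trainer.train()                         # then continue finetuning as usual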
https://github.com/unslothai/unsloth/assets/23090290/dab1ea44-34bc-4585-819f-3621614ff871
Can you share a bit about how you sped it up (if it is your secret sauce then no need)? I checked it and it is faster - I only needed to apply FastLanguageModel.for_inference(model).
Before: Inference time: 9.283933162689209 seconds
After: Inference time: 8.08155345916748 seconds
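(The timings were roughly measured like this - a sketch, since the exact benchmarking script isn't shown:)

import time

start = time.time()
outputs = model.generate(**inputs, max_new_tokens = 256)
print(f"Inference time: {time.time() - start} seconds")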
@namtranase Oh interesting, it's not as fast as I would have hoped :)) Oh, it's open source - you can inspect the code however you like :))
I would have expected 4s LOL - are you doing 4bit or 16bit, and could you tell me a bit more about your hardware? :)))
Closing for now!
Hello, thanks for your contribution - it is really promising, but for some reason it breaks generation and inference. Here is an example:
The output:
I have tried more than 4 different Llamas, including yours, and hit the same issue.