unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
18.03k stars 1.25k forks

Unsloth breaks the inference?! #21

Closed ammarali32 closed 1 month ago

ammarali32 commented 11 months ago

Hello, thanks for your contribution. It is really promising, but for some reason it breaks generation and inference. Here is an example:

from unsloth import FastLlamaModel
import torch
max_seq_length = 1024 # Can change to any number <= 4096
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = "TheBloke/Llama-2-7B-fp16", # Supports any llama model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = False,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
inputs = tokenizer.encode("the concept of ", return_tensors="pt", add_special_tokens = True).to(model.device)
answer = model.generate(inputs, max_new_tokens = 20)
tokenizer.batch_decode(answer, skip_special_tokens = False)

The output:

==((====))==  Unsloth: Fast Llama patching release 23.11
   \\   /|    GPU: A100-SXM4-40GB. Max memory: 39.587 GB
O^O/ \_/ \    CUDA compute capability = 8.0
\        /    Pytorch version: 2.1.0+cu118. CUDA Toolkit = 11.8
 "-____-"     bfloat16 support = TRUE

Loading checkpoint shards: 100%
2/2 [00:14<00:00, 6.61s/it]
['<s> the concept of 1<s> Tags\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n']

I have tried more than 4 different Llama models, including yours, and hit the same issue.

danielhanchen commented 11 months ago

@ammarali32 You're correct - the issue is in fact in my code - I forgot that generation's KV cache handling is different from training's. Oops! Right now generation produces:

["<s> Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8, in ,. Home1 (,\n  ( P R, (  inЉa...2 in , ,. Ch,, Pin1    , ’s P(  ( T. Home\n , (. Ever. P, in W...Љi:1  M I\n Chl\n Ch/ (l. A:,l,  !s:in'1B’  ::’ . The-,\n / to ( T  P \n P (   , /,  ,\n Ch  ( to. Home a \n R (  , A1 ,  a , by , lЋ, T' I\n T. P s ' /. Home (\n U ...; toЉ\n F a!' by...; ss . Home (l l ,a- I to\n -\n  (   (,,,  RS (-  . Home...\n Home, \n Home /\n The (,a T ( T- (\n New\n by’. Home. Home inЉ C. Home"]

when it's supposed to be

['<s> Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121417, 196835, 322587, 530487, 864030, 1398269, 2256353, 3622777, 5988889, 9872040, 16099689, 26632467, 43321641, 70211837, 112378173, 183084336, 2942']
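For context, here is a rough sketch (not the actual fix, just an illustration of the mechanism) of how generation reuses a KV cache while training always processes the full sequence - the model name is simply the one from the report above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Llama-2-7B-fp16" # same model as in the report above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype = torch.float16, device_map = "auto")

inputs = tokenizer("the concept of ", return_tensors = "pt").to(model.device)
with torch.no_grad():
    # Prefill: the whole prompt is processed once and its keys / values are cached.
    out = model(**inputs, use_cache = True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim = -1)
    # Decode step: only the newest token is fed in together with the cache,
    # so the attention path differs from the full-sequence path used in training.
    out = model(input_ids = next_token, past_key_values = past_key_values, use_cache = True)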
danielhanchen commented 11 months ago

@ammarali32 I fixed it!!! It would be awesome if you could try it out! I also updated the Alpaca Colab example https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing


namtranase commented 9 months ago

Unsloth 2024.1 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers. ['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n ### Instruction:\n Continue the fibonnaci sequence.\n\n ### Input:\n 1, 1, 2, 3, 5, 8\n\n ### Response:\n 1. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,']

I got the same error as above. Do you have any suggestions for how to fix this?

danielhanchen commented 9 months ago

@namtranase Did you upgrade Unsloth to the current latest version, or is this on Colab? I resolved this issue, which unfortunately popped up again a few days ago - apologies!

namtranase commented 9 months ago

Thank you for your quick response @danielhanchen. I pulled the latest version and followed the installation steps for my T4 device:

Run the file: test_unsloth_model.py For this test, I used the TinyLlama-1.1B-Chat-v1.0 model And the output is: ['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n ### Instruction:\n Continue the fibonnaci sequence.\n\n ### Input:\n 1, 1, 2, 3, 5, 8\n\n ### Response:\n 1, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \n \n \n \n 1, 1, 1, the sequence:\n \n\n\n\n\n\n, 1, 1, 1, 1, 1, 1, 1, 1, the sequence:\n \n 1, 1, 1, 1, the task:\n \n \n \n \n \n \n \n \n \n \n 1,']

Is there something wrong with the test file? Can you check and give me your feedback? Thank you!

danielhanchen commented 9 months ago

@namtranase Hey, sorry for the slow reply! For TinyLlama, you'll have to follow their prompt format exactly:

# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# ...
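If you want the same prompt format through Unsloth instead of the transformers pipeline, a rough sketch could look like this (the loader arguments and sampling settings are just illustrative, mirroring the snippet above):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_seq_length = 1024,
    dtype = None, # auto-detect: bfloat16 on Ampere+, float16 on T4 / V100
    load_in_4bit = False,
)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
# Build the prompt with the model's own chat template, exactly as in the pipeline example.
prompt = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)

inputs = tokenizer(prompt, return_tensors = "pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens = 256, do_sample = True, temperature = 0.7, top_k = 50, top_p = 0.95)
print(tokenizer.batch_decode(outputs, skip_special_tokens = False)[0])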
ammarali32 commented 9 months ago

@danielhanchen Thanks, everything seems to be fine. Will you support mapping the model across multiple GPUs, even if only as sequential blocks, for training?

namtranase commented 9 months ago

The output improved after I changed the prompt to the chat template.

Using the default pipeline (as you provided above):

<|system|> You are a friendly chatbot who always responds in the style of a pirate</s> <|user|> How many helicopters can a human eat in one sitting?</s> <|assistant|> According to a study published in the Journal of the American Medical Association (JAMA), a human can eat around 500-600 calories per hour. However, the amount of calories consumed by a human depends on factors such as age, gender, physical activity level, and dietary habits. It is not possible to accurately determine how many helicopters a human can eat in one sitting based on this study.

Using my script (after changing to the chat template):

['<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate</s> \n<|user|>\nHow many helicopters can a human eat in one sitting?</s> \n<|assistant|>\nThere is no specific number of helicopes that a human beings that a human beings that a human beings that a human beings of helicers who can a human beings of helicers who can a human beings of helic helic helicers whoever, as there is a human beast a human be a human beast a human beast a human beast a human beast a human beastrobot is a human beverballoveget a human be a human beastrophet is a human:\n|user|user|user|user|user|user']

There are some further steps I will take to improve my experiments.

danielhanchen commented 9 months ago

@ammarali32 Oh, it's not yet supported - we're working on getting it out in our next release :)

danielhanchen commented 9 months ago

@namtranase Oh wait, did you get gibberish after finetuning or before finetuning? If after finetuning, I suggest you finetune directly on the chat template, and not Alpaca, because the model you are using is already finetuned. I would use the non-chat version if you want the Alpaca style.
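To make that concrete, here is a minimal sketch of formatting an Alpaca-style dataset with the model's chat template instead of the Alpaca prompt (the dataset name and field handling here are assumptions, not the actual script from this thread):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def to_chat_template(example):
    # Fold the Alpaca instruction / input / output fields into chat turns,
    # then render them with the model's own template.
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n" + example["input"]
    messages = [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]},
    ]
    example["text"] = tokenizer.apply_chat_template(messages, tokenize = False)
    return example

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(to_chat_template)
# The "text" column can then be fed to the usual SFT trainer setup.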

namtranase commented 9 months ago

Yes, thanks @danielhanchen. I added do_sample=True, temperature=0.7, top_k=50, top_p=0.95 to the test script before finetuning and it now outputs correctly. I will be updating my model based on the chat template. I think we can close this issue.

danielhanchen commented 9 months ago

@namtranase Oh, not yet :) I am actually working on making inference 2-4x faster, which I might push in an hour :) If you can, it would be wonderful if you could test it out :))

danielhanchen commented 9 months ago

@namtranase Ok not in an hour whoops - probably in the next few days!! :(

namtranase commented 9 months ago

It's ok man, your speed is still god 💯

danielhanchen commented 9 months ago

@namtranase @ammarali32 Just pushed 2x faster inference to the main branch!! :) Hope you can try it out :)) It natively makes inference faster without any tricks - i.e. num_beams, batched generation, etc. are all faster :)

Call FastLanguageModel.for_inference(model) before doing inference to make it faster :) Call FastLanguageModel.for_training(model) to revert it back for finetuning.
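A quick usage sketch of the two calls (the model name and prompt here are placeholders, not from this thread):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-2-7b-bnb-4bit", # placeholder model
    max_seq_length = 1024,
    load_in_4bit = True,
)

FastLanguageModel.for_inference(model) # switch to the faster inference path
inputs = tokenizer("Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,", return_tensors = "pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.batch_decode(outputs)[0])

FastLanguageModel.for_training(model) # revert before resuming finetuning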

https://github.com/unslothai/unsloth/assets/23090290/dab1ea44-34bc-4585-819f-3621614ff871

namtranase commented 9 months ago

Can you share a bit about how you sped it up (if it is your secret sauce then no need)? I checked it and it is faster; I only needed to apply FastLanguageModel.for_inference(model).

Before: Inference time: 9.283933162689209 seconds
After: Inference time: 8.08155345916748 seconds

danielhanchen commented 9 months ago

@namtranase Oh interesting, it's not as fast as I would have hoped :)) Oh, it's open source - you can inspect the code however you like :))

I would have expected ~4s LOL - are you doing 4bit or 16bit, and could you tell me a bit more about your hardware? :)))

danielhanchen commented 1 month ago

Closing for now!