unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
18.03k stars 1.25k forks

Unsloth breaks the inference?! #21

Closed ammarali32 closed 1 month ago

ammarali32 commented 11 months ago

Hello, thanks for your contribution. It is really promising, but for some reason it breaks generation and inference. Here is an example:

from unsloth import FastLlamaModel
import torch
max_seq_length = 1024 # Can change to any number <= 4096
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = "TheBloke/Llama-2-7B-fp16", # Supports any llama model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = False,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
inputs = tokenizer.encode("the concept of ", return_tensors="pt", add_special_tokens = True).to(model.device)
answer = model.generate(inputs, max_new_tokens = 20)
tokenizer.batch_decode(answer, skip_special_tokens = False)

The output:

==((====))==  Unsloth: Fast Llama patching release 23.11
   \\   /|    GPU: A100-SXM4-40GB. Max memory: 39.587 GB
O^O/ \_/ \    CUDA compute capability = 8.0
\        /    Pytorch version: 2.1.0+cu118. CUDA Toolkit = 11.8
 "-____-"     bfloat16 support = TRUE

Loading checkpoint shards: 100%
2/2 [00:14<00:00, 6.61s/it]
['<s> the concept of 1<s> Tags\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n']

I have tried more than 4 different Llama models, including yours, and hit the same issue.

danielhanchen commented 11 months ago

@ammarali32 You're correct - the issue is in fact in my code - I forgot that generation's KV cache handling is different from training's. Oops! Right now generation produces:

["<s> Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8, in ,. Home1 (,\n  ( P R, (  inЉa...2 in , ,. Ch,, Pin1    , ’s P(  ( T. Home\n , (. Ever. P, in W...Љi:1  M I\n Chl\n Ch/ (l. A:,l,  !s:in'1B’  ::’ . The-,\n / to ( T  P \n P (   , /,  ,\n Ch  ( to. Home a \n R (  , A1 ,  a , by , lЋ, T' I\n T. P s ' /. Home (\n U ...; toЉ\n F a!' by...; ss . Home (l l ,a- I to\n -\n  (   (,,,  RS (-  . Home...\n Home, \n Home /\n The (,a T ( T- (\n New\n by’. Home. Home inЉ C. Home"]

when it's supposed to be

['<s> Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121417, 196835, 322587, 530487, 864030, 1398269, 2256353, 3622777, 5988889, 9872040, 16099689, 26632467, 43321641, 70211837, 112378173, 183084336, 2942']
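For context, here is a rough sketch (not the actual fix, just an illustration of the mechanism) of how generation reuses a KV cache while training always processes the full sequence - the model name is simply the one from the report above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Llama-2-7B-fp16" # same model as in the report above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype = torch.float16, device_map = "auto")

inputs = tokenizer("the concept of ", return_tensors = "pt").to(model.device)
with torch.no_grad():
    # Prefill: the whole prompt is processed once and its keys / values are cached.
    out = model(**inputs, use_cache = True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim = -1)
    # Decode step: only the newest token is fed in together with the cache,
    # so the attention path differs from the full-sequence path used in training.
    out = model(input_ids = next_token, past_key_values = past_key_values, use_cache = True)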
danielhanchen commented 11 months ago

@ammarali32 I fixed it!!! It would be awesome if you could try it out! I also updated the Alpaca Colab example https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing


namtranase commented 9 months ago

Unsloth 2024.1 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers. ['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n ### Instruction:\n Continue the fibonnaci sequence.\n\n ### Input:\n 1, 1, 2, 3, 5, 8\n\n ### Response:\n 1. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,. Below,']

I got the same error as above. Do you have any suggestions for how to fix this?

danielhanchen commented 9 months ago

@namtranase Did you upgrade Unsloth to the current latest version, or is this on Colab? I resolved this issue, which unfortunately popped up again a few days ago - apologies!

namtranase commented 9 months ago

Thank you for your quick response @danielhanchen. I pulled the latest version and followed the installation steps for my T4 device:

Run the file: test_unsloth_model.py For this test, I used the TinyLlama-1.1B-Chat-v1.0 model And the output is: ['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n ### Instruction:\n Continue the fibonnaci sequence.\n\n ### Input:\n 1, 1, 2, 3, 5, 8\n\n ### Response:\n 1, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \n \n \n \n 1, 1, 1, the sequence:\n \n\n\n\n\n\n, 1, 1, 1, 1, 1, 1, 1, 1, the sequence:\n \n 1, 1, 1, 1, the task:\n \n \n \n \n \n \n \n \n \n \n 1,']

Is there something wrong with the test file? Can you check and give me your feedback? Thank you!

danielhanchen commented 9 months ago

@namtranase Hey, sorry for the slow reply! For TinyLlama, you'll have to follow their prompt format exactly:

# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# ...
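If you want the same prompt format through Unsloth instead of the transformers pipeline, a rough sketch could look like this (the loader arguments and sampling settings are just illustrative, mirroring the snippet above):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_seq_length = 1024,
    dtype = None, # auto-detect: bfloat16 on Ampere+, float16 on T4 / V100
    load_in_4bit = False,
)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
# Build the prompt with the model's own chat template, exactly as in the pipeline example.
prompt = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)

inputs = tokenizer(prompt, return_tensors = "pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens = 256, do_sample = True, temperature = 0.7, top_k = 50, top_p = 0.95)
print(tokenizer.batch_decode(outputs, skip_special_tokens = False)[0])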
ammarali32 commented 9 months ago

@danielhanchen Thanks, everything seems to be fine. Will you support mapping the model across multiple GPUs, even if only as sequential blocks, for training?

namtranase commented 9 months ago

The output improved after I changed the prompt to the chat template.

Using the default pipeline (as you provided above):

<|system|> You are a friendly chatbot who always responds in the style of a pirate</s> <|user|> How many helicopters can a human eat in one sitting?</s> <|assistant|> According to a study published in the Journal of the American Medical Association (JAMA), a human can eat around 500-600 calories per hour. However, the amount of calories consumed by a human depends on factors such as age, gender, physical activity level, and dietary habits. It is not possible to accurately determine how many helicopters a human can eat in one sitting based on this study.

Using my script (after changing to the chat template):

['<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate</s> \n<|user|>\nHow many helicopters can a human eat in one sitting?</s> \n<|assistant|>\nThere is no specific number of helicopes that a human beings that a human beings that a human beings that a human beings of helicers who can a human beings of helicers who can a human beings of helic helic helicers whoever, as there is a human beast a human be a human beast a human beast a human beast a human beast a human beastrobot is a human beverballoveget a human be a human beastrophet is a human:\n|user|user|user|user|user|user']

There are some further steps I will take to improve my experiments.

danielhanchen commented 9 months ago

@ammarali32 Oh, it's not yet supported - we're working on getting it out in our next release :)

danielhanchen commented 9 months ago

@namtranase Oh wait, did you get gibberish after finetuning or before finetuning? If after finetuning, I suggest you finetune directly on the chat template, and not Alpaca, because the model you are using is already finetuned. I would use the non-chat version if you want the Alpaca style.
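To make that concrete, here is a minimal sketch of formatting an Alpaca-style dataset with the model's chat template instead of the Alpaca prompt (the dataset name and field handling here are assumptions, not the actual script from this thread):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def to_chat_template(example):
    # Fold the Alpaca instruction / input / output fields into chat turns,
    # then render them with the model's own template.
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n" + example["input"]
    messages = [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]},
    ]
    example["text"] = tokenizer.apply_chat_template(messages, tokenize = False)
    return example

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(to_chat_template)
# The "text" column can then be fed to the usual SFT trainer setup.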

namtranase commented 9 months ago

Yes, thanks @danielhanchen. I added do_sample=True, temperature=0.7, top_k=50, top_p=0.95 to the test script before finetuning and it now outputs correctly. I will be updating my model based on the chat template. I think we can close this issue.

danielhanchen commented 9 months ago

@namtranase Oh, not yet :) I am actually working on making inference 2-4x faster, which I might push in an hour :) If you can, it would be wonderful if you could test it out :))

danielhanchen commented 9 months ago

@namtranase Ok not in an hour whoops - probably in the next few days!! :(

namtranase commented 9 months ago

It's ok man, your speed is still god 💯

danielhanchen commented 9 months ago

@namtranase @ammarali32 Just pushed 2x faster inference to the main branch!! :) Hope you can try it out :)) It natively makes inference faster without any tricks - i.e. num_beams, batched generation, etc. are all faster :)

Call FastLanguageModel.for_inference(model) before doing inference to make it faster :) Call FastLanguageModel.for_training(model) to revert it back for finetuning.
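A quick usage sketch of the two calls (the model name and prompt here are placeholders, not from this thread):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-2-7b-bnb-4bit", # placeholder model
    max_seq_length = 1024,
    load_in_4bit = True,
)

FastLanguageModel.for_inference(model) # switch to the faster inference path
inputs = tokenizer("Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,", return_tensors = "pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.batch_decode(outputs)[0])

FastLanguageModel.for_training(model) # revert before resuming finetuning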

https://github.com/unslothai/unsloth/assets/23090290/dab1ea44-34bc-4585-819f-3621614ff871

namtranase commented 9 months ago

Can you share a bit about how you sped it up (if it is your secret sauce then no need)? I checked it and it is faster; I only needed to apply FastLanguageModel.for_inference(model).

Before: Inference time: 9.283933162689209 seconds
After: Inference time: 8.08155345916748 seconds

danielhanchen commented 9 months ago

@namtranase Oh interesting, it's not as fast as I would have hoped :)) Oh, it's open source - you can inspect the code however you like :))

I would have expected ~4s LOL - are you doing 4bit or 16bit, and could you tell me a bit more about your hardware? :)))

danielhanchen commented 1 month ago

Closing for now!