unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
15.54k stars 1.04k forks

Please include a tutorial on tinyllama for chat conversation with custom dataset #166

Open gracehubai opened 7 months ago

gracehubai commented 7 months ago

I hope to see a TinyLlama example included with a custom conversational dataset, particularly with ChatML. Or, how can I achieve this with the provided TinyLlama "default" example? My goal is to reproduce output similar to https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0 but with an additional dataset, such as question answering.

Thank you very much.

danielhanchen commented 7 months ago

@gracehubai Oh I'm working on some chat completion notebooks as we speak! For now, a community member made one for Mistral: https://colab.research.google.com/drive/1bMOKOBzxQWUIGZBs_B0zm8pimuEnZdfM?usp=sharing - you can most likely copy the data prep part and swap it into the TinyLlama notebook :)
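Roughly, that data prep swap could look like the sketch below. This is only a sketch: the file path and the "question"/"answer" column names are placeholders for your own dataset, and tokenizer is the one returned by FastLanguageModel.from_pretrained earlier in the notebook.

# A sketch of a ChatML-style formatting function for the TinyLlama notebook.
# The dataset path and the "question"/"answer" column names are placeholders.
from datasets import load_dataset

def formatting_prompts_func(examples):
    texts = []
    for question, answer in zip(examples["question"], examples["answer"]):
        text = (
            "<|im_start|>user\n" + question + "<|im_end|>\n"
            "<|im_start|>assistant\n" + answer + "<|im_end|>\n"
        ) + tokenizer.eos_token  # append EOS so the model learns to stop
        texts.append(text)
    return {"text": texts}

dataset = load_dataset("json", data_files = "my_qa_dataset.json", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)
# The resulting "text" column is what SFTTrainer consumes, exactly like the
# Alpaca-formatted dataset in the original TinyLlama notebook.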

gracehubai commented 7 months ago

Hi, I'll start working through the notebook to see if it fits my requirements. Thanks for providing it, @danielhanchen.

danielhanchen commented 7 months ago

@gracehubai No problem! Thanks to the community for the notebook :) I'll add a few over the coming days covering other models (like TinyLlama) :)

gracehubai commented 7 months ago

I don't know why, but this model throws the following errors:

from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
   1119             try:
-> 1120                 config_class = CONFIG_MAPPING[config_dict["model_type"]]
   1121             except KeyError:

16 frames
KeyError: 'tinyllama'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
ValueError: The checkpoint you are trying to load has model type `tinyllama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

During handling of the above exception, another exception occurred:

HTTPError                                 Traceback (most recent call last)
HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/adapter_config.json

The above exception was the direct cause of the following exception:

EntryNotFoundError                        Traceback (most recent call last)
EntryNotFoundError: 404 Client Error. (Request ID: Root=1-65c779ef-4284f5d0695e726473d32cc5;f4df20a8-57ef-444c-951a-c8eb3740bf33)

Entry Not Found for url: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/adapter_config.json.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
ValueError: Can't find 'adapter_config.json' at 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF'

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/unsloth/models/loader.py in from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, token, device_map, rope_scaling, fix_tokenizer, use_gradient_checkpointing, *args, **kwargs)
     88                 peft_config = PeftConfig.from_pretrained(model_name, token = token)
     89             except:
---> 90                 raise RuntimeError(f"Unsloth: `{model_name}` is not a full model or a PEFT model.")
     91 
     92             # Check base model again for PEFT

RuntimeError: Unsloth: `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF` is not a full model or a PEFT model.

Because I'm more comfortable working with Zephyr:

<|system|>
{system_message}</s>
<|user|>
{prompt}</s>
<|assistant|>

TinyLlama/TinyLlama-1.1B-Chat-v1.0 does work, along with the Unsloth version, but I noticed that when I download the generated model, it produces gibberish results. So I would like to try TheBloke's, but couldn't because of these errors.

Any ideas?

danielhanchen commented 7 months ago

@gracehubai Oh wait, sadly GGUF models don't work. Ohhh, you're referring to the chat template: https://www.reddit.com/r/LocalLLaMA/comments/19c75cp/what_magic_does_ollama_do_to_models_tinyllama/

I.e. you need to use TinyLlama Chat's apply_chat_template so it doesn't produce gibberish outputs.

gracehubai commented 7 months ago

> @gracehubai Oh wait, sadly GGUF models don't work. Ohhh, you're referring to the chat template: https://www.reddit.com/r/LocalLLaMA/comments/19c75cp/what_magic_does_ollama_do_to_models_tinyllama/
>
> I.e. you need to use TinyLlama Chat's apply_chat_template so it doesn't produce gibberish outputs.

No, that has nothing to do with what I want to achieve. I'm trying to use TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF (which isn't a GGUF; only the name says GGUF, but it's a model published by TheBloke). I like using TheBloke's version because of the Zephyr formatting. While you mentioned Ollama, I'm much more comfortable using LM Studio for testing different prompt templates.

Or perhaps another question might get at what I'm after:

Which prompt template works best with unsloth/tinyllama-bnb-4bit or TinyLlama/TinyLlama-1.1B-Chat-v1.0? I tried the Zephyr prompt template with both of them, but they didn't work very well. I observed that the models I generated are not as capable as TheBloke's with its Zephyr-style template, which is why I was wondering why it didn't work with TheBloke's model.

I hope someone here can try to replicate the issue.

danielhanchen commented 7 months ago

@gracehubai I'm pretty sure https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tree/main is a GGUF repository, hence the error message: RuntimeError: Unsloth: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF is not a full model or a PEFT model.

TinyLlama is not a finetuned model but a base model, so there is no chat prompt template. That is up to you to decide during the finetuning step.

TinyLlama Chat is a finetuned model using the ChatML format as described here: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0#how-to-use
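As a quick way to see that format concretely, the chat template ships inside the model's tokenizer config, so you can ask the tokenizer to render it. A small sketch; the message contents are just placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # placeholder
    {"role": "user", "content": "What does Unsloth do?"},           # placeholder
]
# tokenize = False returns the formatted string instead of token ids, which
# makes it easy to mirror the exact same layout in your finetuning data.
print(tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True))

The printed string shows the <|system|>/<|user|>/<|assistant|> turns and </s> separators, which is the layout to reproduce when building your own training examples.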

On another point, if you find that, for example, Zephyr Mistral 7b does not work under Unsloth, then we have a problem which I need to fix immediately.

TinyLlama is known to create gibberish since it's a smaller model.

danielhanchen commented 7 months ago

@gracehubai Oh so I tried TinyLlama Chat - hope this helps :)

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    load_in_4bit = True,
    max_seq_length = 2048,
)

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(prompt, streamer = text_streamer, max_new_tokens = 1024)

The output will be:

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s> 
<|user|>
How many helicopters can a human eat in one sitting?</s> 
<|assistant|>
There is no scientific evidence to support the claim that a human can eat one helicopter in one sitting.
The number of helicopters that a human can eat in one sitting depends on the size and type of helicopter, as well as the individual's appetite and physical condition.
It is recommended to eat small portions of food and avoid overeating to avoid feeling full and uncomfortable.</s>

danielhanchen commented 7 months ago

@gracehubai I just added chat templates! It supports Zephyr (the one TinyLlama uses), ChatML, Vicuna etc. - https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing
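If it helps, usage in that notebook looks roughly like the sketch below. This is only a sketch, assuming the get_chat_template helper from unsloth.chat_templates that the notebook introduces; argument names may differ slightly across versions, and the "conversations" column is a placeholder for your own dataset.

# A sketch of the chat-template workflow, assuming the get_chat_template helper
# from unsloth.chat_templates that the linked notebook uses.
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/tinyllama-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Attach a ready-made template to the tokenizer ("zephyr", "chatml", "vicuna", ...).
tokenizer = get_chat_template(tokenizer, chat_template = "zephyr")

def formatting_prompts_func(examples):
    # "conversations" is a placeholder column holding lists of {"role", "content"} dicts.
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}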