Open gracehubai opened 7 months ago
@gracehubai Oh I'm working on some chat completion notebooks as we speak! For now, a community member made one for Mistral: https://colab.research.google.com/drive/1bMOKOBzxQWUIGZBs_B0zm8pimuEnZdfM?usp=sharing - you can most likely copy the data prep part and use it in place of the one in the TinyLlama notebook :)
Hi, I'll start working on your content to see if it fits my requirement. Thanks for providing the notebook @gracehubai.
@gracehubai No problems! Thanks to the community for the notebook :) I'll add a few in the following days to address other models (like TinyLlama) :)
I don't know why, but loading this model raised some errors:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
]
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
1119 try:
-> 1120 config_class = CONFIG_MAPPING[config_dict["model_type"]]
1121 except KeyError:
16 frames
KeyError: 'tinyllama'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
ValueError: The checkpoint you are trying to load has model type `tinyllama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
During handling of the above exception, another exception occurred:
HTTPError Traceback (most recent call last)
HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/adapter_config.json
The above exception was the direct cause of the following exception:
EntryNotFoundError Traceback (most recent call last)
EntryNotFoundError: 404 Client Error. (Request ID: Root=1-65c779ef-4284f5d0695e726473d32cc5;f4df20a8-57ef-444c-951a-c8eb3740bf33)
Entry Not Found for url: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/adapter_config.json.
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
ValueError: Can't find 'adapter_config.json' at 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF'
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/unsloth/models/loader.py in from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, token, device_map, rope_scaling, fix_tokenizer, use_gradient_checkpointing, *args, **kwargs)
88 peft_config = PeftConfig.from_pretrained(model_name, token = token)
89 except:
---> 90 raise RuntimeError(f"Unsloth: `{model_name}` is not a full model or a PEFT model.")
91
92 # Check base model again for PEFT
RuntimeError: Unsloth: `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF` is not a full model or a PEFT model.
I tried this model because I'm more comfortable working with the Zephyr format:
<|system|>
{system_message}</s>
<|user|>
{prompt}</s>
<|assistant|>
TinyLlama/TinyLlama-1.1B-Chat-v1.0 works though, along with the unsloth version, but I noticed that when I download the generated model, it produces gibberish. So I would like to try TheBloke's version, but couldn't because of these errors.
Any ideas?
@gracehubai Oh wait sadly GGUF models don't work. Ohhh you're referring to the chat template. https://www.reddit.com/r/LocalLLaMA/comments/19c75cp/what_magic_does_ollama_do_to_models_tinyllama/
I.e. you need to use TinyLlama Chat's apply_chat_template so that it doesn't produce gibberish outputs.
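For reference, here is a rough sketch of what a Zephyr-style chat template produces from a list of messages. It is a hand-rolled approximation for illustration only: `render_zephyr` is a hypothetical helper, not part of Unsloth or Transformers, and the real layout lives in the tokenizer's own chat template, which may differ in whitespace details.

```python
def render_zephyr(messages, add_generation_prompt=True, eos_token="</s>"):
    """Approximate the Zephyr-style prompt layout (hypothetical helper).

    Each turn becomes "<|role|>\ncontent</s>\n". When add_generation_prompt
    is True, a trailing "<|assistant|>\n" cues the model to answer.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(render_zephyr(messages))
```

If the prompt sent to the model at inference time does not follow the layout it was finetuned on, poor or gibberish output is a likely result, so in practice it is safer to let the tokenizer build the prompt than to format it by hand.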
No, that has nothing to do with what I want to achieve. I'm trying to use TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF (which isn't a GGUF; only the name says GGUF, but it's a model published by TheBloke). I like using TheBloke's version because of its Zephyr formatting. While you mentioned Ollama, I'm much more comfortable using LM Studio for testing with different prompt templates.
Or perhaps another question might help answer my question:
What prompt template works best with unsloth/tinyllama-bnb-4bit or TinyLlama/TinyLlama-1.1B-Chat-v1.0? I tried the Zephyr prompt template with both of them, but they didn't work so well. I observed that the resulting models are not as capable as TheBloke's model with its Zephyr-style template, which is why I was wondering why loading TheBloke's model failed.
I hope someone here can try to replicate the issue.
@gracehubai I'm pretty sure https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tree/main contains only GGUF files, which is why you get the error message RuntimeError: Unsloth: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF is not a full model or a PEFT model.
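A quick way to sanity-check a repo before loading it is to look at its file list. The sketch below is only an illustration of that idea: `describe_repo` is a hypothetical helper, the file lists are made-up examples rather than real repo listings, and this is not how Unsloth's loader decides. In practice the file names could come from something like `huggingface_hub.list_repo_files(repo_id)`.

```python
def describe_repo(filenames):
    """Guess what kind of checkpoint a repo holds from its file names.

    Hypothetical helper for illustration; the heuristics are assumptions,
    not the logic of any library.
    """
    names = [name.lower() for name in filenames]
    has_config = "config.json" in names
    has_adapter = "adapter_config.json" in names
    has_weights = any(n.endswith((".safetensors", ".bin")) for n in names)
    has_gguf = any(n.endswith(".gguf") for n in names)

    if has_adapter:
        return "peft-adapter"   # LoRA/PEFT adapter on top of a base model
    if has_config and has_weights:
        return "full-model"     # regular Transformers checkpoint
    if has_gguf:
        return "gguf-only"      # llama.cpp-style files, not loadable this way
    return "unknown"


# Made-up file lists, for illustration only.
print(describe_repo(["config.json", "model.safetensors", "tokenizer.json"]))
print(describe_repo(["adapter_config.json", "adapter_model.safetensors"]))
print(describe_repo(["config.json", "README.md", "model.Q4_K_M.gguf"]))
```

A repo that falls into the last category would need the original (non-GGUF) checkpoint instead, for example TinyLlama/TinyLlama-1.1B-Chat-v1.0.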
TinyLlama is not a finetuned model but a base model, so there is no chat prompt template. That is up to you to decide during the finetuning step.
TinyLlama Chat is a finetuned model using the Zephyr chat format, as described here: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0#how-to-use
On another point, if you find that Zephyr Mistral 7b, for example, does not work under Unsloth, then we have a problem which I need to fix immediately.
TinyLlama is known to create gibberish since it's a smaller model.
@gracehubai Oh so I tried TinyLlama Chat - hope this helps :)
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    load_in_4bit = True,
    max_seq_length = 2048,
)
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(prompt, streamer = text_streamer, max_new_tokens = 1024)
The output will be:
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
There is no scientific evidence to support the claim that a human can eat one helicopter in one sitting.
The number of helicopters that a human can eat in one sitting depends on the size and type of helicopter, as well as the individual's appetite and physical condition.
It is recommended to eat small portions of food and avoid overeating to avoid feeling full and uncomfortable.</s>
@gracehubai I just added chat templates! It supports Zephyr (the one TinyLlama uses), ChatML, Vicuna etc - https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing
I hope to see a TinyLlama example included with a custom conversational dataset, particularly with ChatML. Or how do I achieve this with the provided TinyLlama "default" example? My goal is to reproduce output similar to https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0 but finetuned on an additional dataset, such as question answering.
Thank you very much.
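As a starting point for the data prep side of this, here is a minimal sketch of turning question/answer pairs into ChatML-formatted training text. Everything in it is an assumption for illustration: `qa_to_messages` and `render_chatml` are hypothetical helpers, the sample rows are made up, and a real run would more likely rely on the tokenizer's chat template (as in the notebook linked above) than on hand formatting.

```python
def qa_to_messages(question, answer, system_prompt=None):
    """Wrap one question/answer pair as a list of chat messages."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
    return messages


def render_chatml(messages):
    """Render messages in the ChatML layout: <|im_start|>role\\ncontent<|im_end|>."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


# Made-up rows standing in for a custom question answering dataset.
rows = [
    {"question": "What is the capital of France?", "answer": "Paris."},
    {"question": "How many days are in a week?", "answer": "Seven."},
]

# One formatted string per row, e.g. to use as the text column for finetuning.
texts = [render_chatml(qa_to_messages(r["question"], r["answer"])) for r in rows]
print(texts[0])
```

Note that the `<|im_start|>` and `<|im_end|>` markers only help if the tokenizer treats them consistently during training and inference, so the same template should be used on both sides.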