dromeuf opened 4 months ago
I think there is a problem with Llama 3.1 Instruct: I get an error when running it with Ollama after fine-tuning.
I get a result with this other code, but only if I delete token_type_ids, and the results are bad. It looks like I've knocked the model out rather than trained it. Llama 3.1 8B Instruct Q8 gives much better results with LM Studio. There must be a problem, because I've seen several articles on the subject, including one by Unsloth with Llama 3.1; they must have tested the results of their fine-tuning.
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Path to the directory containing the fine-tuned, merged model
model_path = "/mnt/c/Anaconda_WSL/Model_Tokenizer_Unsloth_Llama31_8Bbnb4b_merged_16b_fHF"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit quantization configuration (NF4, double quantization, bfloat16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=bnb_config,
)

print()
print(model)
print()
print(tokenizer)
print()

# Modified inference function
def generate_text(prompt, max_new_tokens=4096):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Remove token_type_ids if present (Llama does not use them)
    if 'token_type_ids' in inputs:
        del inputs['token_type_ids']
    # Move the inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.35,
            top_p=1,
        )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Example usage
prompt = "What is Vercingetorix ?"
result = generate_text(prompt)
print(result)
print()

prompt = "quote me chapter 1 of book 1 of the Gallic War in English. I would like the characters' proper names to appear in their full, long form."
result = generate_text(prompt)
print(result)
print()
Yes, the issue is token_type_ids.
In the first case, for the model itself in LM Studio, it's possible the chat template isn't right somewhere. And yes, Q8_0 will do better than Q4_K_M, but the difference shouldn't be that dramatic.
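For what it's worth, here is a minimal sketch of that check, reusing the model and tokenizer loaded in the code above and assuming the merged folder kept Llama 3.1's chat template in its tokenizer config (the prompt and sampling settings are just illustrative):

# Hypothetical sketch: format the prompt with the model's own chat template,
# so the fine-tuned Instruct model sees the special tokens it was trained on.
messages = [{"role": "user", "content": "What is Vercingetorix?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant header so the model answers
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.35)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))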
I'm seeing this with other models as well. Look at the inference-section outputs from the notebooks in the README, for example:
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers. The sequence is: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144.<|eot_id|>']
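For reference, the warning itself can usually be avoided by forwarding the attention_mask the tokenizer already returns rather than only input_ids; a minimal sketch under that assumption (not the exact notebook code, and the prompt is just the one quoted above):

# Hypothetical sketch: pass the attention_mask through to generate so the
# pad-token/eos-token ambiguity is resolved explicitly.
enc = tokenizer(
    "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,",
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],   # what the warning asks for
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,    # make the pad choice explicit
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))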
@NazimHAli If you're still seeing the issue, please update Unsloth!
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
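As a quick sanity check after upgrading, a small sketch (standard library only) to confirm which build actually got installed:

# Print the installed Unsloth version to confirm the upgrade took effect
from importlib.metadata import version
print(version("unsloth"))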
Seeing the same issue after re-running Llama-3.2-3B-Instruct with the latest versions:
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))== Unsloth 2024.10.2: Fast Llama patching. Transformers = 4.46.0.dev0.
\\ /| GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \ Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\ / Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
"-____-" Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Output of inference:
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers, starting from 1 and 1. It\'s a mathematical concept named after the Italian mathematician Leonardo Fibonacci, who introduced it in his book "Liber Abaci" in 1202. The sequence appears to']
@NazimHAli If you're still seeing the issue, please update Unsloth!
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Why does this problem occur? Can you elaborate on the cause?
I'm running into the same problem.
I fine-tuned llama3.1 8b bnb 4bits according to your recommendations with my own train+eval dataset and saved it as a merged 16-bit model. I now want to run inference by loading the 16-bit merged model, using code that works, for example, with the same merged-16-bit setup on Phi-3, but it's impossible to get an answer: I get a warning message about the attention mask and no response. The problem is probably on my side, with an error I can't see, but could you check? I've tried different parameters with no result.
Thanks for your great work!
The save command:
model.save_pretrained_merged("/content/drive/MyDrive/AI/ModelsTensorsWeights/Model_Tokenizer_Unsloth_Llama31_8Bbnb4b_merged_16b_fHF", tokenizer, save_method = "merged_16bit",)
The 16-bit merged local folder, downloaded:
The Python code for inference is this:
A print(model) & print(tokenizer), downloaded:
The output, just:
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
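To help isolate whether the merge or the plain-Transformers loading path is at fault, one possible cross-check is to reload the merged folder through Unsloth itself and generate with the chat template and an explicit attention mask. This is only a sketch: it assumes the Drive path from the save command above, and max_seq_length and the prompt are illustrative values.

from unsloth import FastLanguageModel
import torch

# Hypothetical cross-check: load the merged 16-bit folder back through Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/content/drive/MyDrive/AI/ModelsTensorsWeights/Model_Tokenizer_Unsloth_Llama31_8Bbnb4b_merged_16b_fHF",
    max_seq_length=2048,      # illustrative value
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's native inference mode

messages = [{"role": "user", "content": "What is Vercingetorix?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
attention_mask = torch.ones_like(input_ids)  # single unpadded prompt, so all-ones is valid

outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))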