dromeuf opened 4 months ago
I think there is a problem with Llama 3.1 Instruct: I get an error when running it with Ollama after fine-tuning.
I get a result with this other code, but only if I delete token_type_ids, and the results are bad. It looks like I've knocked the model out rather than trained it. Llama 3.1 8B Instruct Q8 gives much better results with LM Studio. There must be a problem, because I've seen several articles on the subject, including one by Unsloth with Llama 3.1; they must have tested the results of their fine-tuning.
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Path to the directory containing the fine-tuned, merged model
model_path = "/mnt/c/Anaconda_WSL/Model_Tokenizer_Unsloth_Llama31_8Bbnb4b_merged_16b_fHF"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit quantization configuration (NF4, double quantization, bfloat16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=bnb_config,
)

print()
print(model)
print()
print(tokenizer)
print()

# Modified inference function
def generate_text(prompt, max_new_tokens=4096):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Remove token_type_ids if present (Llama does not use them)
    if 'token_type_ids' in inputs:
        del inputs['token_type_ids']
    # Move the inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.35,
            top_p=1,
        )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Example usage
prompt = "What is Vercingetorix ?"
result = generate_text(prompt)
print(result)
print()

prompt = "quote me chapter 1 of book 1 of the Gallic War in English. I would like the characters' proper names to appear in their full, long form."
result = generate_text(prompt)
print(result)
print()
Yes, the issue is token_type_ids.
In the first case, for the model itself in LM Studio, it's possible the chat template isn't right somewhere. And yes, Q8_0 will do better than Q4_K_M, but the difference shouldn't be that dramatic.
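For what it's worth, here is a minimal sketch of that check, reusing the model and tokenizer loaded in the code above and assuming the merged folder kept Llama 3.1's chat template in its tokenizer config (the prompt and sampling settings are just illustrative):

# Hypothetical sketch: format the prompt with the model's own chat template,
# so the fine-tuned Instruct model sees the special tokens it was trained on.
messages = [{"role": "user", "content": "What is Vercingetorix?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant header so the model answers
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.35)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))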
I'm seeing this with other models as well. Look at the inference-section outputs from the notebooks in the README, for example:
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers. The sequence is: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144.<|eot_id|>']
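For reference, the warning itself can usually be avoided by forwarding the attention_mask the tokenizer already returns rather than only input_ids; a minimal sketch under that assumption (not the exact notebook code, and the prompt is just the one quoted above):

# Hypothetical sketch: pass the attention_mask through to generate so the
# pad-token/eos-token ambiguity is resolved explicitly.
enc = tokenizer(
    "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,",
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],   # what the warning asks for
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,    # make the pad choice explicit
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))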
@NazimHAli If you're still seeing the issue, please update Unsloth!
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
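As a quick sanity check after upgrading, a small sketch (standard library only) to confirm which build actually got installed:

# Print the installed Unsloth version to confirm the upgrade took effect
from importlib.metadata import version
print(version("unsloth"))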
Seeing the same issue after re-running Llama-3.2-3B-Instruct with the latest versions:
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))== Unsloth 2024.10.2: Fast Llama patching. Transformers = 4.46.0.dev0.
\\ /| GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \ Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\ / Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
"-____-" Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Output of inference:
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers, starting from 1 and 1. It\'s a mathematical concept named after the Italian mathematician Leonardo Fibonacci, who introduced it in his book "Liber Abaci" in 1202. The sequence appears to']
@NazimHAli If you're still seeing the issue, please update Unsloth!
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Why does this problem occur? Can you elaborate on the cause?
I'm running into the same problem.
I fine-tuned llama3.1 8b bnb 4bits according to your recommendations with my own train+eval dataset and saved it as a merged 16-bit model. I now want to run inference by loading the 16-bit merged model, using code that works, for example, with the same merged-16-bit setup on Phi-3, but it's impossible to get an answer: I get a warning message about the attention mask and no response. The problem is probably on my side, with an error I can't see, but could you check? I've tried different parameters with no result.
Thanks for your great work!
The save command:
model.save_pretrained_merged("/content/drive/MyDrive/AI/ModelsTensorsWeights/Model_Tokenizer_Unsloth_Llama31_8Bbnb4b_merged_16b_fHF", tokenizer, save_method = "merged_16bit",)
The 16-bit merged local folder, downloaded:
The Python code for inference is this:
A print(model) & print(tokenizer), downloaded:
The output, just:
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
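To help isolate whether the merge or the plain-Transformers loading path is at fault, one possible cross-check is to reload the merged folder through Unsloth itself and generate with the chat template and an explicit attention mask. This is only a sketch: it assumes the Drive path from the save command above, and max_seq_length and the prompt are illustrative values.

from unsloth import FastLanguageModel
import torch

# Hypothetical cross-check: load the merged 16-bit folder back through Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/content/drive/MyDrive/AI/ModelsTensorsWeights/Model_Tokenizer_Unsloth_Llama31_8Bbnb4b_merged_16b_fHF",
    max_seq_length=2048,      # illustrative value
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's native inference mode

messages = [{"role": "user", "content": "What is Vercingetorix?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
attention_mask = torch.ones_like(input_ids)  # single unpadded prompt, so all-ones is valid

outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))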