huggingface / transformers


Llama-2 output from the forward function is nonsense. Output from `.generate()` is okay #31069

Closed: Tai-Mai closed this issue 5 months ago

Tai-Mai commented 5 months ago

System Info

Who can help?

@ArthurZucker @younesbelkada @gante

Information

Tasks

Reproduction

Please run the following script:

import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM, LlamaTokenizer

# model_id = "meta-llama/Llama-2-7b-chat-hf"
model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({"pad_token": "<pad>"})
model = LlamaForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", cache_dir="./cache")
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

model_input = tokenizer(
    # "Hello, how are you? ###Assistant:", 
    "Hello, how are you?", 
    return_tensors="pt",
    max_length=20,
    truncation=True
    # padding="max_length",
)
model_input["input_ids"] = model_input["input_ids"].to("cuda")
model_input["attention_mask"] = model_input["attention_mask"].to("cuda")

# Autoregressive generation with .generate()
model_output = model.generate(model_input['input_ids'], max_new_tokens=50)
output_string = tokenizer.batch_decode(model_output)[0]
print("Output with `.generate()`:\n" + output_string)
print("\n")

# Single forward pass: take the argmax of the logits at every position
model_output = model(**model_input)
output_string = tokenizer.decode(torch.argmax(model_output.logits.squeeze(), -1))
print("Output with `.forward()`:\n" + output_string)

I typically get output similar to this:

Loading checkpoint shards: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 2/2 [00:07<00:00,  3.87s/it]
Output with `.generate()`:
<s> Hello, how are you? Iā€™m doing well, thanks for asking. everybody is in good health, so I am happy. I hope you are well too.
Iā€™m very glad that you have visited my website. Iā€™m sure you are looking for a

Output with `.forward()`:
nobody, I are you? I

However, `.generate()` also often outputs grammatically correct gibberish or simply slips into German for no reason; I've seen the words Hinweis, Unterscheidung, and nobody many times. If I enable padding, the output of `.generate()` has nothing to do with my prompt "Hello, how are you?" at all.

Expected behavior

I expected the forward function to give me the same output as the `.generate()` function. The reason I wanted to use the forward function is that I have to train my model in a custom PyTorch training loop, and, as far as I understand, that's not possible with `.generate()`.
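For reference, here's a minimal sketch of the kind of training step I have in mind, using the forward pass with the usual causal-LM convention of passing the input ids as labels (the model shifts them internally); optimizer and dataloader details are omitted:

# Reusing `model` and `tokenizer` from the script above.
batch = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    labels=batch["input_ids"],  # labels = inputs for causal LM training
)
loss = outputs.loss  # next-token cross-entropy, to be minimized in the training loop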

I've been trying to troubleshoot this for 2 weeks and I'm getting really desperate. Any kind of help would be very much appreciated.

bhuvanmdev commented 5 months ago

@Tai-Mai, shouldn't the code be `model_output.logits.squeeze()[-1]`? A single forward pass produces next-token predictions at every position, so only the logits at the last position predict the token that follows your prompt; the argmax at earlier positions just predicts the successor of each input token. To replicate the `.generate()` output, you'd have to loop, appending the newly predicted token to the input and running the forward pass again, roughly like the sketch below. Do let me know if I'm wrong. Also, for generic model training, Hugging Face provides a Trainer class.
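Something along these lines (a rough greedy-decoding sketch reusing `model`, `tokenizer`, and `model_input` from your script; `.generate()` additionally handles KV caching, sampling, and stopping criteria, so this is only illustrative):

import torch

input_ids = model_input["input_ids"]
attention_mask = model_input["attention_mask"]

with torch.no_grad():
    for _ in range(50):  # max_new_tokens
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        # Only the last position's logits predict the next token.
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        attention_mask = torch.cat([attention_mask, torch.ones_like(next_token)], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.batch_decode(input_ids)[0])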

Tai-Mai commented 5 months ago

@bhuvanmdev Thanks for the reply! Yes, you're right. I was confused because I got multiple tokens back (nobody, I are you? I) and thought that meant the forward function already handled the auto-regressive generation for me, but of course that's not the case.

Here's the forum post that also helped me understand it:

https://discuss.huggingface.co/t/llama-2-output-from-forward-function-is-nonsense-generate-is-okay/88815/2?u=tai-mai

I'll close the issue. Thanks again.