This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I ran into the same problem.
cc @gante
Hey @yqy2001 @TempleX98 👋
Unless the code is exactly the same, it is impossible to compare sampling implementations based on a few examples. Small things like the order of operations will produce very small logit differences, and unless the logits are exactly the same, the sampling step will pick different tokens for the same seed.
The best way to compare implementations is with greedy approaches and long outputs (especially if the comparison is done at the logit level!). In `transformers`, that is done by passing `do_sample=False`, `return_dict_in_generate=True`, and `output_scores=True`.
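For illustration, here is a minimal sketch of such a greedy, logit-level comparison on the `transformers` side. The checkpoint path and prompt are placeholders, not taken from the original discussion:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/converted-llama-13b"  # placeholder: any locally converted LLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "I believe the meaning of life is"  # example prompt, not the issue's verbatim prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding with per-step scores, so logits can be compared against
# another implementation instead of eyeballing sampled text.
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,
    return_dict_in_generate=True,
    output_scores=True,
)
print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))
print(len(out.scores), out.scores[0].shape)  # one (batch, vocab) tensor per generated step
```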
EDIT: please note that since this issue was originally opened, a few llama-specific fixes and performance improvements were merged :)
It looks like this behavior depends on which model you are using; switching to a chat model such as `Llama-2-7b-chat-hf` should resolve this issue.
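For reference, a minimal sketch of that suggestion (the generation settings are assumptions on my part, not from the comment above); `meta-llama/Llama-2-7b-chat-hf` is gated and requires accepting Meta's license on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Llama-2 chat checkpoints expect the [INST] ... [/INST] prompt format.
prompt = "[INST] What is the meaning of life? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```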
System Info
`transformers` version: main

Who can help?

@zphang @ArthurZucker @gante
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
The official LLaMA repo generates a coherent and meaningful response to the prompt below, while the Hugging Face LLaMA generates multiple responses that are not relevant to the prompt.
Official LLaMA Outputs
First, substitute the prompt as follows:
Run inference with the 13B model:
The output is:
Hugging Face LLaMA
The code to generate output with transformers' llama:
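The original snippet is not preserved in this thread; below is a minimal sketch of a comparable sampling setup, using the `temperature`/`top_p` values mentioned in the analysis. The checkpoint path and prompt are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/converted-llama-13b"  # placeholder for a locally converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "I believe the meaning of life is"  # example prompt, not the issue's verbatim prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling settings matched to the official repo's defaults (0.8 / 0.95).
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```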
The outputs seem to be more illogical (many sentences have nothing to do with "the meaning of life"):

Analysis
In LLaMA's official repo, they set `temperature` to 0.8 and `top_p` to 0.95 for generation. I have aligned this in the transformers generation. One difference is that LLaMA's official repo uses FSDP, while my transformers code has no distributed setup, but I think this should not affect the inference results (not certain).
Expected behavior
A script to reproduce the official LLaMA repo's results would be a great sanity check on the Hugging Face LLaMA implementation. Thanks!