This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I ran into the same problem.
cc @gante
Hey @yqy2001 @TempleX98 👋
Unless the code is exactly the same, it is impossible to compare sampling implementations based on a few examples. Small things like the order of operations will produce very small logit differences, and unless the logits are exactly the same, the sampling step will pick different tokens for the same seed.
The best way to compare implementations is with greedy approaches and long outputs (especially if the comparison is done at the logit level!). In `transformers`, that is done by passing `do_sample=False`, `return_dict_in_generate=True`, and `output_scores=True`.
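For illustration, here is a minimal sketch of such a greedy, logit-level comparison on the `transformers` side. The checkpoint path and prompt are placeholders, not taken from the original discussion:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/converted-llama-13b"  # placeholder: any locally converted LLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "I believe the meaning of life is"  # example prompt, not the issue's verbatim prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding with per-step scores, so logits can be compared against
# another implementation instead of eyeballing sampled text.
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,
    return_dict_in_generate=True,
    output_scores=True,
)
print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))
print(len(out.scores), out.scores[0].shape)  # one (batch, vocab) tensor per generated step
```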
EDIT: please note that since this issue was originally opened, a few llama-specific fixes and performance improvements were merged :)
It looks like this behavior depends on which model you are using; switching to a chat model such as `Llama-2-7b-chat-hf` should resolve this issue.
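For reference, a minimal sketch of that suggestion (the generation settings are assumptions on my part, not from the comment above); `meta-llama/Llama-2-7b-chat-hf` is gated and requires accepting Meta's license on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Llama-2 chat checkpoints expect the [INST] ... [/INST] prompt format.
prompt = "[INST] What is the meaning of life? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```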
System Info
`transformers` version: main

Who can help?

@zphang @ArthurZucker @gante
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
The official LLaMA repo generates a coherent and meaningful response to the prompt below, while the Hugging Face LLaMA generates multiple responses that are not relevant to the prompt.
Official LLaMA Outputs
First, substitute the prompt as follows:
Run inference with the 13B model:
The output is:
Hugging Face LLaMA
The code to generate output with transformers' llama:
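The original snippet is not preserved in this thread; below is a minimal sketch of a comparable sampling setup, using the `temperature`/`top_p` values mentioned in the analysis. The checkpoint path and prompt are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/converted-llama-13b"  # placeholder for a locally converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "I believe the meaning of life is"  # example prompt, not the issue's verbatim prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling settings matched to the official repo's defaults (0.8 / 0.95).
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```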
The outputs seem to be more illogical (many sentences have nothing to do with "the meaning of life"):

Analysis
In LLaMA's official repo, they set `temperature` to 0.8 and `top_p` to 0.95 for generation. I have aligned this in the transformers generation. One difference is that LLaMA's official repo uses FSDP, while my transformers code has no distributed setup, but I think this should not affect the inference results (not certain).
Expected behavior
A script to reproduce the official LLaMA repo's results would be a great sanity check on the Hugging Face LLaMA implementation. Thanks!