huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)

meta-llama/Llama-3.1-8B-Instruct generate output shows unexpected padding #1378

Closed aslanxie closed 3 weeks ago

aslanxie commented 1 month ago

System Info

optimum-habana: v1.13.2
habanalabs-dkms/jammy 1.17.1-40
DOCKER_IMAGE=vault.habana.ai/gaudi-docker/1.17.1/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest

Reproduction

  1. Clone and install optimum-habana:

    git clone https://github.com/huggingface/optimum-habana
    cd optimum-habana && git checkout v1.13.2
    pip install .
  2. Move to examples/text-generation and run:

    python3 run_generation.py --model_name_or_path meta-llama/Llama-3.1-8B-Instruct --use_hpu_graphs --limit_hpu_graph --use_kv_cache --reuse_cache --trim_logits --attn_softmax_bf16 --max_input_tokens 512 --max_new_tokens 2048 --bf16 --batch_size 1 --warmup 0 --n_iterations 3

  3. The output looks like the following; the '!' characters are unexpected padding in the output:

    Input/outputs:
    input 1: ('DeepSpeed is a machine learning framework',)
    output 1: ('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!DeepSpeed is a machine learning framework that provides a set of tools and libraries for scaling up deep learning models and training them on large datasets. It is designed to be highly efficient and scalable, allowing users to train large models on a single machine or distribute the training process across multiple machines.\n\nHere are some key features of DeepSpeed:\n\n1.  **Efficient Training**: DeepSpeed provides a set of techniques to optimize the training process, including gradient accumulation, mixed precision training, and model parallelism. These techniques can significantly reduce the training time and memory usage.\n2.  **Distributed Training** ...

Expected behavior

The expected output should be:

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework that provides a set of tools and libraries for scaling up deep learning models and training them on large datasets. It is designed to be highly efficient and scalable ...
aslanxie commented 1 month ago

Since Llama 3, the bos/eos token IDs have changed. For example, in Llama-3.1-8B-Instruct:

 "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
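
A quick way to confirm these IDs is to load the model's generation config (a minimal sketch, assuming transformers is installed and you have access to the gated meta-llama repo):

    from transformers import GenerationConfig

    # Load the generation config from the Hub and print its special token IDs.
    gen_config = GenerationConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
    print(gen_config.bos_token_id)  # expected: 128000
    print(gen_config.eos_token_id)  # expected: [128001, 128008, 128009]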

The text-generation example forces model.generation_config.pad_token_id = 0, and token ID 0 decodes to '!' in the meta-llama/Llama-3.1-8B-Instruct tokenizer. So it looks like a token ID mismatch.
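
A short sketch of how to see the mismatch, plus a possible workaround (hypothetical, not the actual upstream fix; assumes tokenizer access):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

    # Token ID 0 is an ordinary vocabulary token ('!'), so padding with it
    # shows up verbatim in the decoded output.
    print(tokenizer.decode([0]))  # '!'

    # Hypothetical workaround: pad with the tokenizer's eos token instead.
    # It is a special token, so decoding with skip_special_tokens=True drops it.
    # model.generation_config.pad_token_id = tokenizer.eos_token_id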

regisss commented 3 weeks ago

@aslanxie This should have been fixed by https://github.com/huggingface/optimum-habana/pull/1444 that I just merged into main. Can you try again on the main branch and let me know if that works on your side too?

aslanxie commented 3 weeks ago

@regisss It's working on v1.14.0 now.