huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

llama 3.2, inference error with "repetition_penalty" in generation_config #34304

Open ruian1 opened 6 days ago

ruian1 commented 6 days ago

System Info

- `transformers` version: 4.45.1
- Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.35
- Python version: 3.11.10
- Huggingface_hub version: 0.26.0
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config:    not found
- PyTorch version (GPU?): 2.4.0+cu118 (True)
- Tensorflow version (GPU?): 2.17.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@zucchini-nlp

I ran into an error when adding the `repetition_penalty` parameter to the generation_config, using the example from https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct. I added a generation_config at the bottom:

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

#output = model.generate(**inputs, max_new_tokens=30) # this one works
#print(processor.decode(output[0]))

from transformers import GenerationConfig

meta_config = {
    "bos_token_id": 128000,
    "do_sample": True,
    "eos_token_id": [128001, 128008, 128009],
    "pad_token_id": 128004,
    "temperature": 0.1,
    "top_p": 0.9,
    "transformers_version": "4.45.0.dev0",
    "max_new_tokens": 256,
    "repetition_penalty": 1.2,
}

generation_config = GenerationConfig(**meta_config)

output = model.generate(**inputs, generation_config=generation_config)
print(processor.decode(output[0]))

This results in the following error:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:07<00:00,  1.53s/it]
2559
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [5,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/root/projects/llama_huggingface/evaluation.py", line 122, in <module>
    output = base_model.generate(**inputs, generation_config=generation_config)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py", line 2048, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py", line 3018, in _sample
    next_token_scores = logits_processor(input_ids, next_token_logits)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/generation/logits_process.py", line 104, in __call__
    scores = processor(input_ids, scores)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/generation/logits_process.py", line 356, in __call__
    score = torch.where(score < 0, score * self.penalty, score / self.penalty)
                                                         ~~~~~~^~~~~~~~~~~~~~
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The meta_config is taken from https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct/blob/main/generation_config.json

Information

Tasks

Reproduction

  1. Construct a generation_config with `repetition_penalty` in it, as in the code above (at the bottom).
  2. Run the generation.

Expected behavior

Generation is expected to run smoothly with repetition_penalty.

zucchini-nlp commented 6 days ago

Hey @ruian1 !

Yes, unfortunately the model has different shapes for input vs. output embeddings (see https://github.com/huggingface/transformers/issues/33819), which causes the repetition penalty to fail because we try to gather logits for all input ids.
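
To make the failure mode concrete, here is a minimal sketch (not the library's actual code; the ids and vocab size are just taken from this model) of the out-of-bounds gather that the repetition penalty performs:

```python
import torch

# lm_head has 128256 rows, but the prompt contains the <|image|> token (id 128256),
# so gathering a score for every input id reads one past the end of the vocab dimension.
scores = torch.randn(1, 128256)
input_ids = torch.tensor([[128000, 42, 128256]])  # arbitrary ids; the last one is <|image|>

try:
    torch.gather(scores, 1, input_ids)  # same kind of gather the penalty applies
except RuntimeError as e:
    print(e)  # on CPU: "index 128256 is out of bounds"; on CUDA: device-side assert
```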

A workaround is to resize the output embeddings and add one extra token for the lm_head. Please take a look at how we resized input/output embedding shapes for llava here: https://github.com/huggingface/transformers/blob/94b50c5678047954f7790936439caff93957e638/src/transformers/models/llava_next_video/convert_llava_next_video_weights_to_hf.py#L214-L237
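
For reference, a quick way to see the mismatch the workaround addresses (attribute paths are those of the Mllama model loaded above; the shapes and tokenizer length match what is reported later in this thread):

```python
print(model.language_model.model.embed_tokens.weight.shape)  # torch.Size([128264, 4096])
print(model.language_model.lm_head.weight.shape)             # torch.Size([128256, 4096])
print(len(processor.tokenizer))                               # 128257, includes the <|image|> token
```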

@ArthurZucker wondering if we can resize output embeddings for the model on the hub, as it fails not only for training with labels but for any operation we want to do with logits and input ids? Btw, aren't those weights supposed to be tied? Sorry, I fell out of the loop when shipping the model

ArthurZucker commented 5 days ago

They are not tied! That's what causes the whole issue 😢 We can have a revision for this, but changing the weights now will unfortunately be breaking. Tho documenting this with a snippet on how to circumvent it is super welcome!

zucchini-nlp commented 5 days ago

sad 🥲 I'll add a small "Note" in the docs then to raise awareness about the issue and how to overcome it

ruian1 commented 5 days ago

Thanks for the discussion! I have some follow-up questions:

  1. It looks like I should do the resize right after loading the model, is that correct?
  2. I tried making changes using the exact code but got another error below:
    
    Cell In[15], line 1
    ----> 1 model.language_model.model.embed_tokens.weight.data[vocab_size:] = torch.stack( 
      2  tuple( 
      3      (dist.sample() for _ in range(model.language_model.model.embed_tokens.weight.data[vocab_size:].shape[0])) 
      4  ), 
      5  dim=0, 
      6 ) 
      8 model.language_model.lm_head.weight.data[vocab_size:] = torch.stack( 
      9  tuple((dist.sample() for _ in range(model.language_model.lm_head.weight.data[vocab_size:].shape[0]))), 
     10  dim=0, 
     11 ) 

RuntimeError: stack expects a non-empty TensorList


3. I did a quick check and found that `len(processor.tokenizer)` is 128257 while `vocab_size` in config.json is 128256, so I set `num_tokens` to 128257.

print(model.language_model.model.embed_tokens.weight.data.shape)
print(model.language_model.lm_head.weight.data.shape)

vocab_size = 128256
num_tokens = vocab_size + 1

model.resize_token_embeddings(num_tokens, pad_to_multiple_of=pad_shape)

model.resize_token_embeddings(num_tokens)

print(model.language_model.model.embed_tokens.weight.data.shape)
print(model.language_model.lm_head.weight.data.shape)

shows that

torch.Size([128264, 4096])
torch.Size([128256, 4096])
torch.Size([128320, 4096])
torch.Size([128256, 4096])

Is it normal that `model.language_model.lm_head.weight.data.shape` does not resize?

4. Then I got the same error from running this block:
```
model.language_model.lm_head.weight.data[vocab_size:] = torch.stack(
    tuple((dist.sample() for _ in range(model.language_model.lm_head.weight.data[vocab_size:].shape[0]))),
    dim=0,
)
```

The error is:

RuntimeError                              Traceback (most recent call last)
Cell In[12], line 8
      1 model.language_model.model.embed_tokens.weight.data[vocab_size:] = torch.stack(
      2     tuple(
      3         (dist.sample() for _ in range(model.language_model.model.embed_tokens.weight.data[vocab_size:].shape[0]))
      4     ),
      5     dim=0,
      6 )
----> 8 model.language_model.lm_head.weight.data[vocab_size:] = torch.stack(
      9     tuple((dist.sample() for _ in range(model.language_model.lm_head.weight.data[vocab_size:].shape[0]))),
     10     dim=0,
     11 )

RuntimeError: stack expects a non-empty TensorList

It would be nice if you could help check what's wrong with my modification. Thanks for the help!

zucchini-nlp commented 5 days ago

@ruian1 here is what I got to make it work, I'll add it to the docs soon

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "mv11/11"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

#output = model.generate(**inputs, max_new_tokens=30) # this one works
#print(processor.decode(output[0]))

from transformers import GenerationConfig

meta_config = {
    "bos_token_id": 128000,
    "do_sample": True,
    "eos_token_id": [128001, 128008, 128009],
    "pad_token_id": 128004,
    "temperature": 0.1,
    "top_p": 0.9,
    "transformers_version": "4.45.0.dev0",
    "max_new_tokens": 256,
    "repetition_penalty": 1.2,
}

generation_config = GenerationConfig(**meta_config)

# Fit a multivariate normal (mean + covariance) to the existing lm_head rows,
# following the same recipe as the llava conversion script linked above.
pre_expansion_embeddings = model.language_model.lm_head.weight.data
mu = torch.mean(pre_expansion_embeddings, dim=0).float()
n = pre_expansion_embeddings.size()[0]
sigma = ((pre_expansion_embeddings - mu).T @ (pre_expansion_embeddings - mu)) / n
dist = torch.distributions.multivariate_normal.MultivariateNormal(mu, covariance_matrix=1e-5 * sigma)

num_new_tokens = 1 # 1 for special `image` token
lm_head_weights = model.language_model.lm_head.weight

# Sample the new row(s), append them to lm_head, and update the size bookkeeping.
new_token_embedding = torch.stack(tuple(dist.sample() for _ in range(num_new_tokens)), dim=0).to(device=lm_head_weights.device, dtype=lm_head_weights.dtype)
lm_head_weights.data = torch.cat([lm_head_weights.data, new_token_embedding], dim=0)
lm_head_weights.num_embeddings = lm_head_weights.data.shape[0]

output = model.generate(**inputs, generation_config=generation_config)
print(processor.decode(output[0]))
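
After the expansion, lm_head has a row for every id the processor can emit (including `<|image|>`), so the repetition-penalty gather no longer indexes out of bounds. A quick sanity check one could add (not part of the snippet above):

```python
# lm_head should now have at least as many rows as the tokenizer has tokens.
assert model.language_model.lm_head.weight.shape[0] >= len(processor.tokenizer)
```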