noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

RuntimeError: CUDA error: device-side assert triggered when used together with Llama-3.2-11B-Vision-Instruct #143

Open ayylemao opened 6 days ago

ayylemao commented 6 days ago
lm-format-enforcer==0.10.7
torch==2.4.1+cu121
transformers==4.45.0

When using the library together with the newly released Llama-3.2-11B-Vision-Instruct, we get a CUDA error.


model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
processor = AutoProcessor.from_pretrained(model_id)

schema = BrandAndLogin.model_json_schema()
parser = JsonSchemaParser(schema)
prefix_func = build_transformers_prefix_allowed_tokens_fn(processor.tokenizer, parser)

[...]

messages = [
    {"role": "system", "content": [
        {"type": "text", "text": system}
    ]},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": user}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(image, input_text, return_tensors="pt").to('cuda:0')
start_generation = inputs['input_ids'].shape[1]
output = model.generate(**inputs, max_new_tokens=512, prefix_allowed_tokens_fn=prefix_func)

This leads to the following error:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [217,0,0], thread: [107,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/projects/llama32test/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/projects/llama32test/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2048, in generate
    result = self._sample(
  File "/home/projects/llama32test/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 3018, in _sample
    next_token_scores = logits_processor(input_ids, next_token_logits)
  File "/home/projects/llama32test/.venv/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 104, in __call__
    scores = processor(input_ids, scores)
  File "/home/projects/llama32test/.venv/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 1379, in __call__
    mask[batch_id * self._num_beams + beam_id, prefix_allowed_tokens] = 0
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Is there a fix for this? I understand the model is quite new, but I've never had problems with other newly released models on Hugging Face. Vision models like Idefics3-Llama-8B also worked with lm-format-enforcer without any problems.

noamgat commented 6 days ago

Thanks for reporting the issue! Can you supply a complete example?

ayylemao commented 6 days ago

Thank you for getting back to me. Here is a complete script:

import json
import torch
from PIL import Image
from pydantic import BaseModel
from transformers import MllamaForConditionalGeneration, AutoProcessor
from typing import List
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

class Brand(BaseModel):
    brands: List[str]

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
processor = AutoProcessor.from_pretrained(model_id)

schema = Brand.model_json_schema()
parser = JsonSchemaParser(schema)
prefix_func = build_transformers_prefix_allowed_tokens_fn(processor.tokenizer, parser)

user = '''Tell me what brands you can see on the provided screenshot, format it in json with the following format: '''
image_path = 'x.png'
image = Image.open(image_path)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": user+json.dumps(schema)}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(image, input_text, return_tensors="pt").to('cuda:0')
start_generation = inputs['input_ids'].shape[1]
output = model.generate(**inputs, max_new_tokens=512, prefix_allowed_tokens_fn=prefix_func)
result = processor.batch_decode(output[:, start_generation:], skip_special_tokens=True)[0]
print(result)


ayylemao commented 2 days ago

Any luck so far pinpointing the issue?

noamgat commented 1 day ago

There's something weird with the tokenizer that the model uses:

[screenshot of the tokenizer attributes] Even though the vocab size is 128000, the added special token indices exceed that range.

I added a printout of the number of tokens allowed in each timestep, and their min/max values: [screenshot of the printout]

In the first step where a special token that exceeds the range is allowed, the error is immediately triggered. From what I understand, the vocab_size is supposed to include the special tokens.
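
The mismatch can be checked directly with standard tokenizer attributes. A small sketch (the model id is the one from the script above, and the printed values are the ones observed here):

from transformers import AutoTokenizer

# Assumes access to the gated meta-llama repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

print(tokenizer.vocab_size)                  # base vocab only (128000, as in the screenshot)
print(len(tokenizer))                        # base vocab plus added special tokens
print(max(tokenizer.added_tokens_decoder))   # highest added special token id, above vocab_size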

My guess is that somewhere down the line, for the prefix_allowed_tokens_fn support, the transformers engine allocates a mask buffer of size tokenizer.vocab_size, and when the prefix function returns a list of tokens that (as in the last timestep here) exceeds that maximum index, applying the allowed-token mask throws the out-of-bounds error.

This looks like a bug in the transformers library or in the tokenizer of this specific model.
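
As a toy illustration of that guess (not the actual transformers code), indexing a mask that is narrower than the largest allowed token id reproduces exactly the device-side assert from the traceback:

import torch

# Simplified sketch of the failing pattern in PrefixConstrainedLogitsProcessor;
# the buffer width and token ids below are illustrative.
vocab_size = 128000                                  # hypothetical width of the scores/mask buffer
scores = torch.zeros(1, vocab_size, device="cuda")
mask = torch.full_like(scores, float("-inf"))

# 128009 (<|eot_id|>) is one of the Llama 3 added special token ids above 128000.
allowed_tokens = torch.tensor([11, 42, 128009], device="cuda")
mask[0, allowed_tokens] = 0                          # RuntimeError: CUDA error: device-side assert triggered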

ayylemao commented 1 day ago

Thank you for your answer. Since I want to raise the issue with the transformers or Llama 3.2 maintainers, I'm trying to pinpoint the problem. For comparison, I looked at the vocab of Meta-Llama-3.1-8B-Instruct, which looks exactly the same as the one for 3.2-11B, yet there the prefix function works perfectly.

The code used for the minimal example:

import json
import torch
from pydantic import BaseModel
from typing import List
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn
import transformers

class Brand(BaseModel): 
    brands: List[str]

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="cuda:0",
)

schema = Brand.model_json_schema()
parser = JsonSchemaParser(schema)
prefix_func = build_transformers_prefix_allowed_tokens_fn(pipeline.tokenizer, parser)

user = '''Tell me what brands are provided in this list: ["Microsoft", "Apple", "Intel"]'''

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": user+json.dumps(schema)}
    ]}
]
result = pipeline(messages, 
         prefix_allowed_tokens_fn=prefix_func,
         max_new_tokens=256
)

And if we look at the tokenizer, it looks consistent with 3.2: also a vocab_size of 128000, with added special tokens exceeding that range: [screenshot of the tokenizer attributes]
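
Running the same quick check against both tokenizers side by side might make the comparison concrete for an upstream issue. A sketch, assuming access to both gated repos:

from transformers import AutoTokenizer

for model_id in ["meta-llama/Meta-Llama-3.1-8B-Instruct",
                 "meta-llama/Llama-3.2-11B-Vision-Instruct"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    # vocab_size vs. total length vs. highest added special token id
    print(model_id, tok.vocab_size, len(tok), max(tok.added_tokens_decoder))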

This does not look like a bug on your side, but can you give me some more context so that I can create a good issue for transformers/Llama 3.2?