huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Index out of range when generate using optimum #31551

Closed: Ce-daros closed this issue 3 hours ago

Ce-daros commented 1 month ago

System Info

Who can help?

@ArthurZucker @zucchini-nlp

Reproduction

Execute the code below:

import time
from transformers import AutoConfig, AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "cache",
    "INFERENCE_PRECISION_HINT": "f16",
}

tok = AutoTokenizer.from_pretrained(
    "llama-3-8b-instruct-openvino-int4", trust_remote_code=True
)

ov_model = OVModelForCausalLM.from_pretrained(
    "llama-3-8b-instruct-openvino-int4",
    device="GPU",
    ov_config=ov_config,
    config=AutoConfig.from_pretrained(
        "llama-3-8b-instruct-openvino-int4", trust_remote_code=True
    ),
    trust_remote_code=True,
)

chat = [
    {"role": "system", "content": "You are an AI assistant that act like a pirate."},
    {"role": "user", "content": "Hey pirate, write a long diary of your pirate life!"},
]
prompt = tok.apply_chat_template(chat, tokenize=False)

input_tokens = tok(prompt, return_tensors="pt")

# Record the time before generation
start_time = time.time()

answer = ov_model.generate(
    **input_tokens,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.6,
    prompt_lookup_num_tokens=3,
)

# Record the time after generation
end_time = time.time()

# Count the number of generated tokens
num_tokens_generated = len(answer[0]) - len(input_tokens["input_ids"][0])

# Compute tokens generated per second
tokens_per_second = num_tokens_generated / (end_time - start_time)

print(tok.batch_decode(answer, skip_special_tokens=True)[0])
print(f"Tokens per second: {tokens_per_second:.2f}")
(base) PS D:\AI> python .\inference.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The argument `trust_remote_code` is to be used along with export=True. It will be ignored.
Compiling the model to GPU ...
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Traceback (most recent call last):
  File "D:\AI\inference.py", line 38, in <module>
    answer = ov_model.generate(
             ^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\optimum\intel\openvino\modeling_decoder.py", line 651, in generate
    result = super().generate(
             ^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\transformers\generation\utils.py", line 1717, in generate
    result = self._assisted_decoding(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\transformers\generation\utils.py", line 3512, in _assisted_decoding
    outputs.past_key_values = _crop_past_key_values(self, outputs.past_key_values, new_cache_size)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\transformers\generation\candidate_generator.py", line 399, in _crop_past_key_values 
    past_key_values[idx][0][:, :, :maximum_length, :],
    ~~~~~~~~~~~~~~~~~~~~^^^
IndexError: tuple index out of range
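
For context (an addition, not part of the original report): the failing line indexes `past_key_values[idx][0]`, so the per-layer cache entry it received must have held fewer elements than the expected (key, value) pair of tensors. Below is a minimal sketch of that failure mode, assuming the stateful OpenVINO model keeps its KV cache inside the compiled model and surfaces an empty placeholder tuple instead of real cache tensors (an assumption, not confirmed in this thread):

import torch

# Simplified version of the legacy cropping loop in
# transformers.generation.candidate_generator._crop_past_key_values:
# it assumes every layer entry is a (key, value) pair of tensors.
def crop_legacy_cache(past_key_values, maximum_length):
    new_past = []
    for idx in range(len(past_key_values)):
        new_past.append(
            (
                past_key_values[idx][0][:, :, :maximum_length, :],
                past_key_values[idx][1][:, :, :maximum_length, :],
            )
        )
    return tuple(new_past)

# A regular legacy cache (tensors of shape [batch, heads, seq, head_dim])
# crops fine:
kv = tuple(
    (torch.zeros(1, 8, 10, 64), torch.zeros(1, 8, 10, 64)) for _ in range(2)
)
print(crop_legacy_cache(kv, 5)[0][0].shape)  # torch.Size([1, 8, 5, 64])

# An empty per-layer placeholder (hypothetical stand-in for what a
# stateful OpenVINO model may return) reproduces the crash:
crop_legacy_cache(((),), 5)  # IndexError: tuple index out of range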

Expected behavior

Generate output
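
For what it is worth (untested here, inferred only from the traceback): `prompt_lookup_num_tokens` is what routes `generate` into `_assisted_decoding`. A possible workaround sketch is to omit that argument so generation stays on the regular sampling loop and `_crop_past_key_values` is never reached, at the cost of losing prompt lookup speculation:

# Possible workaround (sketch): the same call as above, without prompt
# lookup decoding, so the assisted-decoding cache cropping is not invoked.
answer = ov_model.generate(
    **input_tokens,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.6,
)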

zucchini-nlp commented 1 month ago

I am not sure I can reproduce/run this without access to Intel hardware. Also cc @gante

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.