huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Index out of range when generate using optimum #31551

Closed: Ce-daros closed this issue 3 hours ago

Ce-daros commented 1 month ago

System Info

Who can help?

@ArthurZucker @zucchini-nlp

Reproduction

Execute the code below:

import time
from transformers import AutoConfig, AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "cache",
    "INFERENCE_PRECISION_HINT": "f16",
}

tok = AutoTokenizer.from_pretrained(
    "llama-3-8b-instruct-openvino-int4", trust_remote_code=True
)

ov_model = OVModelForCausalLM.from_pretrained(
    "llama-3-8b-instruct-openvino-int4",
    device="GPU",
    ov_config=ov_config,
    config=AutoConfig.from_pretrained(
        "llama-3-8b-instruct-openvino-int4", trust_remote_code=True
    ),
    trust_remote_code=True,
)

chat = [
    {"role": "system", "content": "You are an AI assistant that act like a pirate."},
    {"role": "user", "content": "Hey pirate, write a long diary of your pirate life!"},
]
prompt = tok.apply_chat_template(chat, tokenize=False)

input_tokens = tok(prompt, return_tensors="pt")

# Record the time before generation
start_time = time.time()

answer = ov_model.generate(
    **input_tokens,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.6,
    prompt_lookup_num_tokens=3,
)

# Record the time after generation
end_time = time.time()

# Count the number of generated tokens
num_tokens_generated = len(answer[0]) - len(input_tokens["input_ids"][0])

# Compute tokens generated per second
tokens_per_second = num_tokens_generated / (end_time - start_time)

print(tok.batch_decode(answer, skip_special_tokens=True)[0])
print(f"Tokens per second: {tokens_per_second:.2f}")
(base) PS D:\AI> python .\inference.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The argument `trust_remote_code` is to be used along with export=True. It will be ignored.
Compiling the model to GPU ...
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Traceback (most recent call last):
  File "D:\AI\inference.py", line 38, in <module>
    answer = ov_model.generate(
             ^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\optimum\intel\openvino\modeling_decoder.py", line 651, in generate
    result = super().generate(
             ^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\transformers\generation\utils.py", line 1717, in generate
    result = self._assisted_decoding(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\transformers\generation\utils.py", line 3512, in _assisted_decoding
    outputs.past_key_values = _crop_past_key_values(self, outputs.past_key_values, new_cache_size)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\spawn\AppData\Roaming\Python\Python312\site-packages\transformers\generation\candidate_generator.py", line 399, in _crop_past_key_values 
    past_key_values[idx][0][:, :, :maximum_length, :],
    ~~~~~~~~~~~~~~~~~~~~^^^
IndexError: tuple index out of range
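
For context (an addition, not part of the original report): the failing line indexes `past_key_values[idx][0]`, so the per-layer cache entry it received must have held fewer elements than the expected (key, value) pair of tensors. Below is a minimal sketch of that failure mode, assuming the stateful OpenVINO model keeps its KV cache inside the compiled model and surfaces an empty placeholder tuple instead of real cache tensors (an assumption, not confirmed in this thread):

import torch

# Simplified version of the legacy cropping loop in
# transformers.generation.candidate_generator._crop_past_key_values:
# it assumes every layer entry is a (key, value) pair of tensors.
def crop_legacy_cache(past_key_values, maximum_length):
    new_past = []
    for idx in range(len(past_key_values)):
        new_past.append(
            (
                past_key_values[idx][0][:, :, :maximum_length, :],
                past_key_values[idx][1][:, :, :maximum_length, :],
            )
        )
    return tuple(new_past)

# A regular legacy cache (tensors of shape [batch, heads, seq, head_dim])
# crops fine:
kv = tuple(
    (torch.zeros(1, 8, 10, 64), torch.zeros(1, 8, 10, 64)) for _ in range(2)
)
print(crop_legacy_cache(kv, 5)[0][0].shape)  # torch.Size([1, 8, 5, 64])

# An empty per-layer placeholder (hypothetical stand-in for what a
# stateful OpenVINO model may return) reproduces the crash:
crop_legacy_cache(((),), 5)  # IndexError: tuple index out of range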

Expected behavior

Generate output
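
For what it is worth (untested here, inferred only from the traceback): `prompt_lookup_num_tokens` is what routes `generate` into `_assisted_decoding`. A possible workaround sketch is to omit that argument so generation stays on the regular sampling loop and `_crop_past_key_values` is never reached, at the cost of losing prompt lookup speculation:

# Possible workaround (sketch): the same call as above, without prompt
# lookup decoding, so the assisted-decoding cache cropping is not invoked.
answer = ov_model.generate(
    **input_tokens,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.6,
)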

zucchini-nlp commented 1 month ago

I am not sure I can reproduce/run this without access to Intel hardware. Also cc @gante

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.