huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Inference stuck! #1362

Open empty2enrich opened 1 year ago

empty2enrich commented 1 year ago

System Info

onnxoptimizer           0.2.7
optimum                 1.12.0
system: centos7

Problem: I converted Llama 2 to ONNX, but inference then gets stuck.
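
For reference, the export was done roughly like the sketch below (the model ID is a placeholder for my local Llama 2 checkpoint; the exact call may differ):

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Export the PyTorch checkpoint to ONNX with past key/values, then save it locally.
# "meta-llama/Llama-2-7b-hf" is a placeholder for the actual checkpoint used.
model = ORTModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", export=True, use_cache=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model.save_pretrained("./hugg_face_with_past/")
tokenizer.save_pretrained("./hugg_face_with_past/")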

inference code:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./hugg_face_with_past/")
# The model was exported with past key/values, so enable the KV cache and IO binding
model = ORTModelForCausalLM.from_pretrained("./hugg_face_with_past/", use_cache=True, use_io_binding=True)
inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
# Drop token_type_ids returned by the tokenizer, since the exported model does not expect them
inputs.pop('token_type_ids')

gen_tokens = model.generate(**inputs)
print(tokenizer.batch_decode(gen_tokens))

log:

2023-09-07 19:37:07.305776: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2023-09-07 19:37:08,463] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/centos/.conda/envs/llms/lib/python3.9/site-packages/transformers/generation/utils.py:1338: UserWarning: Using `max_length`'s default (4096) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(


### Who can help?

_No response_

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction (minimal, reproducible, runnable)

See the inference code above.

### Expected behavior

Inference completes successfully and produces generated text.

fxmarty commented 1 year ago

Hi @empty2enrich, which model are you using? Do you face the same issue when loading the model with transformers' `AutoModelForCausalLM`?
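
For instance, a quick sanity check against plain PyTorch could look like the following (the model ID is a placeholder; use whichever Llama 2 checkpoint you exported):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Run the same prompt through the original PyTorch checkpoint for comparison.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
gen_tokens = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(gen_tokens))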

Something you could try is to pass, for example, `max_new_tokens=3` to the `generate` call, to check whether the model is not simply generating a very long sequence. From your log (Using `max_length`'s default (4096)), that could very well be the issue.
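
Concretely, a minimal tweak of your snippet would tell us whether generation finishes once the length is bounded:

# Limit generation to a few new tokens to rule out a very long (up to 4096 tokens) generation
gen_tokens = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.batch_decode(gen_tokens))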