huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Inference stuck! #1362

Open empty2enrich opened 1 year ago

empty2enrich commented 1 year ago

System Info

onnxoptimizer           0.2.7
optimum                 1.12.0
system: centos7

Problem: I converted Llama 2 to ONNX, but inference then gets stuck.
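
For reference, the export was done roughly like the sketch below (the model ID is a placeholder for my local Llama 2 checkpoint; the exact call may differ):

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Export the PyTorch checkpoint to ONNX with past key/values, then save it locally.
# "meta-llama/Llama-2-7b-hf" is a placeholder for the actual checkpoint used.
model = ORTModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", export=True, use_cache=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model.save_pretrained("./hugg_face_with_past/")
tokenizer.save_pretrained("./hugg_face_with_past/")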

inference code:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./hugg_face_with_past/")
# The model was exported with past key/values, so enable the KV cache and IO binding
model = ORTModelForCausalLM.from_pretrained("./hugg_face_with_past/", use_cache=True, use_io_binding=True)
inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
# Drop token_type_ids returned by the tokenizer, since the exported model does not expect them
inputs.pop('token_type_ids')

gen_tokens = model.generate(**inputs)
print(tokenizer.batch_decode(gen_tokens))

log:

2023-09-07 19:37:07.305776: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2023-09-07 19:37:08,463] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/centos/.conda/envs/llms/lib/python3.9/site-packages/transformers/generation/utils.py:1338: UserWarning: Using `max_length`'s default (4096) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(


### Who can help?

_No response_

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction (minimal, reproducible, runnable)

See the inference code above.

### Expected behavior

Inference completes successfully and produces generated text.

fxmarty commented 1 year ago

Hi @empty2enrich, which model are you using? Do you face the same issue when loading the model with transformers' `AutoModelForCausalLM`?
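
For instance, a quick sanity check against plain PyTorch could look like the following (the model ID is a placeholder; use whichever Llama 2 checkpoint you exported):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Run the same prompt through the original PyTorch checkpoint for comparison.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
gen_tokens = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(gen_tokens))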

Something you could try is to pass, for example, `max_new_tokens=3` to the `generate` call, to check whether the model is not simply generating a very long sequence. From your log (Using `max_length`'s default (4096)), that could very well be the issue.
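
Concretely, a minimal tweak of your snippet would tell us whether generation finishes once the length is bounded:

# Limit generation to a few new tokens to rule out a very long (up to 4096 tokens) generation
gen_tokens = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.batch_decode(gen_tokens))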