Closed geekifan closed 1 month ago
`model.generate` generates up to 60 tokens when `max_new_tokens=60`, while `model(...)` only generates the next 1 token. Setting `max_new_tokens=1` will make things about equal.
Right, generation with only one new token should take about the same time. Let me know if it doesn't.
@zucchini-nlp @yonikremer Thanks for your reply! Setting `max_new_tokens=1` solves the problem: the speed is roughly equal with both methods. Sorry for my misunderstanding of `max_new_tokens` — I thought the model would always generate one word/token per call, no matter what `max_new_tokens` is.
System Info
transformers==4.44.0, Python 3.11, CUDA 12.4
Who can help?
@zucchini-nlp
Information
Reproduction
I need to get the hidden states when the model outputs the next token. Comparing `model.generate` with a plain forward call (`model(...)`), I find that the forward call is much faster than `model.generate`.
Example:
Model: llava-next
Prompt: `<image>\n Summarize it in one word:`
model forward call:
model.generate:
The model forward call can process 1000 samples in 2 minutes, while `model.generate` needs 20 minutes — 10x slower than the forward call.
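For the stated goal — hidden states at the moment the model outputs the next token — a single forward pass already yields both the prediction and the hidden state, so no generation loop is needed. A minimal numeric sketch with a hypothetical toy model (pure Python, not the real llava-next API):

```python
# Hypothetical toy model (not the transformers API): one forward pass
# returns both the hidden state and the logits, so the greedy next token
# and its hidden state come from a single model call.

def toy_model(token_ids):
    """Fake transformer step: returns (hidden_state, logits)."""
    hidden = [sum(token_ids) * 0.1, len(token_ids) * 0.5]  # dummy features
    logits = [2.0 * h for h in hidden]                     # dummy LM head
    return hidden, logits

def next_token_with_hidden(token_ids):
    """One forward call gives the greedy next token plus its hidden state."""
    hidden, logits = toy_model(token_ids)
    next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
    return next_id, hidden

next_id, hidden = next_token_with_hidden([3, 1, 2])
```

This mirrors why the forward call is the cheaper option here: it is one model step, whereas `generate` with a large `max_new_tokens` repeats that step for every new token.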
Expected behavior
`model.generate` should have the same speed as the model forward call.