triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

How can I stop generation early? #131

Open amazingkmy opened 1 year ago

amazingkmy commented 1 year ago

Description

Branch: main; GPU: A100

Reproduced Steps

Hi, I'm experimenting with GPT models using Triton + fastertransformer_backend.
I installed everything following docs/gpt_guide.md and was able to verify fast generation.
My question: I set request_output_len to 256 in the request. When I send the request to Triton, the actual output length from my model is 56, and the remaining 200 positions are filled with eos_token.
Since Triton computed those 200 eos_tokens before serving the result, I realized that unnecessary computation was going on.
I don't want to do unnecessary computation. Does Triton support stopping generation early?

Translated with www.DeepL.com/Translator (free version)