huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Continuous Batching Causes some Generations to Cut Short #345

Closed sam-h-bean closed 2 weeks ago

sam-h-bean commented 1 year ago

System Info

text-generation-inference: 0.6.0
Model: Vicuna 13B
OS: Bottlerocket OS
Hardware: A10 GPU on EKS

Information

Tasks

Reproduction

We have a large number of concurrent users, and during high traffic the continuous batching algorithm appears to cut some generations short. I am trying to diagnose the cause: from my analysis there are no stop tokens present, and the generation simply halts in the middle of a sentence. I am wondering whether something in the token budgeting algorithm might cause some generations to receive fewer tokens than others.
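
As a first diagnostic step, here is a minimal client-side sketch (not part of TGI itself; the URL and prompt are placeholders) that sends a request with `details` enabled and inspects why the generation stopped. In TGI the response's `details.finish_reason` distinguishes `eos_token`, `stop_sequence`, and `length`; if the cut-off answers report `length`, the truncation is coming from the token budget / max_new_tokens side rather than an unexpected stop token.

```python
# Hypothetical diagnostic script: query a running TGI instance with
# `details: true` and print the finish reason for the generation.
import json
import requests

TGI_URL = "http://localhost:8080/generate"  # adjust to your deployment

payload = {
    "inputs": "USER: Why is the sky blue?\nASSISTANT:",
    "parameters": {
        "max_new_tokens": 512,
        "details": True,  # ask TGI to return generation details
    },
}

resp = requests.post(TGI_URL, json=payload, timeout=120)
resp.raise_for_status()
body = resp.json()

details = body.get("details", {})
print("finish_reason:", details.get("finish_reason"))
print("generated_tokens:", details.get("generated_tokens"))
print(json.dumps(body.get("generated_text"), ensure_ascii=False))
```

Running this against the server while it is under load should show whether the short generations come back with `finish_reason: "length"` (budget exhausted) or something else.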

Expected behavior

Generations continue until an EOS token is generated or max new tokens is reached.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.