Closed CsRic closed 2 months ago
Minimal reproducer:

```python
import numpy as np
from vllm import LLM, SamplingParams


def main():
    llm = LLM(model="facebook/opt-125m")
    test_prompts = np.random.randint(10000, size=(200, 32)).tolist()
    outputs = llm.generate(
        prompt_token_ids=test_prompts,
        sampling_params=SamplingParams(
            temperature=0.0, logprobs=1, prompt_logprobs=1
        ),
    )
    for output in outputs:
        print(output)


if __name__ == "__main__":
    main()
```
This affects running EleutherAI/lm-evaluation-harness with vLLM.
### Your current environment

### 🐛 Describe the bug
I modified `examples/llm_engine_example.py` to test a large number of requests. With 200 requests of 32 random tokens each, the engine gets stuck and never produces a full answer. Modified script: llm_engine_example_heavy.py

A successful run:
output:
...
I omitted the rest. All answers are printed. The program terminated normally.
A failed run (change `--test-num` from 10 to 200):

output:
Before I hit Ctrl+C, the program had been stuck for an hour. GPU activity was 0%. The traceback always shows

```
self.backend_tokenizer.decoder.decode(tokens)
```

as the most recent frame.
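For scale, here is a rough back-of-the-envelope sketch of the CPU-side detokenization workload the reproducer implies. The prompt count (200), prompt length (32), and `prompt_logprobs=1` come from the script above; the assumption that each prompt token plus its top-1 logprob candidate each trigger a separate decode call is mine, not confirmed from the vLLM source.

```python
# Hypothetical illustration of the detokenization load in the reproducer:
# 200 prompts x 32 tokens, with prompt_logprobs=1 enabled.
num_prompts = 200
prompt_len = 32
# Assumption: the token itself plus one logprob candidate are each decoded.
decodes_per_token = 2
total_decodes = num_prompts * prompt_len * decodes_per_token
print(total_decodes)  # prints 12800
```

If each of those decode calls lands in `self.backend_tokenizer.decoder.decode(tokens)` on a single CPU thread, that would be consistent with the 0% GPU activity observed while the process appears stuck.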