vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

KeyError when handling multiple requests simultaneously #1200

Open nutmilk10 opened 1 year ago

nutmilk10 commented 1 year ago

There was a previous thread, but I didn't see a resolution; I'm running into this issue:

OS: Ubuntu 20.04, CUDA version: 11.2, CPU: 30, CPU RAM: 200, GPU: 8x Tesla V100-SXM2

from vllm import LLM, SamplingParams

vllm = LLM(model="mosaicml/mpt-7b-instruct", trust_remote_code=True, dtype="float16", tensor_parallel_size=1)
summary_prompt = """
    Summarize the message below, delimited by triple backticks, using short bullet points.
    ```{message}```
    BULLET POINT SUMMARY:
"""

# summarize() and sampling_params are defined elsewhere in the notebook.
generated_summary = summarize(vllm, summary_prompt, sampling_params)

  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/llm.py", line 130, in generate
    return self._run_engine(use_tqdm)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/llm.py", line 150, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 559, in step
    return self._process_model_outputs(output, scheduler_outputs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 518, in _process_model_outputs
    self._process_sequence_group_samples(seq_group, samples)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 357, in _process_sequence_group_samples
    parent_child_dict[sample.parent_seq_id].append(sample)
KeyError: 266
viktor-ferenczi commented 1 year ago

Please provide the OS, CUDA version, CPU, CPU RAM, GPU(s), GPU VRAM sizes, command line you started the vLLM with, model used, prompt(s) and the full vLLM log output for diagnosis.

nutmilk10 commented 1 year ago

> Please provide the OS, CUDA version, CPU, CPU RAM, GPU(s), GPU VRAM sizes, command line you started the vLLM with, model used, prompt(s) and the full vLLM log output for diagnosis.

OS: Ubuntu 20.04, CUDA version: 11.2, CPU: 30, CPU RAM: 200, GPU: 8x Tesla V100-SXM2

from vllm import LLM, SamplingParams

vllm = LLM(model="mosaicml/mpt-7b-instruct", trust_remote_code=True, dtype="float16", tensor_parallel_size=1)
summary_prompt = """
    Summarize the message below, delimited by triple backticks, using short bullet points.
    ```{message}```
    BULLET POINT SUMMARY:
"""

# summarize() and sampling_params are defined elsewhere in the notebook.
generated_summary = summarize(vllm, summary_prompt, sampling_params)

  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/llm.py", line 130, in generate
    return self._run_engine(use_tqdm)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/llm.py", line 150, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 559, in step
    return self._process_model_outputs(output, scheduler_outputs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 518, in _process_model_outputs
    self._process_sequence_group_samples(seq_group, samples)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 357, in _process_sequence_group_samples
    parent_child_dict[sample.parent_seq_id].append(sample)
KeyError: 266
viktor-ferenczi commented 1 year ago

What does this mean? CPU: 30

What's the CPU model in /proc/cpuinfo output?

Is the problem reproducible?

What are your command line arguments when vLLM is invoked?

nutmilk10 commented 1 year ago

> What does this mean? CPU: 30
>
> What's the CPU model in /proc/cpuinfo output?
>
> Is the problem reproducible?
>
> What are your command line arguments when vLLM is invoked?

Sorry, I thought you were asking for the number of CPUs; the CPU model is Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz. The issue is reproducible, and I am invoking vLLM from a Jupyter notebook.

nehalvaghasiya commented 1 year ago

@nutmilk10 @viktor-ferenczi I am having the same issue (KeyError) while handling multiple requests simultaneously.

zhuofan-16 commented 1 year ago

Encountered the same issue

zhuofan-16 commented 1 year ago

File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/chains/base.py", line 373, in acall await self._acall(inputs, run_manager=run_manager) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/chains/llm.py", line 239, in _acall response = await self.agenerate([inputs], run_manager=run_manager) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/chains/llm.py", line 117, in agenerate return await self.llm.agenerate_prompt( File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 507, in agenerate_prompt return await self.agenerate( File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 813, in agenerate output = await self._agenerate_helper( File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 701, in _agenerate_helper raise e File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 688, in _agenerate_helper await self._agenerate( File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 467, in _agenerate return await asyncio.get_running_loop().run_in_executor( File "/opt/conda/lib/python3.10/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) File "/home/praveengovi_nlp/AIaaS_Projects/AIaas_LLM/AIaaS_LLM/src/core/controller/orchestration_layer/llm_adaptation.py", line 142, in _generate outputs = self.client.generate(prompts, sampling_params) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 157, in generate return self._run_engine(use_tqdm) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 177, in _run_engine step_outputs = self.llm_engine.step() File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 570, in step return self._process_model_outputs(output, scheduler_outputs) + ignored File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 530, in _process_model_outputs self._process_sequence_group_outputs(seq_group, outputs) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 369, in _process_sequence_group_outputs parent_child_dict[sample.parent_seq_id].append(sample) KeyError: 1

Isydmr commented 1 year ago

I'm encountering the same issue when handling multiple requests 👀

My deployment only has a single Gunicorn worker and 4 threads.

Is there any possible fix, @WoosukKwon?
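
For reference, here is a minimal sketch (not taken from any poster's code; the model, thread count, and prompts are placeholders) of the usage pattern these reports have in common: one shared LLM instance called from several worker threads at once.

```python
# Illustrative only: one shared LLM instance driven from multiple threads.
# LLM.generate() is not thread-safe, so this kind of concurrent use can
# corrupt the engine's internal scheduling state.
from concurrent.futures import ThreadPoolExecutor

from vllm import LLM, SamplingParams

llm = LLM(model="mosaicml/mpt-7b-instruct", trust_remote_code=True, dtype="float16")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

def handle_request(prompt):
    # Every worker thread calls into the same engine concurrently.
    return llm.generate([prompt], sampling_params)

with ThreadPoolExecutor(max_workers=4) as pool:
    prompts = ["Summarize message number %d." % i for i in range(16)]
    results = list(pool.map(handle_request, prompts))
```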

chris-cortner commented 11 months ago

I am also receiving this error when handling multiple requests.

wolaiye1010 commented 11 months ago

Encountered the same issue. @nutmilk10 @viktor-ferenczi, is there a quick way to handle it?

wolaiye1010 commented 11 months ago

@viktor-ferenczi I tested this: if you start vLLM with the API server (python -m vllm.entrypoints.api_server), it works fine. But if you use the offline API, i.e. from vllm import LLM, SamplingParams; llm = LLM(model="qwen/Qwen-7B-Chat", revision="v1.1.8", trust_remote_code=True), then llm.generate triggers this error. I don't know what the difference between them is.
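
That difference matches the explanation novag gives below: the API server funnels every HTTP request through a single AsyncLLMEngine on one event loop, whereas the offline LLM class assumes a single caller. As a rough client-side sketch (the /generate endpoint and the prompt/max_tokens/stream fields reflect the demo api_server of this era and may differ in other versions), concurrent requests against the server are safe:

```python
# Hedged sketch: concurrent clients of the demo server started with
#   python -m vllm.entrypoints.api_server
# Endpoint and payload fields may vary across vLLM versions.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/generate"  # demo server's default host/port

def query(prompt):
    payload = {"prompt": prompt, "max_tokens": 256, "temperature": 0.7, "stream": False}
    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["text"]  # the demo server of this era returns {"text": [...]}

with ThreadPoolExecutor(max_workers=8) as pool:
    prompts = ["Question %d: what does vLLM do?" % i for i in range(8)]
    print(list(pool.map(query, prompts)))
```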

novag commented 9 months ago

You are most probably calling LLM.generate() from different threads. The LLM class is not thread-safe. Use AsyncLLMEngine and run_coroutine_threadsafe, if applicable, instead.
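
A rough sketch of that suggestion, assuming the AsyncLLMEngine API of the vLLM versions discussed in this thread (AsyncEngineArgs, from_engine_args, and an async generate(prompt, sampling_params, request_id) generator); names and signatures may differ in newer releases:

```python
# Sketch of novag's suggestion: keep one AsyncLLMEngine on a dedicated asyncio
# event loop and submit work from worker threads via run_coroutine_threadsafe.
import asyncio
import threading
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="mosaicml/mpt-7b-instruct", trust_remote_code=True, dtype="float16")
)

# Dedicated event loop that owns all engine work.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def _generate(prompt):
    params = SamplingParams(temperature=0.7, max_tokens=256)
    final = None
    # AsyncLLMEngine.generate is an async generator yielding incremental RequestOutputs.
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        final = output
    return final.outputs[0].text

def generate_threadsafe(prompt):
    # Safe to call from any worker thread: the coroutine runs on the engine's loop.
    future = asyncio.run_coroutine_threadsafe(_generate(prompt), loop)
    return future.result()
```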

mzq20180601 commented 8 months ago

But I'm still curious: if LLM.generate() is not thread-safe, why does generate() call _add_request() internally?
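
One reading of the code (worth double-checking against the version you run): generate() first queues every prompt via _add_request() and then drives the engine loop itself, and none of that is lock-protected, so two threads interleaving those steps can leave the scheduler's bookkeeping out of sync with the outputs being processed, which is consistent with the KeyError in parent_child_dict above. If one LLM instance really must be shared between threads, a blunt workaround is to serialize the calls; prompts passed in a single generate() call are still batched internally.

```python
# Hedged workaround sketch: serialize access to a shared LLM instance.
# This gives up cross-thread concurrency; batching still happens within one call.
import threading

from vllm import LLM, SamplingParams

_llm = LLM(model="mosaicml/mpt-7b-instruct", trust_remote_code=True, dtype="float16")
_llm_lock = threading.Lock()

def generate_serialized(prompts, sampling_params):
    # Only one thread at a time may drive the engine; others wait here.
    with _llm_lock:
        return _llm.generate(prompts, sampling_params)
```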

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!