Open nutmilk10 opened 1 year ago
Please provide the OS, CUDA version, CPU, CPU RAM, GPU(s), GPU VRAM sizes, command line you started the vLLM with, model used, prompt(s) and the full vLLM log output for diagnosis.
OS: Ubuntu 20.04
CUDA Version: 11.2
CPU: 30
CPU RAM: 200
GPU: 8 × Tesla V100-SXM2
from vllm import LLM

vllm = LLM(model="mosaicml/mpt-7b-instruct", trust_remote_code=True, dtype="float16", tensor_parallel_size=1)

summary_prompt = """
Summarize the message below, delimited by triple backticks, using short bullet points.
```{message}```
BULLET POINT SUMMARY:
"""

# summarize() and sampling_params are defined elsewhere in the notebook
generated_summary = summarize(vllm, summary_prompt, sampling_params)
File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/llm.py", line 130, in generate
return self._run_engine(use_tqdm)
File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/llm.py", line 150, in _run_engine
step_outputs = self.llm_engine.step()
File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 559, in step
return self._process_model_outputs(output, scheduler_outputs)
File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 518, in _process_model_outputs
self._process_sequence_group_samples(seq_group, samples)
File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 357, in _process_sequence_group_samples
parent_child_dict[sample.parent_seq_id].append(sample)
KeyError: 266
What does this mean? CPU: 30
What's the CPU model in /proc/cpuinfo output?
Is the problem reproducible?
What are your command line arguments when vLLM is invoked?
Sorry, I thought you were asking for the number of CPUs. The CPU model is Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz. The issue is reproducible, and I am invoking vLLM from a Jupyter notebook.
@nutmilk10 @viktor-ferenczi I am having the same issue (KeyError) while handling multiple requests simultaneously.
Encountered the same issue
File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/chains/base.py", line 373, in acall await self._acall(inputs, run_manager=run_manager) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/chains/llm.py", line 239, in _acall response = await self.agenerate([inputs], run_manager=run_manager) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/chains/llm.py", line 117, in agenerate return await self.llm.agenerate_prompt( File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 507, in agenerate_prompt return await self.agenerate( File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 813, in agenerate output = await self._agenerate_helper( File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 701, in _agenerate_helper raise e File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 688, in _agenerate_helper await self._agenerate( File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/langchain/llms/base.py", line 467, in _agenerate return await asyncio.get_running_loop().run_in_executor( File "/opt/conda/lib/python3.10/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) File "/home/praveengovi_nlp/AIaaS_Projects/AIaas_LLM/AIaaS_LLM/src/core/controller/orchestration_layer/llm_adaptation.py", line 142, in _generate outputs = self.client.generate(prompts, sampling_params) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 157, in generate return self._run_engine(use_tqdm) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 177, in _run_engine step_outputs = self.llm_engine.step() File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 570, in step return self._process_model_outputs(output, scheduler_outputs) + ignored File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 530, in _process_model_outputs self._process_sequence_group_outputs(seq_group, outputs) File "/home/praveengovi_nlp/virtual_envs/venv_aiaas_llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 369, in _process_sequence_group_outputs parent_child_dict[sample.parent_seq_id].append(sample) KeyError: 1
I'm encountering the same issue when handling multiple requests 👀
My deployment only has a single Gunicorn worker and 4 threads.
Is there any possible fix, @WoosukKwon?
I am also receiving this error when handling multiple requests.
Encountered the same issue, @nutmilk10 @viktor-ferenczi. Is there a quick way to handle it?
@viktor-ferenczi, I tested this. If you start vLLM with the API server, it works fine:
python -m vllm.entrypoints.api_server
But if you use:
from vllm import LLM, SamplingParams
llm = LLM(model="qwen/Qwen-7B-Chat", revision="v1.1.8", trust_remote_code=True)
and then call llm.generate(), it triggers this error. I don't know what the difference between the two is.
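For reference, here is a minimal sketch of the API-server path described above, assuming the default localhost:8000 address and the /generate endpoint exposed by vllm.entrypoints.api_server; the model, prompt, and sampling values are placeholders:

```python
# Start the demo server in a separate process, e.g.:
#   python -m vllm.entrypoints.api_server --model qwen/Qwen-7B-Chat --trust-remote-code
# Then query it over HTTP. All requests are funneled through the server's single
# async engine, so there is no shared LLM object for client threads to race on.
import requests

resp = requests.post(
    "http://localhost:8000/generate",  # default host/port; adjust if needed
    json={
        "prompt": "Summarize: vLLM is a high-throughput LLM serving engine.",
        "max_tokens": 64,
        "temperature": 0.0,
    },
)
print(resp.json()["text"])
```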
You are most probably calling LLM.generate() from different threads. The LLM class is not thread-safe. Use AsyncLLMEngine and run_coroutine_threadsafe, if applicable, instead.
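A minimal sketch of that suggestion, assuming the AsyncLLMEngine interface from around this vLLM release (AsyncEngineArgs, AsyncLLMEngine.from_engine_args, and an async-generator engine.generate(prompt, sampling_params, request_id)); the model name and helper function names below are placeholders:

```python
# Sketch: give the engine its own event loop and submit work to it with
# asyncio.run_coroutine_threadsafe(), so worker threads never call the engine directly.
import asyncio
import threading
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# One dedicated event loop owns the engine; it runs on its own thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="mosaicml/mpt-7b-instruct", trust_remote_code=True))

async def _generate(prompt: str, params: SamplingParams) -> str:
    # engine.generate() is an async generator; the last yielded item is the finished output.
    final = None
    async for request_output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        final = request_output
    return final.outputs[0].text

def generate_threadsafe(prompt: str, params: SamplingParams) -> str:
    # Safe to call from any Gunicorn/Jupyter worker thread: the coroutine is
    # scheduled onto the engine's loop and this thread just blocks on the result.
    future = asyncio.run_coroutine_threadsafe(_generate(prompt, params), loop)
    return future.result()

# Example: generate_threadsafe("Hello!", SamplingParams(max_tokens=32))
```

The key point is that only the dedicated event loop ever touches the engine; the calling threads merely wait on futures.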
But I'm still curious: if LLM.generate() is not thread-safe, why does generate() call _add_request()?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
There was a previous thread, but I didn't see a resolution. I'm running into this issue:
OS: Ubuntu 20.04
CUDA Version: 11.2
CPU: 30
CPU RAM: 200
GPU: 8 × Tesla V100-SXM2