noamgat opened this issue 9 months ago (status: Open)
I'm hitting the same problem with the Qwen-72B model.
I had the same problem with CodeLlama-34b-Python-hf.
Have you fixed the issue? I can't run any model with TP > 1.
Are your issues related to #4431? I finally got --tensor-parallel-size 2 working. Tested against a bunch of models and it's solid.
@chrisbraddock Could you post minimal working code, please? Also, are you running in the official vLLM Docker container? If not, how did you install vLLM (from source, from PyPI)? Are you running locally or on a cloud instance?
@RomanKoshkin I've tried a few ways. What I have working now is pip-installing the 0.4.2 tag. I have it broken into a few scripts, so this will look a little strange, but it's copy/paste:
# create conda env
export ENV_NAME=vllm-pip-install
conda create --name ${ENV_NAME} python=3.8
# activate the conda env ... not scripted
# install vLLM
export TAG=0.4.2
pip install -vvv vllm==${TAG}
# start Ray - https://github.com/vllm-project/vllm/issues/4431#issuecomment-2084839647
export NUM_CPUS=10
ray start --head --num-cpus=$NUM_CPUS
# start vLLM
# model defaults
export DTYPE=auto
export QUANT=gptq_marlin
export NUM_GPUS=2
# this is the line that fixed my CUDA issues:
export LD_LIBRARY_PATH=$HOME/.config/vllm/nccl/cu12:$LD_LIBRARY_PATH
export MODEL=facebook/opt-125m
# start OpenAI compatible server
#
# https://docs.vllm.ai/en/latest/models/engine_args.html
python -m vllm.entrypoints.openai.api_server \
--model $MODEL \
--dtype $DTYPE \
--tensor-parallel-size $NUM_GPUS \
--quantization $QUANT
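Once the server is up, a quick way to sanity-check it is to hit the OpenAI-compatible endpoints (a minimal example; 8000 is the default port unless you pass --port):
# list the served models
curl http://localhost:8000/v1/models
# simple completion request against the served model
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 16}'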
@chrisbraddock I got it working in a very similar way (I described it here). The trick was to run ray in a separate terminal session and specify LD_LIBRARY_PATH correctly.
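Roughly, the two-terminal workflow looks like this (a sketch; the NCCL path is the one vLLM downloads by default, adjust for your setup and model):
# terminal 1: start a Ray head node before launching vLLM
ray start --head
# terminal 2: point NCCL at the library vLLM downloaded, then start the server
export LD_LIBRARY_PATH=$HOME/.config/vllm/nccl/cu12:$LD_LIBRARY_PATH
python -m vllm.entrypoints.openai.api_server \
  --model facebook/opt-125m \
  --tensor-parallel-size 2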
@RomanKoshkin I was working off some of your info, for sure. I didn't quite understand what you did with the library, so I ended up with the path modification.
Next is to re-enable Flash Attention and see if anything breaks. I think that's my last outstanding issue.
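If it's useful to anyone, my understanding is that the attention backend can be forced through an environment variable (worth double-checking against the docs for your vLLM version):
# force the FlashAttention backend; XFORMERS is the usual fallback value
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
python -m vllm.entrypoints.openai.api_server \
  --model $MODEL \
  --tensor-parallel-size $NUM_GPUS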
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Hi, I am trying to set up vLLM Mixtral 8x7b on GCP. I have a VM with two A100 80GBs, and am using the following setup:
Docker image: vllm/vllm-openai:v0.3.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
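For reference, this image is typically launched along these lines (the cache mount and port mapping here are illustrative, not necessarily my exact invocation):
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8888:8000 --ipc=host \
  vllm/vllm-openai:v0.3.0 \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2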
Command I use inside the VM:
python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --port 8888
Output (after a while):
nvidia-smi output:
What's wrong? Is this a bug in vLLM?
Additional diagnostics: