noamgat opened this issue 9 months ago (status: Open)
I'm hitting the same problem with the Qwen-72B model.
I had the same problem with CodeLlama-34b-Python-hf.
Have you fixed the issue? I can't run any model with TP > 1.
Are your issues related to #4431? I finally got --tensor-parallel-size 2 working. Tested against a bunch of models and it's solid.
@chrisbraddock Could you post minimal working code, please? Also, are you running in the official vLLM Docker container? If not, how did you install vLLM (from source, from PyPI)? Are you running locally or on a cloud instance?
@RomanKoshkin I've tried a few ways. What I have working now is pip-installing the 0.4.2 tag. I have it broken into a few scripts, so this will look a little strange, but it's copy/paste:
# create conda env
export ENV_NAME=vllm-pip-install
conda create --name ${ENV_NAME} python=3.8
# activate the conda env ... not scripted
# install vLLM
export TAG=0.4.2
pip install -vvv vllm==${TAG}
# start Ray - https://github.com/vllm-project/vllm/issues/4431#issuecomment-2084839647
export NUM_CPUS=10
ray start --head --num-cpus=$NUM_CPUS
# start vLLM
# model defaults
export DTYPE=auto
export QUANT=gptq_marlin
export NUM_GPUS=2
# this is the line that fixed my CUDA issues:
export LD_LIBRARY_PATH=$HOME/.config/vllm/nccl/cu12:$LD_LIBRARY_PATH
export MODEL=facebook/opt-125m
# start OpenAI compatible server
#
# https://docs.vllm.ai/en/latest/models/engine_args.html
python -m vllm.entrypoints.openai.api_server \
--model $MODEL \
--dtype $DTYPE \
--tensor-parallel-size $NUM_GPUS \
--quantization $QUANT
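Once the server is up, a quick way to sanity-check it is to hit the OpenAI-compatible endpoints (a minimal example; 8000 is the default port unless you pass --port):
# list the served models
curl http://localhost:8000/v1/models
# simple completion request against the served model
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 16}'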
@chrisbraddock I got it working in a very similar way (I described it here). The trick was to run ray in a separate terminal session and specify LD_LIBRARY_PATH correctly.
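Roughly, the two-terminal workflow looks like this (a sketch; the NCCL path is the one vLLM downloads by default, adjust for your setup and model):
# terminal 1: start a Ray head node before launching vLLM
ray start --head
# terminal 2: point NCCL at the library vLLM downloaded, then start the server
export LD_LIBRARY_PATH=$HOME/.config/vllm/nccl/cu12:$LD_LIBRARY_PATH
python -m vllm.entrypoints.openai.api_server \
  --model facebook/opt-125m \
  --tensor-parallel-size 2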
@RomanKoshkin I was working off some of your info, for sure. I didn't quite understand what you did with the library, so I ended up with the path modification.
Next is to re-enable Flash Attention and see if anything breaks. I think that's my last outstanding issue.
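If it's useful to anyone, my understanding is that the attention backend can be forced through an environment variable (worth double-checking against the docs for your vLLM version):
# force the FlashAttention backend; XFORMERS is the usual fallback value
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
python -m vllm.entrypoints.openai.api_server \
  --model $MODEL \
  --tensor-parallel-size $NUM_GPUS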
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Hi, I am trying to set up vLLM Mixtral 8x7b on GCP. I have a VM with two A100 80GBs, and am using the following setup:
Docker image: vllm/vllm-openai:v0.3.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
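For reference, this image is typically launched along these lines (the cache mount and port mapping here are illustrative, not necessarily my exact invocation):
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8888:8000 --ipc=host \
  vllm/vllm-openai:v0.3.0 \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2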
Command I use inside the VM:
python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --port 8888
Output (after a while):
nvidia-smi output:
What's wrong? Is this a bug in vLLM?
Additional diagnostics: