vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: RuntimeError: CUDA error: invalid device ordinal with multi node multi gpus #3722

Open kn1011 opened 3 months ago

kn1011 commented 3 months ago

Your current environment

vLLM (0.3.3) on a Ray (2.10.0) cluster deployed with Docker on 2 nodes, each with 2 GPUs (Tesla T4).

linux environment

root@ai151:/vllm-workspace# env
NV_LIBCUBLAS_VERSION=12.1.0.26-1
NVIDIA_VISIBLE_DEVICES=all
NV_NVML_DEV_VERSION=12.1.55-1
NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.1
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1
HOSTNAME=ai151
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-12-1=12.1.0.26-1
NV_NVTX_VERSION=12.1.66-1
NV_CUDA_CUDART_DEV_VERSION=12.1.55-1
NV_LIBCUSPARSE_VERSION=12.0.2.55-1
NV_LIBNPP_VERSION=12.0.2.50-1
NCCL_VERSION=2.17.1-1
PWD=/vllm-workspace
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-1=12.1.55-1
NV_LIBNPP_PACKAGE=libnpp-12-1=12.0.2.50-1
NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
NV_LIBCUBLAS_DEV_VERSION=12.1.0.26-1
NVIDIA_PRODUCT_NAME=CUDA
NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-12-1
NV_CUDA_CUDART_VERSION=12.1.55-1
HOME=/root
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
NVIDIA_CUDA_END_OF_LIFE=1
CUDA_VERSION=12.1.0
NV_LIBCUBLAS_PACKAGE=libcublas-12-1=12.1.0.26-1
NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-1=12.1.0-1
NV_LIBNPP_DEV_PACKAGE=libnpp-dev-12-1=12.0.2.50-1
NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-1
NV_LIBNPP_DEV_VERSION=12.0.2.50-1
LESSCLOSE=/usr/bin/lesspipe %s %s
TERM=xterm
NV_LIBCUSPARSE_DEV_VERSION=12.0.2.55-1
LESSOPEN=| /usr/bin/lesspipe %s
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
SHLVL=1
NV_CUDA_LIB_VERSION=12.1.0-1
NVARCH=x86_64
NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-1
NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.1
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NV_CUDA_NSIGHT_COMPUTE_VERSION=12.1.0-1
NV_NVPROF_VERSION=12.1.55-1
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NV_LIBNCCL_PACKAGE_NAME=libnccl2
NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1
_=/usr/bin/env
pip list

root@ai151:/vllm-workspace# pip list
Package Version
------------------------- ---------------
accelerate 0.28.0
aiofiles 23.2.1
aiohttp 3.9.3
aiohttp-cors 0.7.0
aiosignal 1.3.1
altair 5.2.0
annotated-types 0.6.0
anyio 4.3.0
async-timeout 4.0.3
attrs 23.2.0
awscli 1.32.70
botocore 1.34.70
cachetools 5.3.3
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.28.4
codespell 2.2.6
colorama 0.4.4
colorful 0.5.6
contourpy 1.2.0
cycler 0.12.1
deepspeed 0.14.0
diskcache 5.6.3
distlib 0.3.8
distro 1.9.0
docutils 0.16
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.110.0
ffmpy 0.3.2
filelock 3.13.3
flash-attn 2.5.6
fonttools 4.50.0
frozenlist 1.4.1
fsspec 2024.3.1
google-api-core 2.18.0
google-auth 2.29.0
googleapis-common-protos 1.63.0
gradio 4.24.0
gradio_client 0.14.0
grpcio 1.62.1
h11 0.14.0
hjson 3.1.0
httpcore 1.0.4
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.22.1
idna 3.6
importlib_resources 6.4.0
iniconfig 2.0.0
interegular 0.3.3
isort 5.13.2
Jinja2 3.1.3
jmespath 1.0.1
joblib 1.3.2
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lark 1.1.9
llvmlite 0.42.0
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.8.3
mdurl 0.1.2
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
mypy 0.991
mypy-extensions 1.0.0
nest-asyncio 1.6.0
networkx 3.2.1
ninja 1.11.1.1
numba 0.59.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
openai 1.14.3
opencensus 0.11.4
opencensus-context 0.1.3
orjson 3.10.0
outlines 0.0.34
packaging 24.0
pandas 2.2.1
peft 0.10.0
pillow 10.2.0
pip 22.0.2
platformdirs 4.2.0
pluggy 1.4.0
prometheus_client 0.20.0
proto-plus 1.23.0
protobuf 4.25.3
psutil 5.9.8
py 1.11.0
py-cpuinfo 9.0.0
py-spy 0.3.14
pyasn1 0.5.1
pyasn1_modules 0.4.0
pydantic 2.6.4
pydantic_core 2.16.3
pydub 0.25.1
Pygments 2.17.2
pynvml 11.5.0
pyparsing 3.1.2
pytest 8.1.1
pytest-asyncio 0.23.6
pytest-forked 1.6.0
pytest-rerunfailures 14.0
pytest-shard 0.1.2
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
ray 2.10.0
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
rich 13.7.1
rpds-py 0.18.0
rsa 4.7.2
ruff 0.3.4
s3transfer 0.10.1
safetensors 0.4.2
scipy 1.12.0
semantic-version 2.10.0
sentencepiece 0.2.0
setuptools 59.6.0
shellingham 1.5.4
six 1.16.0
smart-open 7.0.4
sniffio 1.3.1
starlette 0.36.3
sympy 1.12
tokenizers 0.15.2
toml 0.10.2
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.1
torch 2.1.2
tqdm 4.66.2
transformers 4.39.1
triton 2.1.0
typer 0.11.0
types-PyYAML 6.0.12.20240311
types-requests 2.31.0.20240311
types-setuptools 69.2.0.20240317
typing_extensions 4.10.0
tzdata 2024.1
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
virtualenv 20.25.1
vllm 0.3.3
watchfiles 0.21.0
websockets 11.0.3
wheel 0.37.1
wrapt 1.16.0
xformers 0.0.23.post1
yapf 0.32.0
yarl 1.9.4

🐛 Describe the bug

vLLM works fine with --tensor-parallel-size 2, but fails with --tensor-parallel-size 4.
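
For reference, a minimal Ray check (a sketch, assuming it is run from inside one of the containers) to confirm the cluster actually exposes all 4 GPUs before requesting --tensor-parallel-size 4:

import ray

# Connect to the existing cluster (the head node is 10.4.80.151:6379 in this setup).
ray.init(address="auto")

# Expect 4.0 in total for 2 nodes x 2 Tesla T4 each.
print("total GPUs:", ray.cluster_resources().get("GPU", 0))

# Per-node view: each alive node should report 2 GPUs.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"].get("GPU", 0))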

RuntimeError: CUDA error: invalid device ordinal

root@ai151:/vllm-workspace# python3 -m vllm.entrypoints.api_server --model /models/openchat-3.5-0106/ --tensor-parallel-size 4 --dtype float16 --enforce-eager
WARNING 03-29 13:57:06 config.py:732] Casting torch.bfloat16 to torch.float16.
2024-03-29 13:57:06,969 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 10.4.80.151:6379...
2024-03-29 13:57:06,980 INFO worker.py:1743 -- Connected to Ray cluster. View the dashboard at 10.4.80.151:8265
INFO 03-29 13:57:09 llm_engine.py:70] Initializing an LLM engine (v0.3.3) with config: model=/models/openchat-3.5-0106/, tokenizer=/models/openchat-3.5-0106/, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 03-29 13:57:22 pynccl.py:49] Loading nccl from library libnccl.so
INFO 03-29 13:57:22 pynccl_utils.py:13] vLLM is using nccl==2.17.1
INFO 03-29 13:57:23 selector.py:33] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 03-29 13:57:23 selector.py:20] Using XFormers backend.
(RayWorkerVllm pid=392, ip=10.4.80.152) INFO 03-29 13:57:16 pynccl.py:49] Loading nccl from library libnccl.so
(RayWorkerVllm pid=392, ip=10.4.80.152) INFO 03-29 13:57:16 pynccl_utils.py:13] vLLM is using nccl==2.17.1
(RayWorkerVllm pid=11442) INFO 03-29 13:57:25 selector.py:33] Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerVllm pid=11442) INFO 03-29 13:57:25 selector.py:20] Using XFormers backend.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py, line 37, in execute_method
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] return executor(*args, **kwargs)
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py, line 100, in init_device
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py, line 286, in init_distributed_environment
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] pynccl_utils.init_process_group(
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/pynccl_utils.py, line 42, in init_process_group
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] comm = NCCLCommunicator(init_method=init_method,
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/pynccl.py, line 226, in __init__
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] torch.cuda.set_device(self.rank)
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py, line 404, in set_device
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] torch._C._cuda_setDevice(device)
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] RuntimeError: CUDA error: invalid device ordinal
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44]
(RayWorkerVllm pid=309, ip=10.4.80.152) Exception ignored in:
(RayWorkerVllm pid=309, ip=10.4.80.152) Traceback (most recent call last):
(RayWorkerVllm pid=309, ip=10.4.80.152) File /usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/pynccl.py, line 260, in __del__
(RayWorkerVllm pid=309, ip=10.4.80.152) _c_ncclCommDestroy(self.comm)
(RayWorkerVllm pid=309, ip=10.4.80.152) AttributeError: NCCLCommunicator object has no attribute comm
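
For context, a minimal sketch of the failure mode in the traceback above (illustrative only, not vLLM's actual worker code): torch.cuda.set_device() receives the worker's global rank, but each node only exposes two CUDA ordinals (0 and 1), so ranks 2 and 3 placed on the second node cannot map to a local device.

import torch

global_rank = 3                            # e.g. a worker placed on the second node
gpus_per_node = torch.cuda.device_count()  # 2 on a node with two Tesla T4s

try:
    # What the traceback shows: the global rank used as a local device ordinal.
    torch.cuda.set_device(global_rank)
except RuntimeError as err:
    print(err)                             # CUDA error: invalid device ordinal

# Using the node-local ordinal instead succeeds.
local_rank = global_rank % gpus_per_node   # 3 % 2 = 1
torch.cuda.set_device(local_rank)
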
youkaichao commented 3 months ago

Hi, can you try to build from source with the latest main? https://github.com/vllm-project/vllm/pull/3686 should resolve your problem I think.

BTW, when you use two nodes, do you use ray to set up the two nodes as a cluster?

kn1011 commented 3 months ago

> Hi, can you try to build from source with the latest main? #3686 should resolve your problem I think.
>
> BTW, when you use two nodes, do you use ray to set up the two nodes as a cluster?

  1. I'll try building vLLM from the latest main.
  2. Yes, the cluster is set up with the two nodes using Ray.
kn1011 commented 3 months ago

The new vLLM build (0.4.0) produces a new error.

ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.

root@ai151:/# python3 -m vllm.entrypoints.api_server --model /models/openchat-3.5-0106/ --tensor-parallel-size 4 --dtype float16 --enforce-eager
WARNING 04-01 10:49:31 config.py:748] Casting torch.bfloat16 to torch.float16.
2024-04-01 10:49:31,995 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 10.4.80.151:6379...
2024-04-01 10:49:32,005 INFO worker.py:1743 -- Connected to Ray cluster. View the dashboard at 10.4.80.151:8265
INFO 04-01 10:49:32 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model=/models/openchat-3.5-0106/, tokenizer=/models/openchat-3.5-0106/, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=306, ip=10.4.80.152) INFO 04-01 10:49:42 selector.py:34] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 04-01 10:49:52 selector.py:34] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 04-01 10:49:52 selector.py:21] Using XFormers backend.
(RayWorkerVllm pid=306, ip=10.4.80.152) INFO 04-01 10:49:42 selector.py:21] Using XFormers backend.
(RayWorkerVllm pid=11716) INFO 04-01 10:49:53 pynccl_utils.py:45] vLLM is using nccl==2.18.1
INFO 04-01 10:49:54 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=394, ip=10.4.80.152) Exception ignored in:
(RayWorkerVllm pid=394, ip=10.4.80.152) Traceback (most recent call last):
(RayWorkerVllm pid=394, ip=10.4.80.152) File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/model_executor/parallel_utils/pynccl.py, line 264, in __del__
(RayWorkerVllm pid=394, ip=10.4.80.152) _c_ncclCommDestroy(self.comm)
(RayWorkerVllm pid=394, ip=10.4.80.152) AttributeError: NCCLCommunicator object has no attribute comm
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/engine/ray_utils.py, line 37, in execute_method
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] return executor(*args, **kwargs)
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/worker/worker.py, line 100, in init_device
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/worker/worker.py, line 287, in init_distributed_environment
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] pynccl_utils.init_process_group(
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/model_executor/parallel_utils/pynccl_utils.py, line 46, in init_process_group
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] comm = NCCLCommunicator(init_method=init_method,
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/model_executor/parallel_utils/pynccl.py, line 236, in __init__
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] dist.broadcast(tensor, src=0)
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py, line 47, in wrapper
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] return func(*args, **kwargs)
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py, line 1906, in broadcast
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] work = default_pg.broadcast([tensor], opts)
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.1
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] Last error:
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] socketProgressOpt: Call to recv from 10.4.80.152<57269> failed : Broken pipe
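
One thing worth checking for this ncclRemoteError / broken pipe (a hedged suggestion, not taken from the thread): with Docker on each node, NCCL can pick the wrong network interface. NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL environment variables, but they must be present in every container (e.g. exported before ray start on each node), not only on the driver. A small Ray task can confirm what each worker actually sees:

import os
import socket

import ray

ray.init(address="auto")

@ray.remote(num_gpus=1)
def nccl_env():
    # Report which host this task landed on and its NCCL-related settings.
    keys = ("NCCL_DEBUG", "NCCL_SOCKET_IFNAME", "GLOO_SOCKET_IFNAME")
    return socket.gethostname(), {k: os.environ.get(k) for k in keys}

# Submit one task per GPU; with 4 free GPUs these typically spread across both nodes.
print(ray.get([nccl_env.remote() for _ in range(4)]))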

Here's the build log; it looks fine.

vllm master (@563c1d7ec56aa0f9fdc28720f3517bf9297f5476) build log

root@7fc7fec8839f:/tmp/vllm# python3 setup.py install
No CUDA runtime is found, using CUDA_HOME=/usr/local/cuda
running install
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` directly.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
********************************************************************************
!!
self.initialize_options()
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` and ``easy_install``.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
See https://github.com/pypa/setuptools/issues/917 for details.
********************************************************************************
!!
self.initialize_options()
running bdist_egg
running egg_info
writing vllm.egg-info/PKG-INFO
writing dependency_links to vllm.egg-info/dependency_links.txt
writing requirements to vllm.egg-info/requires.txt
writing top-level names to vllm.egg-info/top_level.txt
reading manifest file vllm.egg-info/SOURCES.txt
reading manifest template MANIFEST.in
adding license file LICENSE
writing manifest file vllm.egg-info/SOURCES.txt
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
-- Build type: RelWithDebInfo
-- Found python matching: /usr/bin/python3.
-- Caffe2: CUDA detected: 12.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 12.1
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is b51b459d
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 3.5;5.0;8.0;8.6;8.9;9.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90
CMake Warning at /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
CMakeLists.txt:64 (find_package)
-- CUDA supported arches: 7.0;7.5;8.0;8.6;8.9;9.0
-- discarding unsupported CUDA arch 3.10.
-- discarding unsupported CUDA arch 3.10.
-- CUDA target arches: 80-real;86-real;89-real;90-real
-- Punica target arches: 80-real;86-real;89-real;90-real
-- Enabling C extension.
-- Enabling moe extension.
-- Configuring done (8.2s)
-- Generating done (0.0s)
-- Build files have been written to: /tmp/vllm/build/temp.linux-x86_64-cpython-310
[3/3] Linking CXX shared module /tmp/vllm/build/lib.linux-x86_64-cpython-310/vllm/_moe_C.cpython-310-x86_64-linux-gnu.so
[6/14] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/awq/gemm_kernels.cu.o
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(282): warning #177-D: variable j_factors1 was declared but never referenced
int j_factors1 = 4;
^
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(283): warning #177-D: variable row_stride2 was declared but never referenced
int row_stride2 = 4;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(284): warning #177-D: variable split_k_iters was declared but never referenced
int split_k_iters = 1;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(290): warning #177-D: variable B_shared_warp was declared but never referenced
half B_shared_warp[32];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(291): warning #177-D: variable OC was declared but never referenced
int OC = 512;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(53): warning #177-D: variable scaling_factors_shared was declared but never referenced
half scaling_factors_shared[N];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(54): warning #177-D: variable zeros_shared was declared but never referenced
half zeros_shared[N];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(57): warning #177-D: variable blockIdx_x was declared but never referenced
int blockIdx_x = 0;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(71): warning #177-D: variable ld_zero_flag was declared but never referenced
bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < N;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(24): warning #177-D: function vllm::awq::__pack_half2 was declared but never referenced
__pack_half2(const half x, const half y) {
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(282): warning #177-D: variable j_factors1 was declared but never referenced
int j_factors1 = 4;
^
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(283): warning #177-D: variable row_stride2 was declared but never referenced
int row_stride2 = 4;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(284): warning #177-D: variable split_k_iters was declared but never referenced
int split_k_iters = 1;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(290): warning #177-D: variable B_shared_warp was declared but never referenced
half B_shared_warp[32];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(291): warning #177-D: variable OC was declared but never referenced
int OC = 512;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(53): warning #177-D: variable scaling_factors_shared was declared but never referenced
half scaling_factors_shared[N];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(54): warning #177-D: variable zeros_shared was declared but never referenced
half zeros_shared[N];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(57): warning #177-D: variable blockIdx_x was declared but never referenced
int blockIdx_x = 0;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(71): warning #177-D: variable ld_zero_flag was declared but never referenced
bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < N;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(24): warning #177-D: function vllm::awq::__pack_half2 was declared but never referenced
__pack_half2(const half x, const half y) {
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(282): warning #177-D: variable j_factors1 was declared but never referenced
int j_factors1 = 4;
^
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(283): warning #177-D: variable row_stride2 was declared but never referenced
int row_stride2 = 4;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(284): warning #177-D: variable split_k_iters was declared but never referenced
int split_k_iters = 1;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(290): warning #177-D: variable B_shared_warp was declared but never referenced
half B_shared_warp[32];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(291): warning #177-D: variable OC was declared but never referenced
int OC = 512;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(53): warning #177-D: variable scaling_factors_shared was declared but never referenced
half scaling_factors_shared[N];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(54): warning #177-D: variable zeros_shared was declared but never referenced
half zeros_shared[N];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(57): warning #177-D: variable blockIdx_x was declared but never referenced
int blockIdx_x = 0;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(71): warning #177-D: variable ld_zero_flag was declared but never referenced
bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < N;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(24): warning #177-D: function vllm::awq::__pack_half2 was declared but never referenced
__pack_half2(const half x, const half y) {
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(282): warning #177-D: variable j_factors1 was declared but never referenced
int j_factors1 = 4;
^
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(283): warning #177-D: variable row_stride2 was declared but never referenced
int row_stride2 = 4;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(284): warning #177-D: variable split_k_iters was declared but never referenced
int split_k_iters = 1;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(290): warning #177-D: variable B_shared_warp was declared but never referenced
half B_shared_warp[32];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(291): warning #177-D: variable OC was declared but never referenced
int OC = 512;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(53): warning #177-D: variable scaling_factors_shared was declared but never referenced
half scaling_factors_shared[N];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(54): warning #177-D: variable zeros_shared was declared but never referenced
half zeros_shared[N];
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(57): warning #177-D: variable blockIdx_x was declared but never referenced
int blockIdx_x = 0;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(71): warning #177-D: variable ld_zero_flag was declared but never referenced
bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < N;
^
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(24): warning #177-D: function vllm::awq::__pack_half2 was declared but never referenced
__pack_half2(const half x, const half y) {
^
[7/14] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/squeezellm/quant_cuda_kernel.cu.o
/tmp/vllm/csrc/quantization/squeezellm/quant_cuda_kernel.cu: In function ‘void squeezellm_gemm(at::Tensor, at::Tensor, at::Tensor, at::Tensor)’:
/tmp/vllm/csrc/quantization/squeezellm/quant_cuda_kernel.cu:206:136: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations]
206 | vllm::squeezellm::NUQ4MatMulKernel<<>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
247 | T * data() const {
| ^ ~~
/tmp/vllm/csrc/quantization/squeezellm/quant_cuda_kernel.cu:206:193: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations]
206 | vllm::squeezellm::NUQ4MatMulKernel<<>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
247 | T * data() const {
| ^ ~~
/tmp/vllm/csrc/quantization/squeezellm/quant_cuda_kernel.cu:206:237: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations]
206 | vllm::squeezellm::NUQ4MatMulKernel<<>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
247 | T * data() const {
| ^ ~~
[12/14] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/marlin/marlin_cuda_kernel.cu.o
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
(pipe / (group_blocks / thread_k_blocks)));
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
^
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
Remark: The warnings can be suppressed with -diag-suppress
[... the same pair of warnings #39-D and #179-D is emitted for every remaining instantiation of marlin::Marlin at lines 1033-1036 (threads=128/256, thread_m_blocks=1-4, thread_n_blocks=4/8/16, thread_k_blocks=4/8, stages=4, group_blocks=-1), and the whole block is printed again for each nvcc compilation pass; the repeated output is omitted here ...]
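These remarks are an artifact of the template parameters being compile-time constants: with group_blocks=-1, the divisor (group_blocks / thread_k_blocks) folds to 0, so nvcc flags the / at line 487 and the % at line 455 even though the group_blocks != -1 guard short-circuits and the division never executes. A minimal standalone C++ sketch of the same pattern (the function name is illustrative, not vLLM's actual kernel code):

```cpp
// Illustrative sketch only -- not vLLM code. Shows why nvcc warns about
// division/modulo by zero for the group_blocks=-1 instantiations even
// though the guarded expression can never execute at runtime.
#include <cstdio>

template <int group_blocks, int thread_k_blocks>
bool fetch_scales(int pipe) {
  // For group_blocks == -1: (-1 / thread_k_blocks) == 0, so the compiler
  // sees "pipe % 0" and emits the division-by-zero diagnostics. At runtime
  // the left-hand side of && is false, so the modulo is never evaluated.
  return group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0;
}

int main() {
  std::printf("%d\n", fetch_scales<-1, 8>(3));   // prints 0: guard short-circuits
  std::printf("%d\n", fetch_scales<128, 8>(16)); // prints 1: 16 % (128/8) == 0
  return 0;
}
```

As the Remark in the log notes, these diagnostics could be silenced with nvcc's -diag-suppress option if the noise is a concern; they do not stop the build, which continues below.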
[13/14] Building CUDA object CMakeFiles/_C.dir/csrc/attention/attention_kernels.cu.o
/tmp/vllm/csrc/attention/attention_kernels.cu(625): warning #177-D: variable thread_group_size was declared but never referenced
int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
^
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/attention/attention_kernels.cu(806): warning #177-D: variable thread_group_size was declared but never referenced
int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
^
[14/14] Linking CXX shared module /tmp/vllm/build/lib.linux-x86_64-cpython-310/vllm/_C.cpython-310-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/vllm
copying build/lib.linux-x86_64-cpython-310/vllm/test_utils.py -> build/bdist.linux-x86_64/egg/vllm
copying build/lib.linux-x86_64-cpython-310/vllm/utils.py -> build/bdist.linux-x86_64/egg/vllm
creating build/bdist.linux-x86_64/egg/vllm/attention
copying build/lib.linux-x86_64-cpython-310/vllm/attention/layer.py -> build/bdist.linux-x86_64/egg/vllm/attention
copying build/lib.linux-x86_64-cpython-310/vllm/attention/__init__.py -> build/bdist.linux-x86_64/egg/vllm/attention
creating build/bdist.linux-x86_64/egg/vllm/attention/ops
copying build/lib.linux-x86_64-cpython-310/vllm/attention/ops/paged_attn.py -> build/bdist.linux-x86_64/egg/vllm/attention/ops
copying build/lib.linux-x86_64-cpython-310/vllm/attention/ops/prefix_prefill.py -> build/bdist.linux-x86_64/egg/vllm/attention/ops
copying build/lib.linux-x86_64-cpython-310/vllm/attention/ops/__init__.py -> build/bdist.linux-x86_64/egg/vllm/attention/ops
copying build/lib.linux-x86_64-cpython-310/vllm/attention/selector.py -> build/bdist.linux-x86_64/egg/vllm/attention
creating build/bdist.linux-x86_64/egg/vllm/attention/backends
copying build/lib.linux-x86_64-cpython-310/vllm/attention/backends/flash_attn.py -> build/bdist.linux-x86_64/egg/vllm/attention/backends
copying build/lib.linux-x86_64-cpython-310/vllm/attention/backends/xformers.py -> build/bdist.linux-x86_64/egg/vllm/attention/backends
copying build/lib.linux-x86_64-cpython-310/vllm/attention/backends/__init__.py -> build/bdist.linux-x86_64/egg/vllm/attention/backends
copying build/lib.linux-x86_64-cpython-310/vllm/attention/backends/abstract.py -> build/bdist.linux-x86_64/egg/vllm/attention/backends
creating build/bdist.linux-x86_64/egg/vllm/worker
copying build/lib.linux-x86_64-cpython-310/vllm/worker/neuron_model_runner.py -> build/bdist.linux-x86_64/egg/vllm/worker
copying build/lib.linux-x86_64-cpython-310/vllm/worker/worker.py -> build/bdist.linux-x86_64/egg/vllm/worker
copying build/lib.linux-x86_64-cpython-310/vllm/worker/cache_engine.py -> build/bdist.linux-x86_64/egg/vllm/worker
copying build/lib.linux-x86_64-cpython-310/vllm/worker/model_runner.py -> build/bdist.linux-x86_64/egg/vllm/worker
copying build/lib.linux-x86_64-cpython-310/vllm/worker/__init__.py -> build/bdist.linux-x86_64/egg/vllm/worker
copying build/lib.linux-x86_64-cpython-310/vllm/worker/neuron_worker.py -> build/bdist.linux-x86_64/egg/vllm/worker
copying build/lib.linux-x86_64-cpython-310/vllm/py.typed -> build/bdist.linux-x86_64/egg/vllm
creating build/bdist.linux-x86_64/egg/vllm/transformers_utils
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils
creating build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/__init__.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/chatglm.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/jais.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/mpt.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/dbrx.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/falcon.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/detokenizer.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/__init__.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/config.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils
creating build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizers/__init__.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizers/baichuan.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers
creating build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer_group/ray_tokenizer_group.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer_group/tokenizer_group.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer_group/__init__.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group
copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer_group/base_tokenizer_group.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group
creating build/bdist.linux-x86_64/egg/vllm/lora
copying build/lib.linux-x86_64-cpython-310/vllm/lora/models.py -> build/bdist.linux-x86_64/egg/vllm/lora
copying build/lib.linux-x86_64-cpython-310/vllm/lora/utils.py -> build/bdist.linux-x86_64/egg/vllm/lora
copying build/lib.linux-x86_64-cpython-310/vllm/lora/worker_manager.py -> build/bdist.linux-x86_64/egg/vllm/lora
copying build/lib.linux-x86_64-cpython-310/vllm/lora/request.py -> build/bdist.linux-x86_64/egg/vllm/lora
copying build/lib.linux-x86_64-cpython-310/vllm/lora/__init__.py -> build/bdist.linux-x86_64/egg/vllm/lora
copying build/lib.linux-x86_64-cpython-310/vllm/lora/punica.py -> build/bdist.linux-x86_64/egg/vllm/lora
copying build/lib.linux-x86_64-cpython-310/vllm/lora/lora.py -> build/bdist.linux-x86_64/egg/vllm/lora
copying build/lib.linux-x86_64-cpython-310/vllm/lora/layers.py -> build/bdist.linux-x86_64/egg/vllm/lora
creating build/bdist.linux-x86_64/egg/vllm/core
copying build/lib.linux-x86_64-cpython-310/vllm/core/block_manager_v1.py -> build/bdist.linux-x86_64/egg/vllm/core
copying build/lib.linux-x86_64-cpython-310/vllm/core/evictor.py -> build/bdist.linux-x86_64/egg/vllm/core
copying build/lib.linux-x86_64-cpython-310/vllm/core/block_manager_v2.py -> build/bdist.linux-x86_64/egg/vllm/core
copying build/lib.linux-x86_64-cpython-310/vllm/core/scheduler.py -> build/bdist.linux-x86_64/egg/vllm/core
copying build/lib.linux-x86_64-cpython-310/vllm/core/__init__.py -> build/bdist.linux-x86_64/egg/vllm/core
copying build/lib.linux-x86_64-cpython-310/vllm/core/policy.py -> build/bdist.linux-x86_64/egg/vllm/core
copying build/lib.linux-x86_64-cpython-310/vllm/core/interfaces.py -> build/bdist.linux-x86_64/egg/vllm/core
creating build/bdist.linux-x86_64/egg/vllm/model_executor
creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/sampler.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/layernorm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers
creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe
creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=1344,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=2688,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=3584,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=1792,device_name=NVIDIA_H100_80GB_HBM3.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=3584,device_name=NVIDIA_H100_80GB_HBM3.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=1344,device_name=NVIDIA_A100-SXM4-40GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=7168,device_name=NVIDIA_H100_80GB_HBM3.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=1792,device_name=NVIDIA_A100-SXM4-40GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=1344,device_name=NVIDIA_H100_80GB_HBM3.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=2688,device_name=NVIDIA_H100_80GB_HBM3.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=3584,device_name=NVIDIA_A100-SXM4-40GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/fused_moe.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/vocab_parallel_embedding.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/rotary_embedding.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/linear.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers
creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/squeezellm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/gptq.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/marlin.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/base_config.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/awq.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers
creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/ops/sample.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/ops/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/ops/rand.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/rejection_sampler.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/logits_processor.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/activation.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/utils.py -> build/bdist.linux-x86_64/egg/vllm/model_executor
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/sampling_metadata.py -> build/bdist.linux-x86_64/egg/vllm/model_executor
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/model_loader.py -> build/bdist.linux-x86_64/egg/vllm/model_executor
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/guided_decoding.py -> build/bdist.linux-x86_64/egg/vllm/model_executor
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/guided_logits_processors.py -> build/bdist.linux-x86_64/egg/vllm/model_executor
creating build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/parallel_state.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/utils.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/custom_all_reduce.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/communication_op.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/pynccl.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/pynccl_utils.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/weight_utils.py -> build/bdist.linux-x86_64/egg/vllm/model_executor
creating build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/bloom.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/orion.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/qwen2_moe.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/decilm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gpt2.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gpt_bigcode.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/qwen2.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/starcoder2.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/opt.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/mixtral_quant.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/olmo.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gemma.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/phi.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/llama.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gpt_j.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/chatglm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/commandr.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/xverse.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/stablelm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/jais.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/internlm2.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/mixtral.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/qwen.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/deepseek.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/mpt.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gpt_neox.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/llava.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/dbrx.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/baichuan.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/falcon.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/neuron_model_loader.py -> build/bdist.linux-x86_64/egg/vllm/model_executor
creating build/bdist.linux-x86_64/egg/vllm/executor
copying build/lib.linux-x86_64-cpython-310/vllm/executor/utils.py -> build/bdist.linux-x86_64/egg/vllm/executor
copying build/lib.linux-x86_64-cpython-310/vllm/executor/ray_gpu_executor.py -> build/bdist.linux-x86_64/egg/vllm/executor
copying build/lib.linux-x86_64-cpython-310/vllm/executor/executor_base.py -> build/bdist.linux-x86_64/egg/vllm/executor
copying build/lib.linux-x86_64-cpython-310/vllm/executor/gpu_executor.py -> build/bdist.linux-x86_64/egg/vllm/executor
copying build/lib.linux-x86_64-cpython-310/vllm/executor/neuron_executor.py -> build/bdist.linux-x86_64/egg/vllm/executor
copying build/lib.linux-x86_64-cpython-310/vllm/executor/__init__.py -> build/bdist.linux-x86_64/egg/vllm/executor
copying build/lib.linux-x86_64-cpython-310/vllm/sampling_params.py -> build/bdist.linux-x86_64/egg/vllm
copying build/lib.linux-x86_64-cpython-310/vllm/__init__.py -> build/bdist.linux-x86_64/egg/vllm
copying build/lib.linux-x86_64-cpython-310/vllm/config.py -> build/bdist.linux-x86_64/egg/vllm
creating build/bdist.linux-x86_64/egg/vllm/entrypoints
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/llm.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/__init__.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/api_server.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints
creating build/bdist.linux-x86_64/egg/vllm/entrypoints/openai
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/serving_completion.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/serving_chat.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/__init__.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/api_server.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/protocol.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/cli_args.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai
copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/serving_engine.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai
copying build/lib.linux-x86_64-cpython-310/vllm/outputs.py -> build/bdist.linux-x86_64/egg/vllm
creating build/bdist.linux-x86_64/egg/vllm/usage
copying build/lib.linux-x86_64-cpython-310/vllm/usage/usage_lib.py -> build/bdist.linux-x86_64/egg/vllm/usage
copying build/lib.linux-x86_64-cpython-310/vllm/usage/__init__.py -> build/bdist.linux-x86_64/egg/vllm/usage
copying build/lib.linux-x86_64-cpython-310/vllm/block.py -> build/bdist.linux-x86_64/egg/vllm
copying build/lib.linux-x86_64-cpython-310/vllm/logger.py -> build/bdist.linux-x86_64/egg/vllm
creating build/bdist.linux-x86_64/egg/vllm/engine
copying build/lib.linux-x86_64-cpython-310/vllm/engine/async_llm_engine.py -> build/bdist.linux-x86_64/egg/vllm/engine
copying build/lib.linux-x86_64-cpython-310/vllm/engine/arg_utils.py -> build/bdist.linux-x86_64/egg/vllm/engine
copying build/lib.linux-x86_64-cpython-310/vllm/engine/metrics.py -> build/bdist.linux-x86_64/egg/vllm/engine
copying build/lib.linux-x86_64-cpython-310/vllm/engine/__init__.py -> build/bdist.linux-x86_64/egg/vllm/engine
copying build/lib.linux-x86_64-cpython-310/vllm/engine/ray_utils.py -> build/bdist.linux-x86_64/egg/vllm/engine
copying build/lib.linux-x86_64-cpython-310/vllm/engine/llm_engine.py -> build/bdist.linux-x86_64/egg/vllm/engine
copying build/lib.linux-x86_64-cpython-310/vllm/sequence.py -> build/bdist.linux-x86_64/egg/vllm
copying build/lib.linux-x86_64-cpython-310/vllm/_moe_C.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/vllm
copying build/lib.linux-x86_64-cpython-310/vllm/_C.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/vllm
creating build/bdist.linux-x86_64/egg/tests
creating build/bdist.linux-x86_64/egg/tests/spec_decode
copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_utils.py -> build/bdist.linux-x86_64/egg/tests/spec_decode
copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/utils.py -> build/bdist.linux-x86_64/egg/tests/spec_decode
copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_batch_expansion.py -> build/bdist.linux-x86_64/egg/tests/spec_decode
copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_metrics.py -> build/bdist.linux-x86_64/egg/tests/spec_decode
copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/__init__.py -> build/bdist.linux-x86_64/egg/tests/spec_decode
copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_spec_decode_worker.py -> build/bdist.linux-x86_64/egg/tests/spec_decode
copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_multi_step_worker.py -> build/bdist.linux-x86_64/egg/tests/spec_decode
creating build/bdist.linux-x86_64/egg/tests/worker
copying build/lib.linux-x86_64-cpython-310/tests/worker/test_swap.py -> build/bdist.linux-x86_64/egg/tests/worker
copying build/lib.linux-x86_64-cpython-310/tests/worker/__init__.py -> build/bdist.linux-x86_64/egg/tests/worker
copying build/lib.linux-x86_64-cpython-310/tests/worker/test_model_runner.py -> build/bdist.linux-x86_64/egg/tests/worker
creating build/bdist.linux-x86_64/egg/tests/tokenization
copying build/lib.linux-x86_64-cpython-310/tests/tokenization/test_tokenizer_group.py -> build/bdist.linux-x86_64/egg/tests/tokenization
copying build/lib.linux-x86_64-cpython-310/tests/tokenization/test_detokenize.py -> build/bdist.linux-x86_64/egg/tests/tokenization
copying build/lib.linux-x86_64-cpython-310/tests/tokenization/__init__.py -> build/bdist.linux-x86_64/egg/tests/tokenization
copying build/lib.linux-x86_64-cpython-310/tests/tokenization/test_cached_tokenizer.py -> build/bdist.linux-x86_64/egg/tests/tokenization
creating build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_tokenizer_group.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_utils.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_lora.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/utils.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_layers.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_mixtral.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_chatglm3.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_lora_manager.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/conftest.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_llama.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_worker.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_punica.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/__init__.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_baichuan.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_layer_variation.py -> build/bdist.linux-x86_64/egg/tests/lora
copying build/lib.linux-x86_64-cpython-310/tests/lora/test_gemma.py -> build/bdist.linux-x86_64/egg/tests/lora
creating build/bdist.linux-x86_64/egg/tests/core
copying build/lib.linux-x86_64-cpython-310/tests/core/utils.py -> build/bdist.linux-x86_64/egg/tests/core
copying build/lib.linux-x86_64-cpython-310/tests/core/test_block_manager.py -> build/bdist.linux-x86_64/egg/tests/core
copying build/lib.linux-x86_64-cpython-310/tests/core/test_scheduler.py -> build/bdist.linux-x86_64/egg/tests/core
creating build/bdist.linux-x86_64/egg/tests/core/block
copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_naive_block.py -> build/bdist.linux-x86_64/egg/tests/core/block
copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_cpu_gpu_block_allocator.py -> build/bdist.linux-x86_64/egg/tests/core/block
copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_common.py -> build/bdist.linux-x86_64/egg/tests/core/block
copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_prefix_caching_block.py -> build/bdist.linux-x86_64/egg/tests/core/block
copying build/lib.linux-x86_64-cpython-310/tests/core/block/__init__.py -> build/bdist.linux-x86_64/egg/tests/core/block
copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_block_table.py -> build/bdist.linux-x86_64/egg/tests/core/block
copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_block_space_manager.py -> build/bdist.linux-x86_64/egg/tests/core/block
copying build/lib.linux-x86_64-cpython-310/tests/core/__init__.py -> build/bdist.linux-x86_64/egg/tests/core
byte-compiling build/bdist.linux-x86_64/egg/vllm/test_utils.py to test_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/layer.py to layer.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/ops/paged_attn.py to paged_attn.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/ops/prefix_prefill.py to prefix_prefill.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/ops/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/selector.py to selector.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/backends/flash_attn.py to flash_attn.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/backends/xformers.py to xformers.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/backends/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/backends/abstract.py to abstract.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/neuron_model_runner.py to neuron_model_runner.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/worker.py to worker.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/cache_engine.py to cache_engine.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/model_runner.py to model_runner.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/neuron_worker.py to neuron_worker.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer.py to tokenizer.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/chatglm.py to chatglm.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/jais.py to jais.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/mpt.py to mpt.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/dbrx.py to dbrx.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/falcon.py to falcon.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/detokenizer.py to detokenizer.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/config.py to config.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers/baichuan.py to baichuan.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group/ray_tokenizer_group.py to ray_tokenizer_group.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group/tokenizer_group.py to tokenizer_group.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group/base_tokenizer_group.py to base_tokenizer_group.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/models.py to models.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/worker_manager.py to worker_manager.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/request.py to request.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/punica.py to punica.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/lora.py to lora.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/layers.py to layers.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/core/block_manager_v1.py to block_manager_v1.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/core/evictor.py to evictor.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/core/block_manager_v2.py to block_manager_v2.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/core/scheduler.py to scheduler.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/core/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/core/policy.py to policy.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/core/interfaces.py to interfaces.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/sampler.py to sampler.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/layernorm.py to layernorm.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/fused_moe.py to fused_moe.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/vocab_parallel_embedding.py to vocab_parallel_embedding.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/rotary_embedding.py to rotary_embedding.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/linear.py to linear.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/squeezellm.py to squeezellm.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/gptq.py to gptq.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/marlin.py to marlin.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/base_config.py to base_config.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/awq.py to awq.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops/sample.py to sample.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops/rand.py to rand.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/rejection_sampler.py to rejection_sampler.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/logits_processor.py to logits_processor.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/activation.py to activation.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/sampling_metadata.py to sampling_metadata.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/model_loader.py to model_loader.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/guided_decoding.py to guided_decoding.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/guided_logits_processors.py to guided_logits_processors.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/parallel_state.py to parallel_state.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/custom_all_reduce.py to custom_all_reduce.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/communication_op.py to communication_op.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/pynccl.py to pynccl.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/pynccl_utils.py to pynccl_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/weight_utils.py to weight_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/bloom.py to bloom.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/orion.py to orion.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/qwen2_moe.py to qwen2_moe.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/decilm.py to decilm.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gpt2.py to gpt2.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gpt_bigcode.py to gpt_bigcode.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/qwen2.py to qwen2.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/starcoder2.py to starcoder2.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/opt.py to opt.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/mixtral_quant.py to mixtral_quant.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/olmo.py to olmo.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gemma.py to gemma.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/phi.py to phi.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/llama.py to llama.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gpt_j.py to gpt_j.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/chatglm.py to chatglm.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/commandr.py to commandr.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/xverse.py to xverse.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/stablelm.py to stablelm.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/jais.py to jais.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/internlm2.py to internlm2.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/mixtral.py to mixtral.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/qwen.py to qwen.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/deepseek.py to deepseek.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/mpt.py to mpt.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gpt_neox.py to gpt_neox.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/llava.py to llava.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/dbrx.py to dbrx.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/baichuan.py to baichuan.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/falcon.py to falcon.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/neuron_model_loader.py to neuron_model_loader.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/ray_gpu_executor.py to ray_gpu_executor.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/executor_base.py to executor_base.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/gpu_executor.py to gpu_executor.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/neuron_executor.py to neuron_executor.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/sampling_params.py to sampling_params.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/config.py to config.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/llm.py to llm.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/api_server.py to api_server.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/serving_completion.py to serving_completion.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/serving_chat.py to serving_chat.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/api_server.py to api_server.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/protocol.py to protocol.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/cli_args.py to cli_args.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/serving_engine.py to serving_engine.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/outputs.py to outputs.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/usage/usage_lib.py to usage_lib.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/usage/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/block.py to block.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/logger.py to logger.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/async_llm_engine.py to async_llm_engine.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/arg_utils.py to arg_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/metrics.py to metrics.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/ray_utils.py to ray_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/llm_engine.py to llm_engine.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/sequence.py to sequence.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_utils.py to test_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_batch_expansion.py to test_batch_expansion.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_metrics.py to test_metrics.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_spec_decode_worker.py to test_spec_decode_worker.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_multi_step_worker.py to test_multi_step_worker.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/worker/test_swap.py to test_swap.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/worker/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/worker/test_model_runner.py to test_model_runner.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/tokenization/test_tokenizer_group.py to test_tokenizer_group.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/tokenization/test_detokenize.py to test_detokenize.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/tokenization/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/tokenization/test_cached_tokenizer.py to test_cached_tokenizer.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_tokenizer_group.py to test_tokenizer_group.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_utils.py to test_utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_lora.py to test_lora.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_layers.py to test_layers.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_mixtral.py to test_mixtral.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_chatglm3.py to test_chatglm3.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_lora_manager.py to test_lora_manager.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/conftest.py to conftest.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_llama.py to test_llama.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_worker.py to test_worker.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_punica.py to test_punica.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_baichuan.py to test_baichuan.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_layer_variation.py to test_layer_variation.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_gemma.py to test_gemma.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/utils.py to utils.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/test_block_manager.py to test_block_manager.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/test_scheduler.py to test_scheduler.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_naive_block.py to test_naive_block.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_cpu_gpu_block_allocator.py to test_cpu_gpu_block_allocator.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_common.py to test_common.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_prefix_caching_block.py to test_prefix_caching_block.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/__init__.py to __init__.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_block_table.py to test_block_table.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_block_space_manager.py to test_block_space_manager.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/core/__init__.py to __init__.cpython-310.pyc
creating stub loader for vllm/_moe_C.cpython-310-x86_64-linux-gnu.so
creating stub loader for vllm/_C.cpython-310-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/vllm/_moe_C.py to _moe_C.cpython-310.pyc
byte-compiling build/bdist.linux-x86_64/egg/vllm/_C.py to _C.cpython-310.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying vllm.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying vllm.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying vllm.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying vllm.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying vllm.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
vllm.__pycache__._C.cpython-310: module references __file__
vllm.__pycache__._moe_C.cpython-310: module references __file__
vllm.model_executor.layers.fused_moe.__pycache__.fused_moe.cpython-310: module references __file__
creating dist
creating dist/vllm-0.4.0-py3.10-linux-x86_64.egg and adding build/bdist.linux-x86_64/egg to it
removing build/bdist.linux-x86_64/egg (and everything under it)
Processing vllm-0.4.0-py3.10-linux-x86_64.egg
creating /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg
Extracting vllm-0.4.0-py3.10-linux-x86_64.egg to /usr/local/lib/python3.10/dist-packages
Adding vllm 0.4.0 to easy-install.pth file
Installed /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg
Processing dependencies for vllm==0.4.0
Searching for tiktoken==0.6.0
Best match: tiktoken 0.6.0
Adding tiktoken 0.6.0 to easy-install.pth file
detected new path ./vllm-0.4.0-py3.10-linux-x86_64.egg
Using /usr/local/lib/python3.10/dist-packages
Searching for outlines==0.0.34
Best match: outlines 0.0.34
Adding outlines 0.0.34 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for triton==2.1.0
Best match: triton 2.1.0
Adding triton 2.1.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for pynvml==11.5.0
Best match: pynvml 11.5.0
Adding pynvml 11.5.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for prometheus-client==0.20.0
Best match: prometheus-client 0.20.0
Adding prometheus-client 0.20.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for pydantic==2.6.4
Best match: pydantic 2.6.4
Adding pydantic 2.6.4 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for uvicorn==0.29.0
Best match: uvicorn 0.29.0
Adding uvicorn 0.29.0 to easy-install.pth file
Installing uvicorn script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for fastapi==0.110.0
Best match: fastapi 0.110.0
Adding fastapi 0.110.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for xformers==0.0.23.post1
Best match: xformers 0.0.23.post1
Adding xformers 0.0.23.post1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for transformers==4.39.2
Best match: transformers 4.39.2
Adding transformers 4.39.2 to easy-install.pth file
Installing transformers-cli script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for py-cpuinfo==9.0.0
Best match: py-cpuinfo 9.0.0
Adding py-cpuinfo 9.0.0 to easy-install.pth file
Installing cpuinfo script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for psutil==5.9.8
Best match: psutil 5.9.8
Adding psutil 5.9.8 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for requests==2.31.0
Best match: requests 2.31.0
Adding requests 2.31.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for torch==2.1.2
Best match: torch 2.1.2
Adding torch 2.1.2 to easy-install.pth file
Installing convert-caffe2-to-onnx script to /usr/local/bin
Installing convert-onnx-to-caffe2 script to /usr/local/bin
Installing torchrun script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for numpy==1.26.4
Best match: numpy 1.26.4
Adding numpy 1.26.4 to easy-install.pth file
Installing f2py script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for sentencepiece==0.2.0
Best match: sentencepiece 0.2.0
Adding sentencepiece 0.2.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for ray==2.10.0
Best match: ray 2.10.0
Adding ray 2.10.0 to easy-install.pth file
Installing ray script to /usr/local/bin
Installing rllib script to /usr/local/bin
Installing serve script to /usr/local/bin
Installing tune script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for ninja==1.11.1.1
Best match: ninja 1.11.1.1
Adding ninja 1.11.1.1 to easy-install.pth file
Installing ninja script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for cmake==3.29.0.1
Best match: cmake 3.29.0.1
Adding cmake 3.29.0.1 to easy-install.pth file
Installing cmake script to /usr/local/bin
Installing cpack script to /usr/local/bin
Installing ctest script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for regex==2023.12.25
Best match: regex 2023.12.25
Adding regex 2023.12.25 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for jsonschema==4.21.1
Best match: jsonschema 4.21.1
Adding jsonschema 4.21.1 to easy-install.pth file
Installing jsonschema script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for referencing==0.34.0
Best match: referencing 0.34.0
Adding referencing 0.34.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for joblib==1.3.2
Best match: joblib 1.3.2
Adding joblib 1.3.2 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for numba==0.59.1
Best match: numba 0.59.1
Adding numba 0.59.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for scipy==1.12.0
Best match: scipy 1.12.0
Adding scipy 1.12.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for diskcache==5.6.3
Best match: diskcache 5.6.3
Adding diskcache 5.6.3 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for cloudpickle==3.0.0
Best match: cloudpickle 3.0.0
Adding cloudpickle 3.0.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nest-asyncio==1.6.0
Best match: nest-asyncio 1.6.0
Adding nest-asyncio 1.6.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for lark==1.1.9
Best match: lark 1.1.9
Adding lark 1.1.9 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for Jinja2==3.1.3
Best match: Jinja2 3.1.3
Adding Jinja2 3.1.3 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for interegular==0.3.3
Best match: interegular 0.3.3
Adding interegular 0.3.3 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for filelock==3.13.3
Best match: filelock 3.13.3
Adding filelock 3.13.3 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for typing-extensions==4.10.0
Best match: typing-extensions 4.10.0
Adding typing-extensions 4.10.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for pydantic-core==2.16.3
Best match: pydantic-core 2.16.3
Adding pydantic-core 2.16.3 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for annotated-types==0.6.0
Best match: annotated-types 0.6.0
Adding annotated-types 0.6.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for websockets==12.0
Best match: websockets 12.0
Adding websockets 12.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for watchfiles==0.21.0
Best match: watchfiles 0.21.0
Adding watchfiles 0.21.0 to easy-install.pth file
Installing watchfiles script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for uvloop==0.19.0
Best match: uvloop 0.19.0
Adding uvloop 0.19.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for PyYAML==6.0.1
Best match: PyYAML 6.0.1
Adding PyYAML 6.0.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for python-dotenv==1.0.1
Best match: python-dotenv 1.0.1
Adding python-dotenv 1.0.1 to easy-install.pth file
Installing dotenv script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for httptools==0.6.1
Best match: httptools 0.6.1
Adding httptools 0.6.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for h11==0.14.0
Best match: h11 0.14.0
Adding h11 0.14.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for click==8.1.7
Best match: click 8.1.7
Adding click 8.1.7 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for starlette==0.36.3
Best match: starlette 0.36.3
Adding starlette 0.36.3 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for tqdm==4.66.2
Best match: tqdm 4.66.2
Adding tqdm 4.66.2 to easy-install.pth file
Installing tqdm script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for safetensors==0.4.2
Best match: safetensors 0.4.2
Adding safetensors 0.4.2 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for tokenizers==0.15.2
Best match: tokenizers 0.15.2
Adding tokenizers 0.15.2 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for packaging==24.0
Best match: packaging 24.0
Adding packaging 24.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for huggingface-hub==0.21.4
Best match: huggingface-hub 0.21.4
Adding huggingface-hub 0.21.4 to easy-install.pth file
Installing huggingface-cli script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for certifi==2024.2.2
Best match: certifi 2024.2.2
Adding certifi 2024.2.2 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for urllib3==2.2.1
Best match: urllib3 2.2.1
Adding urllib3 2.2.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for idna==3.6
Best match: idna 3.6
Adding idna 3.6 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for charset-normalizer==3.3.2
Best match: charset-normalizer 3.3.2
Adding charset-normalizer 3.3.2 to easy-install.pth file
Installing normalizer script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-nvtx-cu12==12.1.105
Best match: nvidia-nvtx-cu12 12.1.105
Adding nvidia-nvtx-cu12 12.1.105 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-nccl-cu12==2.18.1
Best match: nvidia-nccl-cu12 2.18.1
Adding nvidia-nccl-cu12 2.18.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-cusparse-cu12==12.1.0.106
Best match: nvidia-cusparse-cu12 12.1.0.106
Adding nvidia-cusparse-cu12 12.1.0.106 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-cusolver-cu12==11.4.5.107
Best match: nvidia-cusolver-cu12 11.4.5.107
Adding nvidia-cusolver-cu12 11.4.5.107 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-curand-cu12==10.3.2.106
Best match: nvidia-curand-cu12 10.3.2.106
Adding nvidia-curand-cu12 10.3.2.106 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-cufft-cu12==11.0.2.54
Best match: nvidia-cufft-cu12 11.0.2.54
Adding nvidia-cufft-cu12 11.0.2.54 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-cublas-cu12==12.1.3.1
Best match: nvidia-cublas-cu12 12.1.3.1
Adding nvidia-cublas-cu12 12.1.3.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-cudnn-cu12==8.9.2.26
Best match: nvidia-cudnn-cu12 8.9.2.26
Adding nvidia-cudnn-cu12 8.9.2.26 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-cuda-cupti-cu12==12.1.105
Best match: nvidia-cuda-cupti-cu12 12.1.105
Adding nvidia-cuda-cupti-cu12 12.1.105 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-cuda-runtime-cu12==12.1.105
Best match: nvidia-cuda-runtime-cu12 12.1.105
Adding nvidia-cuda-runtime-cu12 12.1.105 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-cuda-nvrtc-cu12==12.1.105
Best match: nvidia-cuda-nvrtc-cu12 12.1.105
Adding nvidia-cuda-nvrtc-cu12 12.1.105 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for fsspec==2024.3.1
Best match: fsspec 2024.3.1
Adding fsspec 2024.3.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for networkx==3.2.1
Best match: networkx 3.2.1
Adding networkx 3.2.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for sympy==1.12
Best match: sympy 1.12
Adding sympy 1.12 to easy-install.pth file
Installing isympy script to /usr/local/bin
Using /usr/local/lib/python3.10/dist-packages
Searching for frozenlist==1.4.1
Best match: frozenlist 1.4.1
Adding frozenlist 1.4.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for aiosignal==1.3.1
Best match: aiosignal 1.3.1
Adding aiosignal 1.3.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for protobuf==4.25.3
Best match: protobuf 4.25.3
Adding protobuf 4.25.3 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for msgpack==1.0.8
Best match: msgpack 1.0.8
Adding msgpack 1.0.8 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for rpds-py==0.18.0
Best match: rpds-py 0.18.0
Adding rpds-py 0.18.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for jsonschema-specifications==2023.12.1
Best match: jsonschema-specifications 2023.12.1
Adding jsonschema-specifications 2023.12.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for attrs==23.2.0
Best match: attrs 23.2.0
Adding attrs 23.2.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for llvmlite==0.42.0
Best match: llvmlite 0.42.0
Adding llvmlite 0.42.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for MarkupSafe==2.1.5
Best match: MarkupSafe 2.1.5
Adding MarkupSafe 2.1.5 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for anyio==4.3.0
Best match: anyio 4.3.0
Adding anyio 4.3.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for nvidia-nvjitlink-cu12==12.4.99
Best match: nvidia-nvjitlink-cu12 12.4.99
Adding nvidia-nvjitlink-cu12 12.4.99 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for mpmath==1.3.0
Best match: mpmath 1.3.0
Adding mpmath 1.3.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for exceptiongroup==1.2.0
Best match: exceptiongroup 1.2.0
Adding exceptiongroup 1.2.0 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Searching for sniffio==1.3.1
Best match: sniffio 1.3.1
Adding sniffio 1.3.1 to easy-install.pth file
Using /usr/local/lib/python3.10/dist-packages
Finished processing dependencies for vllm==0.4.0

The problem is weird; I couldn't figure it out. @youkaichao

youkaichao commented 3 months ago

ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.

Looks like a network problem.
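
If it is a network problem, NCCL's own logs usually show which interface or peer is failing. A minimal sketch of the knobs to flip (assumptions: in a multi-node Ray setup these variables need to be visible on every node, e.g. set in the docker run command or the Ray runtime env, not just in the driver script; "eth0" is a placeholder for whatever NIC the two nodes actually share):

    import os

    # Must be set before NCCL initializes (i.e. before the engine/workers start).
    os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL setup/transport logs
    os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # pin NCCL bootstrap to a known NIC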

njhill commented 3 months ago

@kn1011 could you try with latest main again now that https://github.com/vllm-project/vllm/pull/3770 is merged?

nilichen commented 3 months ago

Hi, I'm still running into issues with v0.4.0.post1. It includes the fix from https://github.com/vllm-project/vllm/pull/3770, so I'm no longer hitting RuntimeError: CUDA error: invalid device ordinal, but I'm now seeing a different failure (log below).

2024-04-03 20:45:37,879 INFO worker.py:1724 -- Started a local Ray instance.
[04/03/24 20:46:18] INFO     Initializing an LLM engine (v0.4.0.post1) with config: model='/opt/models/mixtral_awq_8x7b_v1', tokenizer='/opt/models/mixtral_awq_8x7b_v1', tokenizer_mode=auto, revision=None,                   llm_engine.py:74
                             tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=safetensors, tensor_parallel_size=2, disable_custom_all_reduce=False,
                             quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
[04/03/24 20:46:49] DEBUG    Loading nccl from library libnccl.so.2                                                                                                                                                                 pynccl.py:48
                    INFO     Using FlashAttention backend.                                                                                                                                                                        selector.py:16
[04/03/24 20:46:59] INFO     vLLM is using nccl==2.18.1                                                                                                                                                                       pynccl_utils.py:45
*** SIGTERM received at time=1712177260 on cpu 0 ***
PC: @     0x78e7d142c7f8  (unknown)  clock_nanosleep
    @     0x78e7d1389520  (unknown)  (unknown)
    @ ... and at least 1 more frames
[2024-04-03 20:47:40,813 E 1 1] logging.cc:361: *** SIGTERM received at time=1712177260 on cpu 0 ***
[2024-04-03 20:47:40,813 E 1 1] logging.cc:361: PC: @     0x78e7d142c7f8  (unknown)  clock_nanosleep
[2024-04-03 20:47:40,813 E 1 1] logging.cc:361:     @     0x78e7d1389520  (unknown)  (unknown)
[2024-04-03 20:47:40,813 E 1 1] logging.cc:361:     @ ... and at least 1 more frames
Exception ignored in: <function NCCLCommunicator.__del__ at 0x78e5dc367100>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 264, in __del__
    _c_ncclCommDestroy(self.comm)
                       ^^^^^^^^^
AttributeError: 'NCCLCommunicator' object has no attribute 'comm'
youkaichao commented 3 months ago

Looks like your program was killed by SIGTERM.

nilichen commented 3 months ago

Oh I see. The SIGTERM comes first, and then _c_ncclCommDestroy(self.comm) is called during cleanup. Hmm, not sure why that is. I'll take another look!

JasmondL commented 3 months ago

Hi @youkaichao,

I've run into the same problem with an NCCL error. I'm working with vLLM version 0.4.0 and CUDA 12.1.

Here are some details about my setup: Environment:

Error message:

  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 156, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 422, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/usr/local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 62, in __init__
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 191, in _init_workers_ray
    self._run_workers("init_device")
  File "/usr/local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 324, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 100, in init_device
    init_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 287, in init_distributed_environment
    pynccl_utils.init_process_group(
  File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 46, in init_process_group
    comm = NCCLCommunicator(init_method=init_method,
  File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 236, in __init__
    dist.broadcast(tensor, src=0)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'initialization error'
(RayWorkerVllm pid=42738) Exception ignored in: <function NCCLCommunicator.__del__ at 0x7fbbbdcd60e0>
(RayWorkerVllm pid=42738) Traceback (most recent call last):
(RayWorkerVllm pid=42738)   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 264, in __del__
(RayWorkerVllm pid=42738)     _c_ncclCommDestroy(self.comm)
(RayWorkerVllm pid=42738) AttributeError: 'NCCLCommunicator' object has no attribute 'comm'
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/engine/ray_utils.py", line 37, in execute_method
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]     return executor(*args, **kwargs)
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 100, in init_device
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]     init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 287, in init_distributed_environment
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]     pynccl_utils.init_process_group(
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 46, in init_process_group
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]     comm = NCCLCommunicator(init_method=init_method,
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 236, in __init__
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]     dist.broadcast(tensor, src=0)
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]     return func(*args, **kwargs)
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44]     work = default_pg.broadcast([tensor], opts)
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44] ncclUnhandledCudaError: Call to CUDA function failed.
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44] Last error:
(RayWorkerVllm pid=42738) ERROR 04-05 09:23:35 ray_utils.py:44] Cuda failure 'initialization error'
(RayWorkerVllm pid=43542) INFO 04-05 09:23:32 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance. [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=43542) INFO 04-05 09:23:32 selector.py:21] Using XFormers backend. [repeated 6x across cluster]
(RayWorkerVllm pid=43542) INFO 04-05 09:23:34 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44] Error executing method init_device. This might cause deadlock in distributed execution. [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44] Traceback (most recent call last): [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/engine/ray_utils.py", line 37, in execute_method [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]     return executor(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 100, in init_device [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]     init_distributed_environment(self.parallel_config, self.rank, [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 287, in init_distributed_environment [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]     pynccl_utils.init_process_group( [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 46, in init_process_group [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]     comm = NCCLCommunicator(init_method=init_method, [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 236, in __init__ [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]     dist.broadcast(tensor, src=0) [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]     return func(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]   File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44]     work = default_pg.broadcast([tensor], opts) [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1 [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44] ncclUnhandledCudaError: Call to CUDA function failed. [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44] Last error: [repeated 6x across cluster]
(RayWorkerVllm pid=43542) ERROR 04-05 09:23:35 ray_utils.py:44] Cuda failure 'initialization error' [repeated 6x across cluster]
(RayWorkerVllm pid=43542) Exception ignored in: <function NCCLCommunicator.__del__ at 0x7fc5d9ef20e0> [repeated 6x across cluster]
(RayWorkerVllm pid=43542) Traceback (most recent call last): [repeated 6x across cluster]
(RayWorkerVllm pid=43542)   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 264, in __del__ [repeated 6x across cluster]
(RayWorkerVllm pid=43542)     _c_ncclCommDestroy(self.comm) [repeated 6x across cluster]
(RayWorkerVllm pid=43542) AttributeError: 'NCCLCommunicator' object has no attribute 'comm' [repeated 6x across cluster]
vllm-mistral-8x7b-v01-5b8f6cd668-b5xs6:32838:32838 [0] NCCL INFO comm 0xa602f30 rank 0 nranks 8 cudaDev 0 busId 160 - Abort COMPLETE
Exception ignored in: <function NCCLCommunicator.__del__ at 0x7f06cdb8b0a0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 264, in __del__
    _c_ncclCommDestroy(self.comm)
AttributeError: 'NCCLCommunicator' object has no attribute 'comm'
youkaichao commented 3 months ago

@JasmondL can you try to build from source with the latest main? I think your problem should be fixed by https://github.com/vllm-project/vllm/pull/3860 .

scottsuk0306 commented 3 months ago

@JasmondL I was able to resolve the error by permuting CUDA_VISIBLE_DEVICES: when I changed my command from CUDA_VISIBLE_DEVICES=7,8,9,10 python -m ... to CUDA_VISIBLE_DEVICES=10,9,8,7 python -m ..., it worked without that NCCLCommunicator error.

This seems to be related to https://github.com/pytorch/pytorch/issues/113245#issuecomment-1909409587.
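
For context, a minimal sketch of why the ordering and the ordinal indexing matter (the GPU IDs are placeholders): inside a process, CUDA re-indexes the visible devices from 0, so code that refers to a physical ordinal instead of the re-indexed one is exactly what raises "invalid device ordinal".

    import os

    # Must be set before CUDA is initialized (i.e. before the first torch.cuda call).
    os.environ["CUDA_VISIBLE_DEVICES"] = "7,8,9,10"  # placeholder physical GPU IDs

    import torch

    print(torch.cuda.device_count())  # 4 visible devices, re-indexed as 0..3
    torch.cuda.set_device(3)          # valid: maps to physical GPU 10
    # torch.cuda.set_device(7)        # would raise: CUDA error: invalid device ordinal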

yudian0504 commented 3 months ago

same

ljdavns commented 3 months ago

same issue on Tesla T4 GPU with v0.4.0.post1

Corleno commented 3 months ago

@njhill, @youkaichao, I have the same issue with v0.4.0.post1 built from the latest mainline (04/14/2024) source with CUDA 11.8 (NVIDIA), and it reports the same error as in #3770. Basically, each Ray process can only see one GPU in total instead of the true total GPU count (torch.cuda.device_count() is always 1). Pinning Ray to 2.9.3 (https://github.com/vllm-project/vllm/pull/3699), instead of using 2.10.0 as https://github.com/vllm-project/vllm/pull/3770 does, did not work for me either. Reverting to https://github.com/vllm-project/vllm/commit/0ce0539d4750f9ebcd9b19d7085ca3b934b9ec67 (04/07/2024) resolves this device initialization issue.
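
As a rough diagnostic (not vLLM's own code; num_gpus=2 below is an assumption, use whatever you expect per worker), a Ray task can report what it actually sees, to check whether the GPUs Ray assigns match what torch reports:

    import ray


    @ray.remote(num_gpus=2)
    def report():
        import os
        import torch
        # Ray sets CUDA_VISIBLE_DEVICES for the task based on its GPU assignment;
        # torch should then count exactly those devices.
        return {
            "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
            "torch.cuda.device_count()": torch.cuda.device_count(),
        }


    ray.init(address="auto")  # attach to the existing Ray cluster
    print(ray.get(report.remote()))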

hmellor commented 3 months ago

@kn1011 are you still experiencing this error?

nilichen commented 3 months ago

I got it working now!

Corleno commented 3 months ago

@nilichen do you know if the latest mainline has fixed this issue?

hmellor commented 3 months ago

@nilichen what did you do to get it working?

nilichen commented 2 months ago

Note, though, that I'm doing something a bit different and initializing the generator myself. The error [RuntimeError: CUDA error: invalid device ordinal with multi node multi gpus](https://github.com/vllm-project/vllm/issues/3722#top) is gone after updating to v0.4.0.post1. However, I then ran into SIGTERM for unknown reasons, and I managed to make it work by setting both engine_use_ray and worker_use_ray to True when initializing the generator. worker_use_ray = True makes sense to me, but it's not obvious from the docs that engine_use_ray = True is also needed for distributed execution.

worker_use_ray – Whether to use Ray for model workers. Required for distributed execution. Should be the same as parallel_config.worker_use_ray.
engine_use_ray – Whether to make LLMEngine a Ray actor. If so, the async frontend will be executed in a separate process as the model workers.

    generator = VLLMGenerator(
        engine=AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                # Run both the engine and the workers as Ray actors whenever
                # more than one GPU is used; both flags were needed here to
                # avoid the SIGTERM during initialization.
                engine_use_ray=gpu_count > 1,
                worker_use_ray=gpu_count > 1,
                disable_log_requests=True,
                disable_log_stats=False,
                dtype=torch.float16,
                max_model_len=8192,
                tensor_parallel_size=gpu_count,
                load_format="safetensors",
                model=model_artifact.download_path(),
            )
        )
    )
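
For completeness, a hedged usage sketch of the engine built above, assuming the v0.4.0-era AsyncLLMEngine.generate API; the prompt, request_id, and sampling settings are placeholders, and VLLMGenerator/generator is my own wrapper:

    import asyncio

    from vllm import SamplingParams


    async def demo(engine):
        params = SamplingParams(temperature=0.7, max_tokens=64)
        final = None
        # AsyncLLMEngine.generate is an async generator that streams
        # RequestOutput objects; the last one carries the finished text.
        async for output in engine.generate("Hello, my name is", params,
                                            request_id="demo-0"):
            final = output
        return final.outputs[0].text

    # text = asyncio.run(demo(generator.engine))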