Open · nickandbro opened this issue 4 months ago
Conversion from/to f8e4m3nv is only supported on compute capability >= 90
The L40S has compute capability 8.9 (sm_89), so you need an H100 (compute capability 9.0) for inference with FP8 models.
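For reference, a quick way to confirm which compute capability your GPUs actually report is a couple of standard `torch.cuda` calls (nothing vLLM-specific); the FP8 conversion the error refers to needs 9.0 or newer:

```python
# Print each visible GPU's compute capability and whether it meets the
# >= sm_90 requirement for the f8e4m3 conversion mentioned in the error.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    ok = (major, minor) >= (9, 0)
    print(f"GPU {i}: {name} -> sm_{major}{minor}, FP8 conversion supported: {ok}")
```

An L40S prints sm_89 here, which is why the check fails.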
Thanks!
@nickandbro you can try uninstalling your Triton and using the Triton nightly; directions are here: https://github.com/triton-lang/triton?tab=readme-ov-file#quick-installation
Currently, the Triton 2.3 that we require (due to PyTorch) does not support this conversion on Ada Lovelace, but future releases will.
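A quick way to check which Triton the environment actually resolves to, before and after swapping it for the nightly (just package metadata, nothing vLLM-specific):

```python
# Confirm which Triton and PyTorch versions are installed in the environment
# vLLM runs in; re-run after upgrading Triton to verify the swap took effect.
import torch
import triton

print("torch :", torch.__version__)
print("triton:", triton.__version__)
```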
@mgoin Thanks! I'm kinda new to Triton; is that a custom kernel that sits on top of CUDA that vLLM uses? If so, I believe all I need to do is swap out the building of the kernel with the nightly, using this: https://llvm.org/docs/CMake.html
If I could do FP8 on my own Ada hardware, that would be legendary.
Triton isn't a custom kernel in itself, but a library for JIT-compiling kernels at runtime. So all you need to do is upgrade the Python package that is installed. After installing vLLM, try uninstalling triton and installing a newer version or the nightly to see if they have resolved this issue.
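If it helps, a small standalone smoke test (the standard vector-add example from the Triton tutorials, nothing vLLM-specific) can confirm the JIT path works on your GPU after swapping the package:

```python
# Minimal Triton JIT smoke test: compile and launch a vector-add kernel.
# If this prints True, the Triton compiler itself is working on this GPU.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


n = 4096
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(n, 1024),)
add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
print(torch.allclose(out, x + y))
```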
@mgoin I'm getting the same error, but with Mixtral 8x7B in FP16 on 8x L4 GPUs. I also tried installing Triton from source, but that didn't work either.
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
INFO 08-24 11:03:27 api_server.py:339] vLLM API server version 0.5.4
INFO 08-24 11:03:27 api_server.py:340] args: Namespace(model_tag='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', host='0.0.0.0', port=1234, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=32, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['mixtral-8x7b-instruct-v0.1'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fe5bd9216c0>)
WARNING 08-24 11:03:27 config.py:1454] Casting torch.bfloat16 to torch.float16.
2024-08-24 11:03:43,771 INFO worker.py:1772 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 08-24 11:03:51 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=mixtral-8x7b-instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-24 11:03:51 ray_gpu_executor.py:117] use_ray_spmd_worker: False
INFO 08-24 11:03:51 ray_gpu_executor.py:120] driver_ip: 10.168.76.49
INFO 08-24 11:05:00 utils.py:841] Found nccl from library libnccl.so.2
INFO 08-24 11:05:00 pynccl.py:63] vLLM is using nccl==2.22.3
(RayWorkerWrapper pid=35540) INFO 08-24 11:05:00 utils.py:841] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=35540) INFO 08-24 11:05:00 pynccl.py:63] vLLM is using nccl==2.22.3
WARNING 08-24 11:05:01 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-24 11:05:01 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fe5072ccb10>, local_subscribe_port=33399, remote_subscribe_port=None)
INFO 08-24 11:05:01 model_runner.py:720] Starting to load model /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1...
(RayWorkerWrapper pid=35540) WARNING 08-24 11:05:01 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=35540) INFO 08-24 11:05:01 model_runner.py:720] Starting to load model /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1...
Loading safetensors checkpoint shards: 0% Completed | 0/19 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 5% Completed | 1/19 [00:00<00:02, 7.54it/s]
Loading safetensors checkpoint shards: 11% Completed | 2/19 [00:00<00:02, 5.93it/s]
Loading safetensors checkpoint shards: 16% Completed | 3/19 [00:00<00:02, 5.38it/s]
Loading safetensors checkpoint shards: 21% Completed | 4/19 [00:00<00:02, 5.10it/s]
Loading safetensors checkpoint shards: 26% Completed | 5/19 [00:00<00:02, 4.99it/s]
Loading safetensors checkpoint shards: 32% Completed | 6/19 [00:01<00:02, 5.01it/s]
Loading safetensors checkpoint shards: 37% Completed | 7/19 [00:01<00:02, 4.96it/s]
Loading safetensors checkpoint shards: 42% Completed | 8/19 [00:01<00:02, 4.90it/s]
Loading safetensors checkpoint shards: 47% Completed | 9/19 [00:01<00:02, 4.89it/s]
Loading safetensors checkpoint shards: 53% Completed | 10/19 [00:01<00:01, 4.92it/s]
Loading safetensors checkpoint shards: 58% Completed | 11/19 [00:02<00:01, 4.87it/s]
Loading safetensors checkpoint shards: 63% Completed | 12/19 [00:02<00:01, 4.77it/s]
Loading safetensors checkpoint shards: 68% Completed | 13/19 [00:02<00:01, 4.78it/s]
Loading safetensors checkpoint shards: 74% Completed | 14/19 [00:02<00:01, 4.81it/s]
Loading safetensors checkpoint shards: 79% Completed | 15/19 [00:03<00:00, 4.93it/s]
Loading safetensors checkpoint shards: 84% Completed | 16/19 [00:03<00:00, 4.96it/s]
Loading safetensors checkpoint shards: 89% Completed | 17/19 [00:03<00:00, 4.96it/s]
Loading safetensors checkpoint shards: 95% Completed | 18/19 [00:03<00:00, 4.96it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:03<00:00, 4.92it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:03<00:00, 4.98it/s]
INFO 08-24 11:05:05 model_runner.py:732] Loading model weights took 10.8853 GB
(RayWorkerWrapper pid=35540) INFO 08-24 11:05:16 model_runner.py:732] Loading model weights took 10.8853 GB
(RayWorkerWrapper pid=36687) INFO 08-24 11:05:00 utils.py:841] Found nccl from library libnccl.so.2 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=36687) INFO 08-24 11:05:00 pynccl.py:63] vLLM is using nccl==2.22.3 [repeated 6x across cluster]
(RayWorkerWrapper pid=36687) WARNING 08-24 11:05:01 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 6x across cluster]
(RayWorkerWrapper pid=36687) INFO 08-24 11:05:01 model_runner.py:720] Starting to load model /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1... [repeated 6x across cluster]
(RayWorkerWrapper pid=36687) *** SIGSEGV received at time=1724497518 on cpu 99 ***
(RayWorkerWrapper pid=36687) PC: @ 0x5266a0 (unknown) (unknown)
(RayWorkerWrapper pid=36687) @ 0x7fb005816520 47342376 (unknown)
(RayWorkerWrapper pid=36687) @ 0x7fafac8ad900 (unknown) (unknown)
(RayWorkerWrapper pid=36687) @ 0x95e040 (unknown) (unknown)
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,199 E 36687 36687] logging.cc:440: *** SIGSEGV received at time=1724497518 on cpu 99 ***
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,202 E 36687 36687] logging.cc:440: PC: @ 0x5266a0 (unknown) (unknown)
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,202 E 36687 36687] logging.cc:440: @ 0x7fb005816520 47342376 (unknown)
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,205 E 36687 36687] logging.cc:440: @ 0x7fafac8ad900 (unknown) (unknown)
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,211 E 36687 36687] logging.cc:440: @ 0x95e040 (unknown) (unknown)
(RayWorkerWrapper pid=36687) Fatal Python error: Segmentation fault
(RayWorkerWrapper pid=36687)
(RayWorkerWrapper pid=36687) Stack (most recent call first):
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 223 in __init__
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1069 in call_JitFunction
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1109 in visit_Call
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in <listcomp>
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in visit_For
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 351 in visit_compound_statement
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 443 in visit_FunctionDef
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 418 in generic_visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 359 in visit_Module
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1297 in ast_to_ttir
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113 in make_ir
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276 in compile
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/runtime/jit.py", line 662 in run
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/runtime/jit.py", line 345 in <lambda>
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 246 in invoke_fused_moe_kernel
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 531 in fused_experts
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 613 in fused_moe
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 92 in forward_cuda
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/custom_op.py", line 13 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 75 in apply
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 250 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 100 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 243 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 296 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 374 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1363 in execute_model
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 940 in profile_run
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/worker.py", line 179 in determine_num_available_blocks
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 378 in execute_method
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/ray/util/tracing/tracing_helper.py", line 467 in _resume_span
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/ray/_private/function_manager.py", line 691 in actor_method_executor
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/ray/_private/worker.py", line 887 in main_loop
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/ray/_private/workers/default_worker.py", line 289 in <module>
(RayWorkerWrapper pid=36687)
(RayWorkerWrapper pid=36687) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, simplejson._speedups, uvloop.loop, ray._raylet, pvectorc, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, PIL._imaging, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, cuda_utils, __triton_launcher (total: 117)
*** SIGSEGV received at time=1724497518 on cpu 60 ***
PC: @ 0x5266a0 (unknown) (unknown)
@ 0x7fe5bdf5d520 (unknown) (unknown)
@ 0x7fe44722dc00 (unknown) (unknown)
@ 0x95e040 (unknown) (unknown)
[2024-08-24 11:05:18,355 E 11842 11842] logging.cc:440: *** SIGSEGV received at time=1724497518 on cpu 60 ***
[2024-08-24 11:05:18,359 E 11842 11842] logging.cc:440: PC: @ 0x5266a0 (unknown) (unknown)
[2024-08-24 11:05:18,359 E 11842 11842] logging.cc:440: @ 0x7fe5bdf5d520 (unknown) (unknown)
[2024-08-24 11:05:18,362 E 11842 11842] logging.cc:440: @ 0x7fe44722dc00 (unknown) (unknown)
[2024-08-24 11:05:18,369 E 11842 11842] logging.cc:440: @ 0x95e040 (unknown) (unknown)
Fatal Python error: Segmentation fault
Stack (most recent call first):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 223 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1069 in call_JitFunction
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1109 in visit_Call
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in <listcomp>
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in visit_For
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 351 in visit_compound_statement
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 443 in visit_FunctionDef
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/usr/lib/python3.11/ast.py", line 418 in generic_visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 359 in visit_Module
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1297 in ast_to_ttir
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113 in make_ir
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276 in compile
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/runtime/jit.py", line 662 in run
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/runtime/jit.py", line 345 in <lambda>
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 246 in invoke_fused_moe_kernel
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 531 in fused_experts
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 613 in fused_moe
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 92 in forward_cuda
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/custom_op.py", line 13 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 75 in apply
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 250 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 100 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 243 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 296 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 374 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1363 in execute_model
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 940 in profile_run
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/worker.py", line 179 in determine_num_available_blocks
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 378 in execute_method
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 372 in _run_workers
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362 in _initialize_kv_caches
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 552 in _init_engine
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 381 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471 in from_engine_args
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217 in run_rpc_server
File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
File "/usr/lib/python3.11/multiprocessing/context.py", line 224 in _Popen
File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 115 in build_async_engine_client
File "/usr/lib/python3.11/contextlib.py", line 204 in __aenter__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 342 in run_server
File "/usr/lib/python3.11/asyncio/events.py", line 80 in _run
File "/usr/lib/python3.11/asyncio/base_events.py", line 1909 in _run_once
File "/usr/lib/python3.11/asyncio/base_events.py", line 604 in run_forever
File "/usr/lib/python3.11/asyncio/base_events.py", line 637 in run_until_complete
File "/usr/lib/python3.11/asyncio/runners.py", line 120 in run
File "/usr/lib/python3.11/asyncio/runners.py", line 188 in run
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/scripts.py", line 30 in serve
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/scripts.py", line 149 in main
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/bin/vllm", line 8 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, simplejson._speedups, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, pvectorc, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, PIL._imaging, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups, ujson, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, grpc._cython.cygrpc, cuda_utils, __triton_launcher (total: 119)
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Mixtral works when I revert to vllm==0.2.7 (where fused_moe was not implemented yet).
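(For reference, a minimal offline repro outside the API server; a sketch only, with the model path and `tensor_parallel_size` adjusted to your setup. The segfault above happens in `profile_run`, and the offline entrypoint runs the same profiling forward pass at startup:)

```python
# Offline sanity check: exercises the same Mixtral fused_moe forward pass that
# the server runs during memory profiling, without the OpenAI frontend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # or a local checkpoint path
    tensor_parallel_size=8,
    enforce_eager=True,
)
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["[INST] Say hello in one sentence. [/INST]"], params)
print(outputs[0].outputs[0].text)
```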
@nivibilla I am able to load Mixtral FP16 (mistralai/Mixtral-8x7B-Instruct-v0.1) just fine with the latest release, vllm==0.5.5, on 8x L40S.
Output
(vllm-rel) ➜ ~ vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 8
INFO 08-27 18:09:03 api_server.py:440] vLLM API server version 0.5.5
INFO 08-27 18:09:03 api_server.py:441] args: Namespace(model_tag='mistralai/Mixtral-8x7B-Instruct-v0.1', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f9bc2da8280>)
INFO 08-27 18:09:03 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/524d4c7b-0851-4f1d-a71b-a9b8ebc199c7 for RPC Path.
INFO 08-27 18:09:03 api_server.py:161] Started engine process with PID 2193797
INFO 08-27 18:09:07 config.py:813] Defaulting to use mp for distributed inference
INFO 08-27 18:09:07 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=mistralai/Mixtral-8x7B-Instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-27 18:09:07 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 40 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-27 18:09:07 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193938) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193937) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193936) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193941) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193939) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193935) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193940) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-27 18:09:10 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fd8bc99f310>, local_subscribe_port=58501, remote_subscribe_port=None)
INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/19 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 5% Completed | 1/19 [00:00<00:02, 7.08it/s]
Loading safetensors checkpoint shards: 11% Completed | 2/19 [00:00<00:02, 7.82it/s]
Loading safetensors checkpoint shards: 16% Completed | 3/19 [00:00<00:02, 7.48it/s]
Loading safetensors checkpoint shards: 21% Completed | 4/19 [00:00<00:02, 6.95it/s]
Loading safetensors checkpoint shards: 26% Completed | 5/19 [00:00<00:02, 6.53it/s]
Loading safetensors checkpoint shards: 32% Completed | 6/19 [00:00<00:02, 5.97it/s]
Loading safetensors checkpoint shards: 37% Completed | 7/19 [00:01<00:01, 6.17it/s]
Loading safetensors checkpoint shards: 42% Completed | 8/19 [00:01<00:01, 6.11it/s]
Loading safetensors checkpoint shards: 47% Completed | 9/19 [00:01<00:01, 6.20it/s]
Loading safetensors checkpoint shards: 53% Completed | 10/19 [00:01<00:01, 6.38it/s]
Loading safetensors checkpoint shards: 58% Completed | 11/19 [00:01<00:01, 6.03it/s]
Loading safetensors checkpoint shards: 63% Completed | 12/19 [00:01<00:01, 5.85it/s]
Loading safetensors checkpoint shards: 68% Completed | 13/19 [00:02<00:01, 5.56it/s]
Loading safetensors checkpoint shards: 74% Completed | 14/19 [00:02<00:00, 5.77it/s]
Loading safetensors checkpoint shards: 79% Completed | 15/19 [00:02<00:00, 5.83it/s]
Loading safetensors checkpoint shards: 84% Completed | 16/19 [00:02<00:00, 5.99it/s]
Loading safetensors checkpoint shards: 89% Completed | 17/19 [00:02<00:00, 5.98it/s]
Loading safetensors checkpoint shards: 95% Completed | 18/19 [00:02<00:00, 5.48it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:03<00:00, 5.51it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:03<00:00, 6.01it/s]
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:15 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:15 model_runner.py:890] Loading model weights took 10.8853 GB
INFO 08-27 18:09:20 distributed_gpu_executor.py:56] # GPU blocks: 107636, # CPU blocks: 16384
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 15 secs.
INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 15 secs.
INFO 08-27 18:09:37 api_server.py:209] vLLM to use /tmp/tmp4co5wqy1 as PROMETHEUS_MULTIPROC_DIR
WARNING 08-27 18:09:37 serving_embedding.py:188] embedding_mode is False. Embedding API will not work.
INFO 08-27 18:09:37 launcher.py:20] Available routes are:
INFO 08-27 18:09:37 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 08-27 18:09:37 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 08-27 18:09:37 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-27 18:09:37 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 08-27 18:09:37 launcher.py:28] Route: /health, Methods: GET
INFO 08-27 18:09:37 launcher.py:28] Route: /tokenize, Methods: POST
INFO 08-27 18:09:37 launcher.py:28] Route: /detokenize, Methods: POST
INFO 08-27 18:09:37 launcher.py:28] Route: /v1/models, Methods: GET
INFO 08-27 18:09:37 launcher.py:28] Route: /version, Methods: GET
INFO 08-27 18:09:37 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 08-27 18:09:37 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 08-27 18:09:37 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 08-27 18:09:37 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO: Started server process [2193718]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Also, the same works for FP8 with neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8!
(vllm-rel) ➜ ~ vllm serve neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 --tensor-parallel-size 8
INFO 08-27 18:13:48 api_server.py:440] vLLM API server version 0.5.5
INFO 08-27 18:13:48 api_server.py:441] args: Namespace(model_tag='neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f8381a60280>)
INFO 08-27 18:13:48 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/18e76b0e-7b36-4b76-b091-9d9c89bd34ee for RPC Path.
INFO 08-27 18:13:48 api_server.py:161] Started engine process with PID 2195440
INFO 08-27 18:13:52 config.py:813] Defaulting to use mp for distributed inference
INFO 08-27 18:13:52 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8', speculative_config=None, tokenizer='neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-27 18:13:52 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 40 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-27 18:13:52 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:52 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:52 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:52 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:53 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:53 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:53 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:53 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195582) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195585) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195586) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195580) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195578) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195581) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195579) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-27 18:13:55 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fc240fd3880>, local_subscribe_port=59205, remote_subscribe_port=None)
INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195585) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195582) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195586) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195581) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195579) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195580) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195578) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 10% Completed | 1/10 [00:00<00:01, 6.42it/s]
Loading safetensors checkpoint shards: 20% Completed | 2/10 [00:00<00:01, 6.24it/s]
Loading safetensors checkpoint shards: 30% Completed | 3/10 [00:00<00:01, 5.96it/s]
Loading safetensors checkpoint shards: 40% Completed | 4/10 [00:00<00:01, 5.90it/s]
Loading safetensors checkpoint shards: 50% Completed | 5/10 [00:00<00:00, 5.79it/s]
Loading safetensors checkpoint shards: 60% Completed | 6/10 [00:01<00:00, 5.74it/s]
Loading safetensors checkpoint shards: 80% Completed | 8/10 [00:01<00:00, 6.50it/s]
Loading safetensors checkpoint shards: 90% Completed | 9/10 [00:01<00:00, 6.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:01<00:00, 5.91it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:01<00:00, 6.00it/s]
WARNING 08-27 18:13:57 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195580) WARNING 08-27 18:13:57 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195582) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195585) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195581) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195578) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195579) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195586) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
INFO 08-27 18:14:03 distributed_gpu_executor.py:56] # GPU blocks: 129516, # CPU blocks: 16384
(VllmWorkerProcess pid=2195578) INFO 08-27 18:14:04 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195578) INFO 08-27 18:14:04 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195582) INFO 08-27 18:14:04 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195582) INFO 08-27 18:14:04 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195586) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195586) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195581) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195581) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195579) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195579) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195585) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195585) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195580) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195580) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195581) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2195582) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2195578) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2195585) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2195580) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2195579) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2195586) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
INFO 08-27 18:14:20 api_server.py:209] vLLM to use /tmp/tmphfyfhzvv as PROMETHEUS_MULTIPROC_DIR
WARNING 08-27 18:14:20 serving_embedding.py:188] embedding_mode is False. Embedding API will not work.
INFO 08-27 18:14:20 launcher.py:20] Available routes are:
INFO 08-27 18:14:20 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 08-27 18:14:20 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 08-27 18:14:20 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-27 18:14:20 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 08-27 18:14:20 launcher.py:28] Route: /health, Methods: GET
INFO 08-27 18:14:20 launcher.py:28] Route: /tokenize, Methods: POST
INFO 08-27 18:14:20 launcher.py:28] Route: /detokenize, Methods: POST
INFO 08-27 18:14:20 launcher.py:28] Route: /v1/models, Methods: GET
INFO 08-27 18:14:20 launcher.py:28] Route: /version, Methods: GET
INFO 08-27 18:14:20 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 08-27 18:14:20 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 08-27 18:14:20 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 08-27 18:14:20 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO: Started server process [2195363]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
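For completeness, once the server reports the routes above, a quick request against /v1/chat/completions is enough to confirm that FP8 inference works end to end. A minimal sketch, assuming the default port 8000 from the log, no --api-key, and the model name as served above:

```python
# Smoke test against the running vLLM OpenAI-compatible server.
# Assumes port 8000 (as in the log) and no API key configured.
import requests

resp = requests.post(
    "http://0.0.0.0:8000/v1/chat/completions",
    json={
        "model": "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```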
@mgoin Thanks for confirming. Everything is identical between your setup and mine, apart from the fact that I am using L4 GPUs and you are using L40S. However, one difference I noted is that my cluster (I'm using Databricks g6.48x) is on a much older driver, 535, whereas you are on 555. Maybe that's the issue; I will raise it with Databricks.
That said, would it be worth adding an option to disable fused_moe for cases like this, where the driver cannot be updated so easily?
The error is happening in the Triton compiler, so it seems like an unenforced requirement from Triton. We could try to make a wrapper for this, but we would have to find the minimum driver...
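As a rough illustration of what such a guard could look like (the actual minimum driver version is exactly what still needs to be determined, so the threshold below is only a placeholder assumption):

```python
# Hypothetical preflight check before taking the Triton fused_moe path.
# MIN_DRIVER_MAJOR is a placeholder assumption, not a confirmed requirement.
import subprocess

import torch

MIN_DRIVER_MAJOR = 550  # placeholder; the real minimum would have to be verified


def driver_major_version() -> int:
    # nvidia-smi reports the installed driver version, e.g. "555.42.02"
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return int(out.strip().splitlines()[0].split(".")[0])


def fused_moe_fp8_supported() -> bool:
    major, minor = torch.cuda.get_device_capability()
    # Newer Triton releases handle FP8 conversion on Ada (8.9); older ones
    # require Hopper (9.0). A sufficiently new driver is assumed as well.
    return (major, minor) >= (8, 9) and driver_major_version() >= MIN_DRIVER_MAJOR


if not fused_moe_fp8_supported():
    print("This setup may need an unfused MoE path or a newer driver/Triton.")
```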
I tried updating the driver; it installed, but I get permission denied when I try to reboot the GPU. Databricks is such a pain. I'm basically stuck with driver 535.161.07.
@mgoin I got the same error on A10s as well, this time with DeepSeek Lite. Same place, the fused_moe kernel.
Your current environment
For setup, I am using version 0.5 and the vllm_openai target of the Dockerfile with these arguments:
🐛 Describe the bug
When I load Mixtral-8x22B-Instruct-v0.1-FP8 onto 8 L40S GPUs, it causes this error:
Any help would be much appreciated!