vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Assertion `parentOp->getNumRegions() == 1 && parentOp->getRegion(0).getBlocks().size() == 1' failed #3732

Closed. NavinKumarMNK closed this issue 7 months ago.

NavinKumarMNK commented 7 months ago

Your current environment

root@0fca177ad2d4:/workspace# python3 collect_env.py 
Collecting environment information...
PyTorch version: 2.1.2
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (ppc64le)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 16:04:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-100-generic-ppc64le-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_train.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_train.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

CPU:
Architecture:                       ppc64le
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Model name:                         POWER9, altivec supported
Model:                              2.2 (pvr 004e 1202)
Thread(s) per core:                 4
Core(s) per socket:                 16
Socket(s):                          2
Frequency boost:                    enabled
CPU max MHz:                        3800.0000
CPU min MHz:                        2300.0000
L1d cache:                          1 MiB (32 instances)
L1i cache:                          1 MiB (32 instances)
L2 cache:                           8 MiB (16 instances)
L3 cache:                           160 MiB (16 instances)
NUMA node(s):                       6
NUMA node0 CPU(s):                  0-63
NUMA node8 CPU(s):                  64-127
NUMA node252 CPU(s):                
NUMA node253 CPU(s):                
NUMA node254 CPU(s):                
NUMA node255 CPU(s):                
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Mitigation; RFI Flush, L1D private per thread
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Mitigation; RFI Flush, L1D private per thread
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Vulnerability Spectre v2:           Mitigation; Indirect branch serialisation (kernel only)
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.1.2
[conda] cudatoolkit               11.8.0              hedcfb66_13    conda-forge
[conda] libmagma                  2.7.2                he288b6c_2    conda-forge
[conda] libmagma_sparse           2.7.2                h5b5c57a_3    conda-forge
[conda] magma                     2.7.2                h097a1ca_3    conda-forge
[conda] numpy                     1.24.3          py310h87cc683_0  
[conda] numpy-base                1.24.3          py310hac71eb6_0  
[conda] torch                     2.1.2                     dev_0    <develop>
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.3
vLLM Build Flags:
CUDA Archs: 7.0; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0     X    NV3   SYS   SYS   0-63          0              N/A
GPU1    NV3    X    SYS   SYS   0-63          0              N/A
GPU2    SYS   SYS    X    NV3   64-127        8              N/A
GPU3    SYS   SYS   NV3    X    64-127        8              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

๐Ÿ› Describe the bug

example.py (I loaded the Mixtral-8x7B-Instruct FP16 model):

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="./models", 
    dtype="float16", 
    tensor_parallel_size=4, 
    enforce_eager=True, 
    trust_remote_code=True, 
    load_format='safetensors',
    # quantization="AWQ",
)
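
For completeness, the rest of example.py is presumably just the usual generate-and-print loop, sketched here from memory; it is never reached, because the crash below happens inside the LLM(...) constructor during engine initialization:

# Hypothetical tail of example.py (not part of the snippet above);
# never executed here, since the failure occurs while constructing LLM(...).
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
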
root@0fca177ad2d4:/workspace# python3 example.py 
WARNING 03-29 15:24:46 config.py:686] Casting torch.bfloat16 to torch.float16.
2024-03-29 15:24:48,678 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 03-29 15:24:52 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='./models', tokenizer='./models', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=safetensors, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-29 15:25:07 attention.py:67] flash_attn is not found. Using xformers backend.
(RayWorkerVllm pid=37294) INFO 03-29 15:25:07 attention.py:67] flash_attn is not found. Using xformers backend.
INFO 03-29 15:25:27 model_runner.py:97] Loading model weights took 21.7573 GB
(RayWorkerVllm pid=37294) INFO 03-29 15:25:39 model_runner.py:97] Loading model weights took 21.7573 GB
(RayWorkerVllm pid=37345) INFO 03-29 15:25:07 attention.py:67] flash_attn is not found. Using xformers backend. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
python3: /root/llvm-project/mlir/lib/Analysis/SliceAnalysis.cpp:106: void getBackwardSliceImpl(mlir::Operation *, SetVector<mlir::Operation *> *, mlir::TransitiveFilter): Assertion `parentOp->getNumRegions() == 1 && parentOp->getRegion(0).getBlocks().size() == 1' failed.
*** SIGABRT received at time=1711725941 on cpu 45 ***
PC: @     0x7e79d800866c  (unknown)  pthread_kill
    @     0x7e7143545984  613083184  absl::lts_20220623::AbslFailureSignalHandler()
    @     0x7e79d800870c        224  pthread_kill
    @     0x7e79d7fa1dfc         48  raise
    @     0x7e79d7f7d260        336  abort
    @     0x7e79d7f94ef0        192  (unknown)
    @     0x7e79d7f94f94         64  __assert_fail
    @     0x7e755cc4c8b8        112  getBackwardSliceImpl()
    @     0x7e755cc4c6f0        112  getBackwardSliceImpl()
    @     0x7e755cc4c5a8         64  mlir::getBackwardSlice()
    @     0x7e755c78bc10        384  mlir::multiRootGetSlice()
    @     0x7e755b235e7c        608  CoalescePass::getCoalescedEncoding()
    @     0x7e755b2375d8        256  CoalescePass::runOnOperation()::{lambda()#1}::operator()()
    @     0x7e755b238be0        480  mlir::detail::walk<>()
    @     0x7e755b238eac        320  CoalescePass::runOnOperation()
    @     0x7e755bd83c54        416  mlir::detail::OpToOpPassAdaptor::run()
    @     0x7e755bd84650        160  mlir::detail::OpToOpPassAdaptor::runPipeline()
    @     0x7e755bd878b0        368  mlir::PassManager::run()
    @     0x7e75599cf9cc        128  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
    @     0x7e75599bacd4        848  pybind11::cpp_function::dispatcher()
    @      0x18e02ca5e40        112  cfunction_call
    @      0x18e02a3bc2c        160  _PyObject_MakeTpCall
    @      0x18e02c80134        160  method_vectorcall
    @      0x18e02a25738        480  _PyEval_EvalFrameDefault
    @      0x18e02b2a974         64  _PyEval_Vector
    @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
    @      0x18e02a22b6c        480  _PyEval_EvalFrameDefault
    @      0x18e02b2a974         64  _PyEval_Vector
    @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
    @      0x18e02a22b6c        480  _PyEval_EvalFrameDefault
    @      0x18e02b2a974         64  _PyEval_Vector
    @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
    @      0x18e02a22224        480  _PyEval_EvalFrameDefault
    @ ... and at least 196 more frames
[2024-03-29 15:25:41,577 E 30164 30164] logging.cc:361: *** SIGABRT received at time=1711725941 on cpu 45 ***
[2024-03-29 15:25:41,577 E 30164 30164] logging.cc:361: PC: @     0x7e79d800866c  (unknown)  pthread_kill
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e71435459b8  613083184  absl::lts_20220623::AbslFailureSignalHandler()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d800870c        224  pthread_kill
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d7fa1dfc         48  raise
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d7f7d260        336  abort
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d7f94ef0        192  (unknown)
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d7f94f94         64  __assert_fail
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755cc4c8b8        112  getBackwardSliceImpl()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755cc4c6f0        112  getBackwardSliceImpl()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755cc4c5a8         64  mlir::getBackwardSlice()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755c78bc10        384  mlir::multiRootGetSlice()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755b235e7c        608  CoalescePass::getCoalescedEncoding()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755b2375d8        256  CoalescePass::runOnOperation()::{lambda()#1}::operator()()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755b238be0        480  mlir::detail::walk<>()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755b238eac        320  CoalescePass::runOnOperation()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755bd83c54        416  mlir::detail::OpToOpPassAdaptor::run()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755bd84650        160  mlir::detail::OpToOpPassAdaptor::runPipeline()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755bd878b0        368  mlir::PassManager::run()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e75599cf9cc        128  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e75599bacd4        848  pybind11::cpp_function::dispatcher()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02ca5e40        112  cfunction_call
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02a3bc2c        160  _PyObject_MakeTpCall
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02c80134        160  method_vectorcall
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02a25738        480  _PyEval_EvalFrameDefault
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02b2a974         64  _PyEval_Vector
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a22b6c        480  _PyEval_EvalFrameDefault
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02b2a974         64  _PyEval_Vector
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a22b6c        480  _PyEval_EvalFrameDefault
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02b2a974         64  _PyEval_Vector
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a22224        480  _PyEval_EvalFrameDefault
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @ ... and at least 196 more frames
Fatal Python error: Aborted

Stack (most recent call first):
  File "/root/triton/python/triton/compiler/compiler.py", line 91 in optimize_ttgir
  File "/root/triton/python/triton/compiler/compiler.py", line 383 in <lambda>
  File "/root/triton/python/triton/compiler/compiler.py", line 476 in compile
  File "<string>", line 63 in fused_moe_kernel
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/layers/fused_moe/fused_moe.py", line 222 in invoke_fused_moe_kernel
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/layers/fused_moe/fused_moe.py", line 397 in fused_moe
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/models/mixtral.py", line 131 in forward
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/models/mixtral.py", line 278 in forward
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/models/mixtral.py", line 319 in forward
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/models/mixtral.py", line 383 in forward
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/model_runner.py", line 606 in execute_model
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/model_runner.py", line 677 in profile_run
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 122 in profile_num_available_blocks
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 318 in _run_workers
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 221 in _init_cache
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 63 in __init__
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/llm_engine.py", line 103 in __init__
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/llm_engine.py", line 146 in from_engine_args
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/entrypoints/llm.py", line 109 in __init__
  File "/workspace/example.py", line 10 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google.protobuf.pyext._message, setproctitle, uvloop.loop, ray._raylet, grpc._cython.cygrpc, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, cupy_backends.cuda.api._runtime_enum, cupy_backends.cuda.api.runtime, cupy_backends.cuda.stream, cupy_backends.cuda.libs.cublas, cupy_backends.cuda.libs.cusolver, cupy_backends.cuda._softlink, cupy_backends.cuda.libs.cusparse, cupy._util, cupy.cuda.device, fastrlock.rlock, cupy.cuda.memory_hook, cupy.cuda.graph, cupy.cuda.stream, cupy_backends.cuda.api._driver_enum, cupy_backends.cuda.api.driver, cupy.cuda.memory, cupy._core.internal, cupy._core._carray, cupy.cuda.texture, cupy.cuda.function, cupy_backends.cuda.libs.nvrtc, cupy.cuda.jitify, cupy.cuda.pinned_memory, cupy_backends.cuda.libs.curand, cupy_backends.cuda.libs.profiler, cupy.cuda.common, cupy.cuda.cub, cupy_backends.cuda.libs.nvtx, cupy.cuda.thrust, cupy._core._dtype, cupy._core._scalar, cupy._core._accelerator, cupy._core._memory_range, cupy._core._fusion_thread_local, cupy._core._kernel, cupy._core._routines_manipulation, cupy._core._optimize_config, cupy._core._cub_reduction, cupy._core._reduction, cupy._core._routines_binary, cupy._core._routines_math, cupy._core._routines_indexing, cupy._core._routines_linalg, cupy._core._routines_logic, cupy._core._routines_sorting, cupy._core._routines_statistics, cupy._core.dlpack, cupy._core.flags, cupy._core.core, cupy._core._fusion_variable, cupy._core._fusion_trace, cupy._core._fusion_kernel, cupy._core.new_fusion, cupy._core.fusion, cupy._core.raw, cupyx.cusolver, scipy._lib._ccallback_c, numpy.linalg.lapack_lite, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, cupy.cuda.cufft, cupy.fft._cache, cupy.fft._callback, cupy.random._generator_api, cupy.random._bit_generator, scipy._lib._uarray._uarray, 
scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, cupy.lib._polynomial, cupy_backends.cuda.libs.nccl, zstandard.backend_c, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct (total: 182)
Aborted (core dumped)

Thank you. Let me know if I can provide any more details. I was able to load and serve the Mistral-7B-Instruct FP16 model successfully, but I couldn't even load the Mixtral model.

youkaichao commented 7 months ago

Did you build MLIR yourself? The error points into /root/llvm-project/mlir/lib/Analysis/SliceAnalysis.cpp, and vLLM users typically never need to touch MLIR C++ source files.

NavinKumarMNK commented 7 months ago

Yes, I built it myself. LLVM checkout c5dede880d175f7229c9b2923f4753e12702305d, built with:

RUN cmake -G Ninja ../llvm \
   -DLLVM_ENABLE_PROJECTS="mlir;llvm" \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="PowerPC;NVPTX;X86;AMDGPU;RISCV" \
   -DMLIR_ENABLE_CUDA_RUNNER=ON \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DCMAKE_C_COMPILER=clang \
   -DCMAKE_CXX_COMPILER=clang++ \
   -DLLVM_ENABLE_RTTI=ON \
   -DLLVM_INSTALL_UTILS=ON \
   -DMLIR_INCLUDE_INTEGRATION_TESTS=ON

youkaichao commented 7 months ago

There are heavy binary dependencies here: PyTorch pins a Triton commit, Triton pins an MLIR commit, and vLLM pins a PyTorch release version.

I don't think you can use a custom build of MLIR; it can break at any time.
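
If it helps, here is a quick sanity check of which builds are actually being imported; it relies only on the standard __version__ and __file__ attributes. On a stock x86_64 install, torch 2.1.2 would pull in triton 2.1.0; a custom ppc64le build may report the same version string while linking against a different MLIR/LLVM:

# Print the versions and install locations of the pinned stack.
import torch
import triton

print("torch :", torch.__version__, "->", torch.__file__)
print("triton:", triton.__version__, "->", triton.__file__)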

NavinKumarMNK commented 7 months ago

I rectified as much as I could. If it's not solvable in a straightforward way, can you point me to where I should start, or suggest any ideas for approaching this? I'm still unclear why the Mistral model runs fine but Mixtral doesn't, and why the error happens while loading the model. Also, how much memory is needed to load the Mixtral model? I'm using 4x V100 32 GB.

youkaichao commented 7 months ago

If you can use publicly released pytorch/triton/mlir and build vllm from source, it should work.

NavinKumarMNK commented 7 months ago

Actually, for ppc64le and the Mixtral model there is no direct way to do that, and there is no public Triton release for ppc64le. Can you confirm the memory needed to load and run inference with the Mixtral model?

I found an LLVM issue where the error is almost identical to mine. I hope this bug has nothing to do with vLLM; my LLVM build commit already includes the fix for that bug, though.

youkaichao commented 7 months ago

It's hard to give a specific number for the memory requirement, but people typically use 8 GPUs to run Mixtral-8x7B models.
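
That said, a rough back-of-envelope check (my own numbers, not an official requirement) using the figures already in your log suggests the weights themselves do fit on your setup:

# Back-of-envelope estimate from the log above:
# 21.7573 GB of FP16 weights per tensor-parallel worker, 4 x 32 GB V100s.
weights_per_gpu_gb = 21.7573
gpu_mem_gb = 32
num_gpus = 4

total_weights_gb = weights_per_gpu_gb * num_gpus       # ~87 GB of weights in total
headroom_per_gpu_gb = gpu_mem_gb - weights_per_gpu_gb  # ~10 GB left per GPU

print(f"total FP16 weights: {total_weights_gb:.1f} GB")
print(f"headroom per GPU for KV cache/activations: {headroom_per_gpu_gb:.1f} GB")

So the weights fit with roughly 10 GB of headroom per GPU, which is consistent with the fact that loading succeeded and the crash happened later, during Triton compilation of the MoE kernel, rather than as an out-of-memory error.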

NavinKumarMNK commented 7 months ago

Even just to load the model? Since I'm using 4 GPUs with 32 GB each, wouldn't that be the problem?

NavinKumarMNK commented 7 months ago

perlthoughts/Mistral-7B-Instruct-v0.2-2x7B-MoE is supported by vLLM, right? If so, I'll test with this model, which should make it clearer whether the problem is MoE itself or the size of the model. (Note: Mistral-7B runs fine in my setup.)
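
The test I have in mind is just the earlier script with the model swapped out (a sketch, not a verified-working config; same dtype and tensor_parallel_size as before):

from vllm import LLM

# Same settings as the Mixtral run above, only the model name is changed.
llm = LLM(
    model="perlthoughts/Mistral-7B-Instruct-v0.2-2x7B-MoE",
    dtype="float16",
    tensor_parallel_size=4,
    enforce_eager=True,
    trust_remote_code=True,
)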

youkaichao commented 7 months ago

I'm not sure; your situation is quite complicated. It's worth trying 8 GPUs.

My suggestion would be to find all the commits used for building the public releases and build them yourself. That's tedious, but it's the only way if they don't publish releases for your ppc64le architecture.

NavinKumarMNK commented 7 months ago

The problem is I don't have 8 GPUs. Alright, thanks; I'll get back to you if I hit the same issue with a smaller LLM (if one is supported).

NavinKumarMNK commented 7 months ago

I get the same error when running the perlthoughts/Mistral-7B-Instruct-v0.2-2x7B-MoE model, so this is not a memory issue.

NavinKumarMNK commented 7 months ago

Is there a specific MLIR/LLVM commit that is known to work well with vLLM?
(vLLM supports triton==2.1.0, right? I built it from source from the public release, using the pinned LLVM commit I found in one of the issues on the Triton GitHub.)

NavinKumarMNK commented 7 months ago

Update:

root@0fca177ad2d4:/workspace# python3 example.py 
WARNING 03-29 18:49:27 config.py:686] Casting torch.bfloat16 to torch.float16.
2024-03-29 18:49:29,751 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 03-29 18:49:32 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='./yi-34b', tokenizer='./yi-34b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=safetensors, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-29 18:49:48 attention.py:67] flash_attn is not found. Using xformers backend.
(RayWorkerVllm pid=14661) INFO 03-29 18:49:48 attention.py:67] flash_attn is not found. Using xformers backend.
INFO 03-29 18:50:03 model_runner.py:97] Loading model weights took 16.0451 GB
(RayWorkerVllm pid=14661) INFO 03-29 18:50:13 model_runner.py:97] Loading model weights took 16.0451 GB
(RayWorkerVllm pid=14763) INFO 03-29 18:49:48 attention.py:67] flash_attn is not found. Using xformers backend. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
INFO 03-29 18:50:23 ray_gpu_executor.py:234] # GPU blocks: 11710, # CPU blocks: 4369
(RayWorkerVllm pid=14712) INFO 03-29 18:50:13 model_runner.py:97] Loading model weights took 16.0451 GB [repeated 2x across cluster]
root@0fca177ad2d4:/workspace# nano example.py 
root@0fca177ad2d4:/workspace# python3 example.py 
WARNING 03-29 18:50:46 config.py:686] Casting torch.bfloat16 to torch.float16.
2024-03-29 18:50:48,751 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 03-29 18:50:51 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='./yi-34b', tokenizer='./yi-34b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=safetensors, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-29 18:51:07 attention.py:67] flash_attn is not found. Using xformers backend.
(RayWorkerVllm pid=22176) INFO 03-29 18:51:07 attention.py:67] flash_attn is not found. Using xformers backend.
INFO 03-29 18:51:26 model_runner.py:97] Loading model weights took 16.0451 GB
(RayWorkerVllm pid=22227) INFO 03-29 18:51:56 model_runner.py:97] Loading model weights took 16.0451 GB
(RayWorkerVllm pid=22278) INFO 03-29 18:51:07 attention.py:67] flash_attn is not found. Using xformers backend. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=22176) INFO 03-29 18:52:06 model_runner.py:97] Loading model weights took 16.0451 GB [repeated 2x across cluster]
INFO 03-29 18:52:15 ray_gpu_executor.py:234] # GPU blocks: 11710, # CPU blocks: 4369
Processed prompts: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 4/4 [00:01<00:00,  3.30it/s]
Prompt: 'Hello, my name is', Generated text: " Adam and I'm from Germany. I'm 30 years old"
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States, indirectly elected to'
Prompt: 'The capital of France is', Generated text: ' Paris.\nๅฏน๏ผŒๆณ•ๅ›ฝ็š„้ฆ–้ƒฝ็กฎๅฎžๆ˜ฏๅทด้ปŽใ€‚ๅทด้ปŽๆ˜ฏๆณ•ๅ›ฝ็š„ๆ”ฟๆฒป'
Prompt: 'The future of AI is', Generated text: ' still being written, and as technology continues to evolve, we can expect AI to'
root@0fca177ad2d4:/workspace# 

Works fine with the Yi-34B model, so this is a problem with MoE.

youkaichao commented 7 months ago

Yeah, probably because we have a Triton kernel for MoE, and that kernel triggered some bug in your custom-built Triton and MLIR.

NavinKumarMNK commented 7 months ago

Can you tell me more about which version that kernel needs? I'll try building the specific version from source. Are there any other MoE models that vLLM supports?

youkaichao commented 7 months ago

I think you can find the Triton commit for the public versions in their repo, e.g. https://github.com/openai/triton/tree/v2.1.0. But be careful: they might pin an LLVM commit.

NavinKumarMNK commented 7 months ago

This is the same commit I built from source, and I used the pinned LLVM commit. my-triton-fork is forked from the same release, with some of my automation scripts added.

youkaichao commented 7 months ago

How about testing it on an x86_64 machine first? Maybe this is a ppc64le-related problem.

NavinKumarMNK commented 7 months ago

Alright, I'll let you know about this as soon as possible; I don't have large GPUs attached to an x86_64 machine. Is it possible to run the MoE kernel alone on an x86_64 machine?
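
Something like the sketch below is what I have in mind for exercising only the kernel, written against the fused_moe() referenced in the traceback above (vllm/model_executor/layers/fused_moe/fused_moe.py in v0.3.3); the argument order and the Mixtral-like shapes are my assumptions, so they should be checked against that file:

# Rough standalone driver for vLLM's fused MoE Triton kernel (v0.3.3).
# Shapes are Mixtral-8x7B-like guesses: hidden=4096, intermediate=14336,
# 8 experts, top-2 routing. Verify the signature in fused_moe.py first.
import torch
from vllm.model_executor.layers.fused_moe.fused_moe import fused_moe

num_tokens, hidden, inter, experts, topk = 16, 4096, 14336, 8, 2
dev, dt = "cuda", torch.float16

x = torch.randn(num_tokens, hidden, device=dev, dtype=dt)
w1 = torch.randn(experts, 2 * inter, hidden, device=dev, dtype=dt)  # gate + up proj
w2 = torch.randn(experts, hidden, inter, device=dev, dtype=dt)      # down proj
gating = torch.randn(num_tokens, experts, device=dev, dtype=dt)

out = fused_moe(x, w1, w2, gating, topk, renormalize=True)
print(out.shape)  # expected: (num_tokens, hidden)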

jlebar commented 7 months ago

This is not a bug in vllm; it's a bug in Triton.

NavinKumarMNK commented 7 months ago

Alright! Thanks, I'm closing the issue.

youkaichao commented 7 months ago

@jlebar thanks for coming here to point it out! It's strange that we only hit this bug now; the Triton MoE kernel has been used by many users without problems.