llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Assertion `parentOp->getNumRegions() == 1 && parentOp->getRegion(0).getBlocks().size() == 1' failed #87089

Closed NavinKumarMNK closed 7 months ago

NavinKumarMNK commented 7 months ago

Your current environment

root@0fca177ad2d4:/workspace# python3 collect_env.py 
Collecting environment information...
PyTorch version: 2.1.2
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (ppc64le)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 16:04:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-100-generic-ppc64le-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_train.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_train.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

CPU:
Architecture:                       ppc64le
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Model name:                         POWER9, altivec supported
Model:                              2.2 (pvr 004e 1202)
Thread(s) per core:                 4
Core(s) per socket:                 16
Socket(s):                          2
Frequency boost:                    enabled
CPU max MHz:                        3800.0000
CPU min MHz:                        2300.0000
L1d cache:                          1 MiB (32 instances)
L1i cache:                          1 MiB (32 instances)
L2 cache:                           8 MiB (16 instances)
L3 cache:                           160 MiB (16 instances)
NUMA node(s):                       6
NUMA node0 CPU(s):                  0-63
NUMA node8 CPU(s):                  64-127
NUMA node252 CPU(s):                
NUMA node253 CPU(s):                
NUMA node254 CPU(s):                
NUMA node255 CPU(s):                
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Mitigation; RFI Flush, L1D private per thread
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Mitigation; RFI Flush, L1D private per thread
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Vulnerability Spectre v2:           Mitigation; Indirect branch serialisation (kernel only)
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.1.2
[conda] cudatoolkit               11.8.0              hedcfb66_13    conda-forge
[conda] libmagma                  2.7.2                he288b6c_2    conda-forge
[conda] libmagma_sparse           2.7.2                h5b5c57a_3    conda-forge
[conda] magma                     2.7.2                h097a1ca_3    conda-forge
[conda] numpy                     1.24.3          py310h87cc683_0  
[conda] numpy-base                1.24.3          py310hac71eb6_0  
[conda] torch                     2.1.2                     dev_0    <develop>
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.3
vLLM Build Flags:
CUDA Archs: 7.0; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0  X  NV3 SYS SYS 0-63 0  N/A
GPU1 NV3  X  SYS SYS 0-63 0  N/A
GPU2 SYS SYS  X  NV3 64-127 8   N/A
GPU3 SYS SYS NV3  X  64-127 8   N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
$ llvm-config --version
17.0.0git

LLVM build commit: c5dede880d175f7229c9b2923f4753e12702305d

Build command

cmake -G Ninja ../llvm \
   -DLLVM_ENABLE_PROJECTS="mlir;llvm" \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="PowerPC;NVPTX;X86;AMDGPU;RISCV" \
   -DMLIR_ENABLE_CUDA_RUNNER=ON \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DCMAKE_C_COMPILER=clang \
   -DCMAKE_CXX_COMPILER=clang++ \
   -DLLVM_ENABLE_RTTI=ON \
   -DLLVM_INSTALL_UTILS=ON \
   -DMLIR_INCLUDE_INTEGRATION_TESTS=ON

Bug

example.py: I loaded the Mixtral-8x7B-Instruct FP16 model.

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="./models", 
    dtype="float16", 
    tensor_parallel_size=4, 
    enforce_eager=True, 
    trust_remote_code=True, 
    load_format='safetensors',
    # quantization="AWQ",
)
root@0fca177ad2d4:/workspace# python3 example.py 
WARNING 03-29 15:24:46 config.py:686] Casting torch.bfloat16 to torch.float16.
2024-03-29 15:24:48,678 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 03-29 15:24:52 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='./models', tokenizer='./models', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=safetensors, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-29 15:25:07 attention.py:67] flash_attn is not found. Using xformers backend.
(RayWorkerVllm pid=37294) INFO 03-29 15:25:07 attention.py:67] flash_attn is not found. Using xformers backend.
INFO 03-29 15:25:27 model_runner.py:97] Loading model weights took 21.7573 GB
(RayWorkerVllm pid=37294) INFO 03-29 15:25:39 model_runner.py:97] Loading model weights took 21.7573 GB
(RayWorkerVllm pid=37345) INFO 03-29 15:25:07 attention.py:67] flash_attn is not found. Using xformers backend. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
python3: /root/llvm-project/mlir/lib/Analysis/SliceAnalysis.cpp:106: void getBackwardSliceImpl(mlir::Operation *, SetVector<mlir::Operation *> *, mlir::TransitiveFilter): Assertion `parentOp->getNumRegions() == 1 && parentOp->getRegion(0).getBlocks().size() == 1' failed.
*** SIGABRT received at time=1711725941 on cpu 45 ***
PC: @     0x7e79d800866c  (unknown)  pthread_kill
    @     0x7e7143545984  613083184  absl::lts_20220623::AbslFailureSignalHandler()
    @     0x7e79d800870c        224  pthread_kill
    @     0x7e79d7fa1dfc         48  raise
    @     0x7e79d7f7d260        336  abort
    @     0x7e79d7f94ef0        192  (unknown)
    @     0x7e79d7f94f94         64  __assert_fail
    @     0x7e755cc4c8b8        112  getBackwardSliceImpl()
    @     0x7e755cc4c6f0        112  getBackwardSliceImpl()
    @     0x7e755cc4c5a8         64  mlir::getBackwardSlice()
    @     0x7e755c78bc10        384  mlir::multiRootGetSlice()
    @     0x7e755b235e7c        608  CoalescePass::getCoalescedEncoding()
    @     0x7e755b2375d8        256  CoalescePass::runOnOperation()::{lambda()#1}::operator()()
    @     0x7e755b238be0        480  mlir::detail::walk<>()
    @     0x7e755b238eac        320  CoalescePass::runOnOperation()
    @     0x7e755bd83c54        416  mlir::detail::OpToOpPassAdaptor::run()
    @     0x7e755bd84650        160  mlir::detail::OpToOpPassAdaptor::runPipeline()
    @     0x7e755bd878b0        368  mlir::PassManager::run()
    @     0x7e75599cf9cc        128  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
    @     0x7e75599bacd4        848  pybind11::cpp_function::dispatcher()
    @      0x18e02ca5e40        112  cfunction_call
    @      0x18e02a3bc2c        160  _PyObject_MakeTpCall
    @      0x18e02c80134        160  method_vectorcall
    @      0x18e02a25738        480  _PyEval_EvalFrameDefault
    @      0x18e02b2a974         64  _PyEval_Vector
    @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
    @      0x18e02a22b6c        480  _PyEval_EvalFrameDefault
    @      0x18e02b2a974         64  _PyEval_Vector
    @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
    @      0x18e02a22b6c        480  _PyEval_EvalFrameDefault
    @      0x18e02b2a974         64  _PyEval_Vector
    @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
    @      0x18e02a22224        480  _PyEval_EvalFrameDefault
    @ ... and at least 196 more frames
[2024-03-29 15:25:41,577 E 30164 30164] logging.cc:361: *** SIGABRT received at time=1711725941 on cpu 45 ***
[2024-03-29 15:25:41,577 E 30164 30164] logging.cc:361: PC: @     0x7e79d800866c  (unknown)  pthread_kill
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e71435459b8  613083184  absl::lts_20220623::AbslFailureSignalHandler()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d800870c        224  pthread_kill
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d7fa1dfc         48  raise
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d7f7d260        336  abort
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d7f94ef0        192  (unknown)
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e79d7f94f94         64  __assert_fail
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755cc4c8b8        112  getBackwardSliceImpl()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755cc4c6f0        112  getBackwardSliceImpl()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755cc4c5a8         64  mlir::getBackwardSlice()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755c78bc10        384  mlir::multiRootGetSlice()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755b235e7c        608  CoalescePass::getCoalescedEncoding()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755b2375d8        256  CoalescePass::runOnOperation()::{lambda()#1}::operator()()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755b238be0        480  mlir::detail::walk<>()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755b238eac        320  CoalescePass::runOnOperation()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755bd83c54        416  mlir::detail::OpToOpPassAdaptor::run()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755bd84650        160  mlir::detail::OpToOpPassAdaptor::runPipeline()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e755bd878b0        368  mlir::PassManager::run()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e75599cf9cc        128  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @     0x7e75599bacd4        848  pybind11::cpp_function::dispatcher()
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02ca5e40        112  cfunction_call
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02a3bc2c        160  _PyObject_MakeTpCall
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02c80134        160  method_vectorcall
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02a25738        480  _PyEval_EvalFrameDefault
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02b2a974         64  _PyEval_Vector
[2024-03-29 15:25:41,581 E 30164 30164] logging.cc:361:     @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a22b6c        480  _PyEval_EvalFrameDefault
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02b2a974         64  _PyEval_Vector
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a22b6c        480  _PyEval_EvalFrameDefault
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02b2a974         64  _PyEval_Vector
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a3b9a0         32  _PyFunction_Vectorcall
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @      0x18e02a22224        480  _PyEval_EvalFrameDefault
[2024-03-29 15:25:41,582 E 30164 30164] logging.cc:361:     @ ... and at least 196 more frames
Fatal Python error: Aborted

Stack (most recent call first):
  File "/root/triton/python/triton/compiler/compiler.py", line 91 in optimize_ttgir
  File "/root/triton/python/triton/compiler/compiler.py", line 383 in <lambda>
  File "/root/triton/python/triton/compiler/compiler.py", line 476 in compile
  File "<string>", line 63 in fused_moe_kernel
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/layers/fused_moe/fused_moe.py", line 222 in invoke_fused_moe_kernel
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/layers/fused_moe/fused_moe.py", line 397 in fused_moe
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/models/mixtral.py", line 131 in forward
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/models/mixtral.py", line 278 in forward
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/models/mixtral.py", line 319 in forward
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/model_executor/models/mixtral.py", line 383 in forward
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/model_runner.py", line 606 in execute_model
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/model_runner.py", line 677 in profile_run
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 122 in profile_num_available_blocks
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 318 in _run_workers
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 221 in _init_cache
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 63 in __init__
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/llm_engine.py", line 103 in __init__
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/llm_engine.py", line 146 in from_engine_args
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/entrypoints/llm.py", line 109 in __init__
  File "/workspace/example.py", line 10 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google.protobuf.pyext._message, setproctitle, uvloop.loop, ray._raylet, grpc._cython.cygrpc, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, cupy_backends.cuda.api._runtime_enum, cupy_backends.cuda.api.runtime, cupy_backends.cuda.stream, cupy_backends.cuda.libs.cublas, cupy_backends.cuda.libs.cusolver, cupy_backends.cuda._softlink, cupy_backends.cuda.libs.cusparse, cupy._util, cupy.cuda.device, fastrlock.rlock, cupy.cuda.memory_hook, cupy.cuda.graph, cupy.cuda.stream, cupy_backends.cuda.api._driver_enum, cupy_backends.cuda.api.driver, cupy.cuda.memory, cupy._core.internal, cupy._core._carray, cupy.cuda.texture, cupy.cuda.function, cupy_backends.cuda.libs.nvrtc, cupy.cuda.jitify, cupy.cuda.pinned_memory, cupy_backends.cuda.libs.curand, cupy_backends.cuda.libs.profiler, cupy.cuda.common, cupy.cuda.cub, cupy_backends.cuda.libs.nvtx, cupy.cuda.thrust, cupy._core._dtype, cupy._core._scalar, cupy._core._accelerator, cupy._core._memory_range, cupy._core._fusion_thread_local, cupy._core._kernel, cupy._core._routines_manipulation, cupy._core._optimize_config, cupy._core._cub_reduction, cupy._core._reduction, cupy._core._routines_binary, cupy._core._routines_math, cupy._core._routines_indexing, cupy._core._routines_linalg, cupy._core._routines_logic, cupy._core._routines_sorting, cupy._core._routines_statistics, cupy._core.dlpack, cupy._core.flags, cupy._core.core, cupy._core._fusion_variable, cupy._core._fusion_trace, cupy._core._fusion_kernel, cupy._core.new_fusion, cupy._core.fusion, cupy._core.raw, cupyx.cusolver, scipy._lib._ccallback_c, numpy.linalg.lapack_lite, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, cupy.cuda.cufft, cupy.fft._cache, cupy.fft._callback, cupy.random._generator_api, cupy.random._bit_generator, scipy._lib._uarray._uarray, 
scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, cupy.lib._polynomial, cupy_backends.cuda.libs.nccl, zstandard.backend_c, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct (total: 182)
Aborted (core dumped)
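
For reference, the assertion fires in the backward-slice walk in /root/llvm-project/mlir/lib/Analysis/SliceAnalysis.cpp. The sketch below paraphrases the LLVM 17-era logic around the failing assert (it is not a verbatim copy of the pinned commit): when an operand is a block argument, the walk can only step out to the op that owns the enclosing region, and it assumes that op has exactly one region containing exactly one block; anything else trips the assert.

// Paraphrased sketch of mlir/lib/Analysis/SliceAnalysis.cpp (LLVM 17 era),
// kept close to the original structure but not copied verbatim.
static void getBackwardSliceImpl(mlir::Operation *op,
                                 llvm::SetVector<mlir::Operation *> *backwardSlice,
                                 mlir::TransitiveFilter filter) {
  if (!op)
    return;
  if (filter && !filter(op))  // caller-provided filter prunes the walk
    return;

  for (mlir::Value operand : op->getOperands()) {
    if (mlir::Operation *definingOp = operand.getDefiningOp()) {
      // Ordinary SSA def: recurse into the defining op.
      if (backwardSlice->count(definingOp) == 0)
        getBackwardSliceImpl(definingOp, backwardSlice, filter);
    } else if (auto blockArg = llvm::dyn_cast<mlir::BlockArgument>(operand)) {
      // Block argument: step to the op that owns the enclosing region. The
      // walk only handles single-region, single-block parents (e.g. a loop
      // body); any other parent op hits the assertion reported above.
      mlir::Operation *parentOp = blockArg.getOwner()->getParentOp();
      assert(parentOp->getNumRegions() == 1 &&
             parentOp->getRegion(0).getBlocks().size() == 1);
      if (backwardSlice->count(parentOp) == 0)
        getBackwardSliceImpl(parentOp, backwardSlice, filter);
    }
  }
  backwardSlice->insert(op);
}

In this crash, Triton's CoalescePass requests a backward slice (via mlir::multiRootGetSlice and mlir::getBackwardSlice, per the stack trace) that reaches a block argument whose parent op does not satisfy that single-region, single-block assumption, so the assert aborts the process.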

Thank you. Let me know if I can provide any more details.

llvmbot commented 7 months ago

@llvm/issue-subscribers-mlir

Author: Navin Kumar M (NavinKumarMNK)

Sirraide commented 7 months ago

Looks like you’re using LLVM 17. Have you tried using a more up-to-date version of LLVM (18 or 19), if that’s possible for you?

NavinKumarMNK commented 7 months ago

Currently triton==2.1.0 pins this commit. If this issue is fixed in a later version, I could try installing from source. Could I get an overview of why this happens and what the possible solutions are?

NavinKumarMNK commented 7 months ago

For more context, I am linking the same issue that I raised in vLLM: https://github.com/vllm-project/vllm/issues/3732

joker-eph commented 7 months ago

Have you reported this to Triton?

NavinKumarMNK commented 7 months ago

Not yet. If it will help you, I can do it.

(I did some experiments in vLLM and found that the error occurs in the Mixture of Experts kernel used in vLLM. Just mentioning this in case it is useful.)

jlebar commented 7 months ago

This is not a bug in LLVM; it's a bug in Triton.