vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: segfault when loading MoE models #7960

Open nivibilla opened 2 weeks ago

nivibilla commented 2 weeks ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35

Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1065-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4
GPU 4: NVIDIA L4
GPU 5: NVIDIA L4
GPU 6: NVIDIA L4
GPU 7: NVIDIA L4

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (12 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.12.1
[pip3] pyzmq==23.2.0
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.4.0
[pip3] torcheval==0.0.7
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@4db5176d9758b720b05460c50ace3c01026eb158
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU1  NODE  X     NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU2  NODE  NODE  X     NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU3  NODE  NODE  NODE  X     SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU4  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  48-95,144-191  1              N/A
GPU5  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  48-95,144-191  1              N/A
GPU6  SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  48-95,144-191  1              N/A
GPU7  SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     48-95,144-191  1              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

šŸ› Describe the bug

šŸ› Describe the bug

Cannot run Mixtral 8x7B.
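
For reference, a minimal offline equivalent of the `vllm serve` invocation in the logs below; a sketch that mirrors the logged tensor-parallel settings (the prompt is illustrative, not from the original report):

```python
# Hedged repro sketch: offline equivalent of the serve command logged below.
# Constructing the engine alone should be enough to hit the crash, since the
# segfault occurs during the profiling run at startup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=8,            # matches tensor_parallel_size=8 in the args
    max_model_len=8192,                # matches max_model_len=8192 in the args
    distributed_executor_backend="ray",
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```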

INFO 08-20 20:23:00 api_server.py:212] vLLM API server version 0.5.2
INFO 08-20 20:23:00 api_server.py:213] args: Namespace(model_tag='mistralai/Mixtral-8x7B-Instruct-v0.1/', host='0.0.0.0', port=1234, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mistralai/Mixtral-8x7B-Instruct-v0.1/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=['mixtral-8x7b-instruct-v0.1'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fc892b21800>)
2024-08-20 20:23:02,965 INFO worker.py:1740 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 08-20 20:23:11 llm_engine.py:174] Initializing an LLM engine (v0.5.2) with config: model='/dbfs/mnt/dna_pai_tvc/nbilla/hf_cache/mistralai/Mixtral-8x7B-Instruct-v0.1/', speculative_config=None, tokenizer='/dbfs/mnt/dna_pai_tvc/nbilla/hf_cache/mistralai/Mixtral-8x7B-Instruct-v0.1/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=mixtral-8x7b-instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-20 20:23:37 utils.py:737] Found nccl from library libnccl.so.2
INFO 08-20 20:23:37 pynccl.py:63] vLLM is using nccl==2.22.3
(RayWorkerWrapper pid=141052) INFO 08-20 20:23:37 utils.py:737] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=141052) INFO 08-20 20:23:37 pynccl.py:63] vLLM is using nccl==2.22.3
WARNING 08-20 20:23:37 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=141052) WARNING 08-20 20:23:37 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-20 20:23:44 model_runner.py:266] Loading model weights took 10.8853 GB
(RayWorkerWrapper pid=141052) INFO 08-20 20:23:45 model_runner.py:266] Loading model weights took 10.8853 GB
(RayWorkerWrapper pid=142816) INFO 08-20 20:23:37 utils.py:737] Found nccl from library libnccl.so.2 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=142816) INFO 08-20 20:23:37 pynccl.py:63] vLLM is using nccl==2.22.3 [repeated 6x across cluster]
(RayWorkerWrapper pid=142816) WARNING 08-20 20:23:37 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 6x across cluster]
*** SIGSEGV received at time=1724185428 on cpu 55 ***
PC: @           0x5266a0  (unknown)  (unknown)
    @     0x7fca1c095520  (unknown)  (unknown)
    @     0x7fc8a36b1b40  (unknown)  (unknown)
    @           0x95e040  (unknown)  (unknown)
[2024-08-20 20:23:48,285 E 118260 118260] logging.cc:365: *** SIGSEGV received at time=1724185428 on cpu 55 ***
[2024-08-20 20:23:48,288 E 118260 118260] logging.cc:365: PC: @           0x5266a0  (unknown)  (unknown)
[2024-08-20 20:23:48,288 E 118260 118260] logging.cc:365:     @     0x7fca1c095520  (unknown)  (unknown)
[2024-08-20 20:23:48,292 E 118260 118260] logging.cc:365:     @     0x7fc8a36b1b40  (unknown)  (unknown)
[2024-08-20 20:23:48,298 E 118260 118260] logging.cc:365:     @           0x95e040  (unknown)  (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1234 in ast_to_ttir
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/compiler/compiler.py", line 117 in make_ir
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/compiler/compiler.py", line 191 in compile
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/runtime/jit.py", line 416 in run
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/runtime/jit.py", line 167 in <lambda>
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 246 in invoke_fused_moe_kernel
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 513 in fused_experts
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 613 in fused_moe
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 74 in apply
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 209 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 96 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 233 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 277 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 349 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1341 in execute_model
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 923 in profile_run
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/worker.py", line 179 in determine_num_available_blocks
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 332 in execute_method
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 310 in _run_workers
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362 in _initialize_kv_caches
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263 in __init__
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 520 in _init_engine
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 373 in __init__
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 444 in from_engine_args
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 224 in run_server
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/scripts.py", line 28 in serve
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/scripts.py", line 148 in main
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/bin/vllm", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, simplejson._speedups, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, pvectorc, ujson, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, markupsafe._speedups, PIL._imaging, grpc._cython.cygrpc, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, cuda_utils (total: 118)
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Mixtral, and any other mixture-of-experts model, fails due to an error in the fused_moe Triton kernel. I think it's an issue with the NVIDIA driver, but I can't easily update the driver on Databricks.
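
Since the crash fires while Triton JIT-compiles the fused MoE kernel (`ast_to_ttir` in the stack above), it may be possible to reproduce it outside the engine by calling the kernel directly. A minimal sketch, assuming the `fused_moe` entry point named in the traceback and illustrative Mixtral-like shapes:

```python
# Hedged isolation sketch: drive vLLM's fused MoE Triton kernel directly, so
# the first call triggers the same JIT compile that segfaults in the engine.
# Shapes are illustrative (Mixtral-like), not taken from the report.
import torch
from vllm.model_executor.layers.fused_moe import fused_moe

num_tokens, hidden, intermediate, num_experts, topk = 16, 4096, 14336, 8, 2
x = torch.randn(num_tokens, hidden, dtype=torch.bfloat16, device="cuda")
# w1 holds the fused gate+up projections, hence the 2x intermediate dim.
w1 = torch.randn(num_experts, 2 * intermediate, hidden,
                 dtype=torch.bfloat16, device="cuda")
w2 = torch.randn(num_experts, hidden, intermediate,
                 dtype=torch.bfloat16, device="cuda")
gating = torch.randn(num_tokens, num_experts,
                     dtype=torch.float32, device="cuda")

out = fused_moe(x, w1, w2, gating, topk=topk, renormalize=True)
print(out.shape)  # expected: torch.Size([16, 4096])
```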

Downgrading to 0.2.7 works since the fused MoE kernel is not there, but models like DeepSeek don't work since they don't have a non-fused implementation.


robertgshaw2-neuralmagic commented 2 weeks ago

I am looking into your issue.

robertgshaw2-neuralmagic commented 2 weeks ago

Hey @nivibilla, are you sure the output of `collect_env.py` is correct?

From the logs, it looks like you are using vLLM v0.5.2, but your `collect_env.py` output lists v0.5.4.
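
A quick way to double-check which build is actually loaded; a minimal sketch using only the standard `__version__` attributes:

```python
# Print the versions relevant to this report, from the active environment.
import torch
import triton
import vllm

print("vllm  :", vllm.__version__)
print("torch :", torch.__version__)
print("triton:", triton.__version__)
```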

nivibilla commented 2 weeks ago

Hey @robertgshaw2-neuralmagic, I must have mixed up the logs, but I can tell you for sure that I tested with 0.5.5 today and it still gave this error. I can run it again tomorrow and get you updated logs.

robertgshaw2-neuralmagic commented 2 weeks ago

No worries. I am going to build an env that matches your `collect_env.py` and work off that.

Can you quickly verify that the info is accurate?

nivibilla commented 2 weeks ago

@robertgshaw2-neuralmagic Actually, I just set it to run now; I will get updated logs within 30 mins.

nivibilla commented 2 weeks ago

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1067-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4
GPU 4: NVIDIA L4
GPU 5: NVIDIA L4
GPU 6: NVIDIA L4
GPU 7: NVIDIA L4

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (12 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] optree==0.12.1
[pip3] pyzmq==23.2.0
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.4.0
[pip3] torcheval==0.0.7
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU1  NODE  X     NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU2  NODE  NODE  X     NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU3  NODE  NODE  NODE  X     SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU4  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  48-95,144-191  1              N/A
GPU5  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  48-95,144-191  1              N/A
GPU6  SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  48-95,144-191  1              N/A
GPU7  SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     48-95,144-191  1              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
nivibilla commented 2 weeks ago

2024-08-28 20:47:39.269035: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO 08-28 20:47:43 api_server.py:440] vLLM API server version 0.5.5
INFO 08-28 20:47:43 api_server.py:441] args: Namespace(model_tag='/local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/', host='0.0.0.0', port=1234, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=32, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-v2-lite-16b-chat'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f7ec5ac7f60>)
INFO 08-28 20:47:43 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/7f73cdf8-77fd-4f42-9129-716d2b7eb02e for RPC Path.
INFO 08-28 20:47:43 api_server.py:161] Started engine process with PID 12110
2024-08-28 20:48:14.000259: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-28 20:48:47,133 INFO worker.py:1772 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 08-28 20:48:54 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='/local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/', speculative_config=None, tokenizer='/local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-v2-lite-16b-chat, use_v2_block_manager=True, enable_prefix_caching=True)
INFO 08-28 20:48:55 ray_gpu_executor.py:133] use_ray_spmd_worker: False
(pid=35511) 2024-08-28 20:49:11.122670: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(pid=35511) To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=36371) 2024-08-28 20:49:45.089059: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(pid=36371) To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=36946) 2024-08-28 20:50:13.508833: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(pid=36946) To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=37526) 2024-08-28 20:50:44.384781: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(pid=37526) To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=38111) 2024-08-28 20:51:03.160547: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(pid=38111) To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=38554) 2024-08-28 20:51:37.357566: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(pid=38554) To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=39124) 2024-08-28 20:52:09.892823: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(pid=39124) To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=39666) 2024-08-28 20:52:27.037792: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(pid=39666) To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO 08-28 20:53:01 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-28 20:53:01 pynccl.py:63] vLLM is using nccl==2.22.3
(RayWorkerWrapper pid=36371) INFO 08-28 20:53:01 utils.py:975] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=36371) INFO 08-28 20:53:01 pynccl.py:63] vLLM is using nccl==2.22.3
WARNING 08-28 20:53:02 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-28 20:53:02 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f5060edbfd0>, local_subscribe_port=52543, remote_subscribe_port=None)
INFO 08-28 20:53:02 model_runner.py:879] Starting to load model /local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/...
(RayWorkerWrapper pid=36371) WARNING 08-28 20:53:02 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=36371) INFO 08-28 20:53:02 model_runner.py:879] Starting to load model /local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/...
Cache shape torch.Size([163840, 64])
(RayWorkerWrapper pid=36371) Cache shape torch.Size([163840, 64])
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.28it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  2.01it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.91it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.05it/s]

INFO 08-28 20:53:04 model_runner.py:890] Loading model weights took 3.7374 GB
(RayWorkerWrapper pid=36946) INFO 08-28 20:53:07 model_runner.py:890] Loading model weights took 3.7374 GB
(RayWorkerWrapper pid=39666) INFO 08-28 20:53:01 utils.py:975] Found nccl from library libnccl.so.2 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=39666) INFO 08-28 20:53:01 pynccl.py:63] vLLM is using nccl==2.22.3 [repeated 6x across cluster]
(RayWorkerWrapper pid=39666) WARNING 08-28 20:53:02 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 6x across cluster]
(RayWorkerWrapper pid=39666) INFO 08-28 20:53:02 model_runner.py:879] Starting to load model /local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/... [repeated 6x across cluster]
(RayWorkerWrapper pid=39666) Cache shape torch.Size([163840, 64]) [repeated 6x across cluster]
(RayWorkerWrapper pid=36946) *** SIGSEGV received at time=1724878389 on cpu 152 ***
(RayWorkerWrapper pid=36946) PC: @           0x5266a0  (unknown)  (unknown)
(RayWorkerWrapper pid=36946)     @     0x7fd33ae47520   47342104  (unknown)
(RayWorkerWrapper pid=36946)     @     0x7fd2d84b6400  (unknown)  (unknown)
(RayWorkerWrapper pid=36946)     @           0x95e040  (unknown)  (unknown)
(RayWorkerWrapper pid=36946) [2024-08-28 20:53:09,597 E 36946 36946] logging.cc:440: *** SIGSEGV received at time=1724878389 on cpu 152 ***
(RayWorkerWrapper pid=36946) [2024-08-28 20:53:09,602 E 36946 36946] logging.cc:440: PC: @           0x5266a0  (unknown)  (unknown)
(RayWorkerWrapper pid=36946) [2024-08-28 20:53:09,602 E 36946 36946] logging.cc:440:     @     0x7fd33ae47520   47342104  (unknown)
(RayWorkerWrapper pid=36946) [2024-08-28 20:53:09,607 E 36946 36946] logging.cc:440:     @     0x7fd2d84b6400  (unknown)  (unknown)
(RayWorkerWrapper pid=36946) [2024-08-28 20:53:09,617 E 36946 36946] logging.cc:440:     @           0x95e040  (unknown)  (unknown)
(RayWorkerWrapper pid=36946) Fatal Python error: Segmentation fault
(RayWorkerWrapper pid=36946) 
(RayWorkerWrapper pid=36946) Stack (most recent call first):
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 223 in __init__
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1069 in call_JitFunction
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1109 in visit_Call
(RayWorkerWrapper pid=36946)   File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in <listcomp>
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in visit_For
(RayWorkerWrapper pid=36946)   File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 351 in visit_compound_statement
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 443 in visit_FunctionDef
(RayWorkerWrapper pid=36946)   File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36946)   File "/usr/lib/python3.11/ast.py", line 418 in generic_visit
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 359 in visit_Module
(RayWorkerWrapper pid=36946)   File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1297 in ast_to_ttir
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113 in make_ir
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276 in compile
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/runtime/jit.py", line 662 in run
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/runtime/jit.py", line 345 in <lambda>
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 258 in invoke_fused_moe_kernel
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 565 in fused_experts
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 99 in forward_cuda
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/custom_op.py", line 14 in forward
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 68 in apply
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 287 in forward
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 148 in forward
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 401 in forward
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 461 in forward
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 504 in forward
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1415 in execute_model
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1097 in profile_run
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/worker/worker.py", line 222 in determine_num_available_blocks
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 451 in execute_method
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/ray/util/tracing/tracing_helper.py", line 467 in _resume_span
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/ray/_private/function_manager.py", line 691 in actor_method_executor
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/ray/_private/worker.py", line 887 in main_loop
(RayWorkerWrapper pid=36946)   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/ray/_private/workers/default_worker.py", line 289 in <module>
(RayWorkerWrapper pid=36946) 
(RayWorkerWrapper pid=36946) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, simplejson._speedups, uvloop.loop, ray._raylet, pvectorc, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, msgspec._core, sentencepiece._sentencepiece, PIL._imaging, PIL._imagingft, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, _cffi_backend, regex._regex, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, 
zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, cuda_utils, __triton_launcher (total: 167)
*** SIGSEGV received at time=1724878389 on cpu 3 ***
PC: @           0x5266a0  (unknown)  (unknown)
    @     0x7f531f9ac520  (unknown)  (unknown)
    @     0x7f51aa275180  (unknown)  (unknown)
    @           0x95e040  (unknown)  (unknown)
[2024-08-28 20:53:09,763 E 12110 12110] logging.cc:440: *** SIGSEGV received at time=1724878389 on cpu 3 ***
[2024-08-28 20:53:09,768 E 12110 12110] logging.cc:440: PC: @           0x5266a0  (unknown)  (unknown)
[2024-08-28 20:53:09,768 E 12110 12110] logging.cc:440:     @     0x7f531f9ac520  (unknown)  (unknown)
[2024-08-28 20:53:09,774 E 12110 12110] logging.cc:440:     @     0x7f51aa275180  (unknown)  (unknown)
[2024-08-28 20:53:09,784 E 12110 12110] logging.cc:440:     @           0x95e040  (unknown)  (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 223 in __init__
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1069 in call_JitFunction
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1109 in visit_Call
  File "/usr/lib/python3.11/ast.py", line 410 in visit
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in <listcomp>
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in visit_For
  File "/usr/lib/python3.11/ast.py", line 410 in visit
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 351 in visit_compound_statement
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 443 in visit_FunctionDef
  File "/usr/lib/python3.11/ast.py", line 410 in visit
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
  File "/usr/lib/python3.11/ast.py", line 418 in generic_visit
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 359 in visit_Module
  File "/usr/lib/python3.11/ast.py", line 410 in visit
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1297 in ast_to_ttir
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113 in make_ir
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276 in compile
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/runtime/jit.py", line 662 in run
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/triton/runtime/jit.py", line 345 in <lambda>
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 258 in invoke_fused_moe_kernel
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 565 in fused_experts
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 99 in forward_cuda
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/custom_op.py", line 14 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 68 in apply
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 287 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 148 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 401 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 461 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 504 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1415 in execute_model
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1097 in profile_run
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/worker/worker.py", line 222 in determine_num_available_blocks
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 451 in execute_method
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 407 in _run_workers
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 390 in _initialize_kv_caches
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 284 in __init__
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 272 in __init__
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 840 in _init_engine
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 636 in __init__
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 740 in from_engine_args
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 31 in __init__
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-0641b28f-50c7-4389-802a-920415ccffef/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 230 in run_rpc_server
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.11/multiprocessing/spawn.py", line 133 in _main
  File "/usr/lib/python3.11/multiprocessing/spawn.py", line 120 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, simplejson._speedups, yaml._yaml, msgspec._core, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, PIL._imaging, PIL._imagingft, google._upb._message, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, msgpack._cmsgpack, setproctitle, uvloop.loop, ray._raylet, pvectorc, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, _cffi_backend, regex._regex, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups, ujson, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, 
zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, grpc._cython.cygrpc, cuda_utils, __triton_launcher (total: 169)
ERROR 08-28 20:53:13 api_server.py:171] RPCServer process died before responding to readiness probe
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
nivibilla commented 2 weeks ago

@robertgshaw2-neuralmagic please see the newest logs. I just ran this on Databricks with a g6.48xlarge and the DBR 15.4 ML runtime.

robertgshaw2-neuralmagic commented 2 weeks ago

@robertgshaw2-neuralmagic please see the newest logs. I just ran this on Databricks with a g6.48xlarge and the DBR 15.4 ML runtime.

Great - will try to reproduce tomorrow

fengyang95 commented 1 week ago

I'm seeing a similar issue. I am using 8x L40 GPUs and trying to load the deepseek-coder-v2-instruct model with the following command:

# with vllm==0.5.5 cuda12.4
python3 -m vllm.entrypoints.openai.api_server --model $LOCAL_PATH --served-model-name dsv2 --trust-remote-code --tensor-parallel-size 8 --max-model-len 16384 --port 10002 --dtype auto --root-path $ROUTE_PATH --gpu-memory-utilization 0.9 --cpu-offload-gb 35 >> deepseek_7b.log 2>&1

It then errors out immediately.

READS in the external environment to tune this value as needed.
INFO 09-03 16:12:39 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=23415) INFO 09-03 16:12:39 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23416) INFO 09-03 16:12:39 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23417) INFO 09-03 16:12:39 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23418) INFO 09-03 16:12:39 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23419) INFO 09-03 16:12:39 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23420) INFO 09-03 16:12:39 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23421) INFO 09-03 16:12:39 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23416) INFO 09-03 16:12:41 utils.py:975] Found nccl from library libnccl.so.2
INFO 09-03 16:12:41 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23418) INFO 09-03 16:12:41 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23417) INFO 09-03 16:12:41 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23415) INFO 09-03 16:12:41 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23416) INFO 09-03 16:12:41 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=23419) INFO 09-03 16:12:41 utils.py:975] Found nccl from library libnccl.so.2
INFO 09-03 16:12:41 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=23420) INFO 09-03 16:12:41 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23421) INFO 09-03 16:12:41 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23417) INFO 09-03 16:12:41 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=23418) INFO 09-03 16:12:41 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=23415) INFO 09-03 16:12:41 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=23419) INFO 09-03 16:12:41 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=23420) INFO 09-03 16:12:41 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=23421) INFO 09-03 16:12:41 pynccl.py:63] vLLM is using nccl==2.21.5
ERROR 09-03 16:12:45 api_server.py:171] RPCServer process died before responding to readiness probe
nivibilla commented 1 week ago

@fengyang95 I've noticed that using Ray as the backend gives better error traces. Maybe if you try loading with Ray you'll see the actual error before the RPCServer dies.

fengyang95 commented 1 week ago

@fengyang95 I've noticed that using Ray as the backend gives better error traces. Maybe if you try loading with Ray you'll see the actual error before the RPCServer dies.

@nivibilla with --worker-use-ray?

nivibilla commented 1 week ago

@fengyang95 use --distributed-executor-backend ray
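
For anyone following along, here is a minimal sketch of the same setting through the Python API; the model path is a placeholder, and this assumes the `distributed_executor_backend` engine argument (which mirrors the CLI flag) is accepted by your vLLM version:

```python
from vllm import LLM

# Sketch only: the model path is a placeholder; assumes a single node with 8 GPUs.
llm = LLM(
    model="/path/to/deepseek-coder-v2-instruct",
    trust_remote_code=True,
    tensor_parallel_size=8,
    distributed_executor_backend="ray",  # Ray surfaces per-worker tracebacks
)
```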

fengyang95 commented 1 week ago

With --distributed-executor-backend ray I get what seems to be a different error. Did I configure something incorrectly?

INFO 09-03 19:48:00 api_server.py:440] vLLM API server version 0.5.5
INFO 09-03 19:48:00 api_server.py:441] args: Namespace(host=None, port=9233, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path='/models/deepseek/v2/code_ai_quant', middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/opt/tiger/deepseek_http/deepseek_7b', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=16384, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=35.0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['dsv2'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 09-03 19:48:00 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/0a9dffed-d01a-47c9-ab13-e31c29a82190 for RPC Path.
INFO 09-03 19:48:00 api_server.py:161] Started engine process with PID 123773
2024-09-03 19:48:04,976 INFO worker.py:1783 -- Started a local Ray instance.
(autoscaler +6s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +6s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:2605:340:cd51:3c00:aff7:e27a:77b:41bf': 0.001}. Add suitable node types to this cluster to resolve this issue.
INFO 09-03 19:48:18 ray_utils.py:178] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:2605:340:cd51:3c00:aff7:e27a:77b:41bf': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.
nivibilla commented 1 week ago

@fengyang95 are you sure your cluster is a single node with 8 GPUs? Can you run nvidia-smi to check?
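
If nvidia-smi looks right, it can also help to confirm the GPUs are visible from the same Python environment vLLM runs in (a quick sketch; CUDA_VISIBLE_DEVICES or a mismatched driver/runtime can hide devices from PyTorch even when nvidia-smi shows them):

```python
import torch

# Should report 8 on a single-node 8-GPU box.
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```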

robertgshaw2-neuralmagic commented 1 week ago

Update from my side is that I have still not been able to repro. I'm working on setting up a Databricks account so I can use the same env as you have here.

fengyang95 commented 1 week ago

Yes @nivibilla

[screenshot: nvidia-smi output]
nivibilla commented 1 week ago

@robertgshaw2-neuralmagic thank you! I can confirm, however, that I got the same error today with the same g6.48x cluster when I tried to load the DeepSeek V2 model. Is there anything else I can do to help? I'm happy to jump on a call or something to help you debug it.

robertgshaw2-neuralmagic commented 1 week ago

@robertgshaw2-neuralmagic thank you! I can confirm, however, that I got the same error today with the same g6.48x cluster when I tried to load the DeepSeek V2 model. Is there anything else I can do to help? I'm happy to jump on a call or something to help you debug it.

Can you send me a note to rshaw[at]neuralmagic[dot]com

nivibilla commented 1 week ago

Done.

Also @fengyang95, you might want to open a separate issue for yours. I don't think it's to do with the Triton kernel; your vLLM instance isn't recognizing the GPUs for some reason.

fengyang95 commented 1 week ago

@nivibilla Maybe it's related to the CUDA & kernel versions? I switched to the following environment, and then it started normally.

Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.27.5
Libc version: glibc-2.31

Python version: 3.9.2 (default, Feb 28 2021, 17:03:44)  [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-5.4.143.bsk.8-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40
GPU 1: NVIDIA L40
GPU 2: NVIDIA L40
GPU 3: NVIDIA L40
GPU 4: NVIDIA L40
GPU 5: NVIDIA L40
GPU 6: NVIDIA L40
GPU 7: NVIDIA L40

Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   52 bits physical, 57 bits virtual
CPU(s):                          180
On-line CPU(s) list:             0-179
Thread(s) per core:              2
Core(s) per socket:              45
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8457C
Stepping:                        8
CPU MHz:                         2598.271
BogoMIPS:                        5196.54
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       4.2 MiB
L1i cache:                       2.8 MiB
L2 cache:                        180 MiB
L3 cache:                        195 MiB
NUMA node0 CPU(s):               0-89
NUMA node1 CPU(s):               90-179
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchaudio==2.0.2
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     2-89    0               N/A
GPU1    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     SYS     2-89    0               N/A
GPU2    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     SYS     2-89    0               N/A
GPU3    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     SYS     2-89    0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     92-177  1               N/A
GPU5    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     92-177  1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     92-177  1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     92-177  1               N/A
NIC0    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
nivibilla commented 1 week ago

@robertgshaw2-neuralmagic I noticed there is a QuantMixtralForCausalLM arch, which uses mixtral_quant.py.

By monkey-patching the architecture in the Mixtral model's config, I'm able to load and run inference even on the latest vLLM 0.5.5, like this. Essentially it uses the mixtral_quant implementation, but without any quantisation. I think if I modify other model files, such as Jamba or DeepSeek, to skip the fused MoE kernel and go back to the module-list-style implementation, they will probably work too, albeit slower.

import json

# Point at the local copy of the model's config.json.
file_path = '/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1/config.json'
with open(file_path, 'r') as f:
    config_data = json.load(f)

# Swap the architecture so vLLM loads the model via mixtral_quant.py
# (module-list experts) instead of the fused-MoE Mixtral implementation.
config_data['architectures'] = ["QuantMixtralForCausalLM"]

with open(file_path, 'w') as f:
    json.dump(config_data, f, indent=2)
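
For context, a minimal sketch of what a module-list-style (unfused) MoE forward looks like, with standard top-k routing; the names and the plain MLP experts here are illustrative, not vLLM's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveMoE(nn.Module):
    """Unfused MoE: a plain nn.ModuleList of experts, no fused Triton kernel."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size, bias=False),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_size]
        weights = F.softmax(self.gate(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize

        out = torch.zeros_like(x)
        # Loop over experts instead of launching one fused kernel:
        # slower, but it avoids the Triton code path entirely.
        for e, expert in enumerate(self.experts):
            rows, slots = (topk_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            out[rows] += topk_w[rows, slots, None] * expert(x[rows])
        return out
```

In this style each expert is an ordinary nn.Module, so a crash points at a plain PyTorch op rather than a JIT-compiled Triton kernel, at the cost of one kernel launch per expert.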