Closed nivibilla closed 2 days ago
@nivibilla Hey. What do you mean by "crashed"? We don't see any error output in your log.
Hi @zhaochenyang20, sorry, here is the full trace. I think the issue is that the fused MoE Triton kernel is not supported on L4 GPUs. Dense models work absolutely fine; only MoE models crash, with a segmentation fault.
*** SIGSEGV received at time=1724185428 on cpu 55 ***
PC: @ 0x5266a0 (unknown) (unknown)
@ 0x7fca1c095520 (unknown) (unknown)
@ 0x7fc8a36b1b40 (unknown) (unknown)
@ 0x95e040 (unknown) (unknown)
[2024-08-20 20:23:48,285 E 118260 118260] logging.cc:365: *** SIGSEGV received at time=1724185428 on cpu 55 ***
[2024-08-20 20:23:48,288 E 118260 118260] logging.cc:365: PC: @ 0x5266a0 (unknown) (unknown)
[2024-08-20 20:23:48,288 E 118260 118260] logging.cc:365: @ 0x7fca1c095520 (unknown) (unknown)
[2024-08-20 20:23:48,292 E 118260 118260] logging.cc:365: @ 0x7fc8a36b1b40 (unknown) (unknown)
[2024-08-20 20:23:48,298 E 118260 118260] logging.cc:365: @ 0x95e040 (unknown) (unknown)
Fatal Python error: Segmentation fault
Stack (most recent call first):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1234 in ast_to_ttir
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/compiler/compiler.py", line 117 in make_ir
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/compiler/compiler.py", line 191 in compile
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/runtime/jit.py", line 416 in run
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/runtime/jit.py", line 167 in <lambda>
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 246 in invoke_fused_moe_kernel
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 513 in fused_experts
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 613 in fused_moe
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 74 in apply
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 209 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 96 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 233 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 277 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 349 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1341 in execute_model
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 923 in profile_run
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/worker.py", line 179 in determine_num_available_blocks
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 332 in execute_method
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 310 in _run_workers
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362 in _initialize_kv_caches
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 520 in _init_engine
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 373 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 444 in from_engine_args
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 224 in run_server
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/scripts.py", line 28 in serve
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/scripts.py", line 148 in main
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/bin/vllm", line 8 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, simplejson._speedups, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, pvectorc, ujson, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, 
pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, markupsafe._speedups, PIL._imaging, grpc._cython.cygrpc, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, cuda_utils (total: 118)
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
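The "Fatal Python error: Segmentation fault" block followed by "Stack (most recent call first)" in the trace above matches the output format of Python's standard `faulthandler` module. When a crash like this leaves no Python-level stack in the log, a minimal sketch such as the one below can capture one. It is not part of sglang or vLLM, and the log file name is a placeholder chosen for illustration.

```python
import faulthandler
import os
import tempfile


def enable_crash_traces(log_path):
    """Dump Python stacks to log_path on fatal signals such as SIGSEGV."""
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location, used only for this example.
crash_log_path = os.path.join(tempfile.gettempdir(), "crash_trace.log")
crash_log = enable_crash_traces(crash_log_path)
handler_active = faulthandler.is_enabled()
```

The same handler can be switched on without code changes by setting the `PYTHONFAULTHANDLER=1` environment variable or passing `-X faulthandler` to the interpreter before launching the server.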
Very similar issue here @zhaochenyang20 https://github.com/vllm-project/vllm/issues/5479
Thanks for pointing this out.
@zhaochenyang20 I think I have narrowed it down to the fused_moe Triton kernel causing the issue. I was able to run the model with vllm 0.2.7, in which the fused MoE kernel was not implemented yet, and it worked fine.
Any possibility to implement fused_moe with group gemm instead of Triton? @zhyncs @merrymercy @Ying1123, the SegmentGEMMWrapper API should suffice.
Any possibility to implement fused_moe with group gemm instead of Triton?
@yzh119 It is expected that TurboMind will be supported in about 2 weeks, and by then we will use TurboMind's MOE and Quant MOE. Implementing fused_moe with group gemm is useful for DeepSeek V2, so it'll also be considered. cc @ispobock
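For readers unfamiliar with the grouped GEMM idea mentioned here, a rough pure-Python sketch follows. It is an illustration only and makes simplifying assumptions: top-1 routing, no gating weights, no activation function, and plain lists instead of tensors. It does not call flashinfer's SegmentGEMMWrapper or any Triton kernel, and the function names are hypothetical. Tokens are sorted by expert so that each expert's weights are applied to one contiguous segment.

```python
def matvec(weight, x):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def grouped_moe(tokens, expert_ids, expert_weights):
    """Apply one expert matrix per token, processing tokens grouped by expert.

    tokens: list of input vectors
    expert_ids: expert_ids[i] is the expert chosen for tokens[i] (top-1 routing)
    expert_weights: one weight matrix per expert
    """
    # Sort token indices by expert so each expert owns a contiguous segment.
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    outputs = [None] * len(tokens)
    start = 0
    while start < len(order):
        expert = expert_ids[order[start]]
        end = start
        while end < len(order) and expert_ids[order[end]] == expert:
            end += 1
        # One segment per expert: every token in order[start:end] shares weights.
        for i in order[start:end]:
            outputs[i] = matvec(expert_weights[expert], tokens[i])
        start = end
    return outputs


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
expert_ids = [1, 0, 1]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: doubles its input
]
result = grouped_moe(tokens, expert_ids, expert_weights)
```

In a real kernel each segment would be a single batched matrix multiplication on the GPU; the loop over tokens inside a segment stands in for that here.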
@zhyncs Will the TurboMind implementation of MoE also use the fused_moe kernel, or will it use the group GEMM from flashinfer? Also, if there is a PR that I can test out, that would be great too.
@nivibilla The decision has not been finalized yet, please stay tuned.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
Checklist
Describe the bug
The model fully loads, the server starts, and then it crashes immediately.
Reproduction
!python -m sglang.launch_server --model-path /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1 --served-model-name mixtral-8x7b-v0.1 --host 0.0.0.0 --port 1234 --tp 8 --context-length 8192 --max-running-requests 32 --max-num-reqs 32 --disable-cuda-graph --enable-p2p-check
Environment