sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[Bug] Server crashes after loading (Mixtral 8x7b) on L4 #1191

Closed · nivibilla closed this 2 days ago

nivibilla commented 2 months ago

Describe the bug

The model fully loads and the server starts, then it instantly crashes.

server_args=ServerArgs(model_path='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_path='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', trust_remote_code=False, context_length=8192, quantization=None, served_model_name='mixtral-8x7b-v0.1', chat_template=None, host='0.0.0.0', port=1234, additional_ports=[1235, 1236, 1237, 1238], mem_fraction_static=0.83, max_running_requests=32, max_num_reqs=32, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=8, stream_interval=1, random_seed=759329088, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=True, enable_mla=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu=0] Init nccl begin.
[gpu=5] Init nccl begin.
[gpu=7] Init nccl begin.
[gpu=1] Init nccl begin.
[gpu=3] Init nccl begin.
[gpu=6] Init nccl begin.
[gpu=2] Init nccl begin.
[gpu=4] Init nccl begin.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(the warning above is repeated once per tensor-parallel rank, 8 times in total)
[gpu=6] Load weight begin. avail mem=21.65 GB
[gpu=5] Load weight begin. avail mem=21.65 GB
[gpu=7] Load weight begin. avail mem=21.65 GB
[gpu=4] Load weight begin. avail mem=21.65 GB
[gpu=3] Load weight begin. avail mem=21.65 GB
[gpu=1] Load weight begin. avail mem=21.65 GB
[gpu=0] Load weight begin. avail mem=21.65 GB
[gpu=2] Load weight begin. avail mem=21.65 GB
Loading safetensors checkpoint shards:   0% Completed | 0/19 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   5% Completed | 1/19 [00:00<00:13,  1.37it/s]
Loading safetensors checkpoint shards:  11% Completed | 2/19 [00:01<00:14,  1.19it/s]
Loading safetensors checkpoint shards:  16% Completed | 3/19 [00:02<00:14,  1.14it/s]
Loading safetensors checkpoint shards:  21% Completed | 4/19 [00:03<00:13,  1.07it/s]
Loading safetensors checkpoint shards:  26% Completed | 5/19 [00:04<00:13,  1.02it/s]
Loading safetensors checkpoint shards:  32% Completed | 6/19 [00:05<00:13,  1.01s/it]
Loading safetensors checkpoint shards:  37% Completed | 7/19 [00:06<00:12,  1.01s/it]
[gpu=7] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
Loading safetensors checkpoint shards:  42% Completed | 8/19 [00:07<00:11,  1.02s/it]
Loading safetensors checkpoint shards:  47% Completed | 9/19 [00:08<00:09,  1.01it/s]
Loading safetensors checkpoint shards:  53% Completed | 10/19 [00:09<00:08,  1.08it/s]
Loading safetensors checkpoint shards:  58% Completed | 11/19 [00:10<00:07,  1.08it/s]
Loading safetensors checkpoint shards:  63% Completed | 12/19 [00:11<00:06,  1.07it/s]
Loading safetensors checkpoint shards:  68% Completed | 13/19 [00:12<00:05,  1.07it/s]
Loading safetensors checkpoint shards:  74% Completed | 14/19 [00:13<00:04,  1.04it/s]
Loading safetensors checkpoint shards:  79% Completed | 15/19 [00:14<00:03,  1.04it/s]
Loading safetensors checkpoint shards:  84% Completed | 16/19 [00:15<00:02,  1.03it/s]
Loading safetensors checkpoint shards:  89% Completed | 17/19 [00:16<00:02,  1.00s/it]
Loading safetensors checkpoint shards:  95% Completed | 18/19 [00:17<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:18<00:00,  1.05it/s]

[gpu=3] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=5] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=4] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=0] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=6] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=1] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=2] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=3] Memory pool end. avail mem=3.63 GB
[gpu=2] Memory pool end. avail mem=3.63 GB
[gpu=5] Memory pool end. avail mem=3.63 GB
[gpu=1] Memory pool end. avail mem=3.63 GB
[gpu=6] Memory pool end. avail mem=3.63 GB
[gpu=7] Memory pool end. avail mem=3.63 GB
[gpu=4] Memory pool end. avail mem=3.63 GB
[gpu=0] Memory pool end. avail mem=3.63 GB
[gpu=1] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=7] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=3] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=6] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=4] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=0] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=5] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=2] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
INFO:     Started server process [28350]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1234/ (Press CTRL+C to quit)
INFO:     127.0.0.1:55458 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu=0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Reproduction

!python -m sglang.launch_server --model-path /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1 --served-model-name mixtral-8x7b-v0.1 --host 0.0.0.0 --port 1234 --tp 8 --context-length 8192 --max-running-requests 32 --max-num-reqs 32 --disable-cuda-graph --enable-p2p-check
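
For reference, the same server can also be launched from Python; a minimal sketch, assuming sglang 0.2.x's `sgl.Runtime` wrapper, which forwards keyword arguments to `ServerArgs` and so mirrors the CLI flags above:

```python
# Sketch: launching the same server via the Python API instead of the CLI.
# Assumes sglang 0.2.x, where sgl.Runtime forwards kwargs to ServerArgs.
import sglang as sgl

runtime = sgl.Runtime(
    model_path="/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1",
    served_model_name="mixtral-8x7b-v0.1",
    host="0.0.0.0",
    port=1234,
    tp_size=8,                 # --tp 8
    context_length=8192,
    max_running_requests=32,
    max_num_reqs=32,
    disable_cuda_graph=True,
    enable_p2p_check=True,
)
```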

Environment

Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA L4
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.161.07
PyTorch: 2.4.0+cu121
sglang: 0.2.13
flashinfer: 0.1.5+cu124torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.31.0
tqdm: 4.65.0
numpy: 1.23.5
aiohttp: 3.8.5
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 23.2
PIL: 9.4.0
psutil: 5.9.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 23.2.0
vllm: 0.5.4
multipart: 0.0.9
openai: 1.42.0
anthropic: 0.34.1
NVIDIA Topology: 
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NODE    NODE    NODE    SYS SYS SYS SYS 0-47,96-143 0       N/A
GPU1    NODE     X  NODE    NODE    SYS SYS SYS SYS 0-47,96-143 0       N/A
GPU2    NODE    NODE     X  NODE    SYS SYS SYS SYS 0-47,96-143 0       N/A
GPU3    NODE    NODE    NODE     X  SYS SYS SYS SYS 0-47,96-143 0       N/A
GPU4    SYS SYS SYS SYS  X  NODE    NODE    NODE    48-95,144-191   1       N/A
GPU5    SYS SYS SYS SYS NODE     X  NODE    NODE    48-95,144-191   1       N/A
GPU6    SYS SYS SYS SYS NODE    NODE     X  NODE    48-95,144-191   1       N/A
GPU7    SYS SYS SYS SYS NODE    NODE    NODE     X  48-95,144-191   1       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1000000
zhaochenyang20 commented 2 months ago

@nivibilla Hey, what do you mean by "crashed"? Your log doesn't show any error output.

nivibilla commented 2 months ago

Hi @zhaochenyang20, sorry, here is the full trace. I think the issue is that the fused MoE Triton kernel is not supported on L4 GPUs. Dense models work absolutely fine; it's only MoE models that crash with a segmentation fault. (A minimal kernel-level repro sketch follows the trace.)

*** SIGSEGV received at time=1724185428 on cpu 55 ***
PC: @           0x5266a0  (unknown)  (unknown)
    @     0x7fca1c095520  (unknown)  (unknown)
    @     0x7fc8a36b1b40  (unknown)  (unknown)
    @           0x95e040  (unknown)  (unknown)
[2024-08-20 20:23:48,285 E 118260 118260] logging.cc:365: *** SIGSEGV received at time=1724185428 on cpu 55 ***
[2024-08-20 20:23:48,288 E 118260 118260] logging.cc:365: PC: @           0x5266a0  (unknown)  (unknown)
[2024-08-20 20:23:48,288 E 118260 118260] logging.cc:365:     @     0x7fca1c095520  (unknown)  (unknown)
[2024-08-20 20:23:48,292 E 118260 118260] logging.cc:365:     @     0x7fc8a36b1b40  (unknown)  (unknown)
[2024-08-20 20:23:48,298 E 118260 118260] logging.cc:365:     @           0x95e040  (unknown)  (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1234 in ast_to_ttir
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/compiler/compiler.py", line 117 in make_ir
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/compiler/compiler.py", line 191 in compile
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/runtime/jit.py", line 416 in run
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/triton/runtime/jit.py", line 167 in <lambda>
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 246 in invoke_fused_moe_kernel
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 513 in fused_experts
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 613 in fused_moe
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 74 in apply
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 209 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 96 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 233 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 277 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 349 in forward
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1341 in execute_model
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 923 in profile_run
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/worker.py", line 179 in determine_num_available_blocks
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 332 in execute_method
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 310 in _run_workers
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362 in _initialize_kv_caches
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263 in __init__
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 520 in _init_engine
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 373 in __init__
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 444 in from_engine_args
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 224 in run_server
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/scripts.py", line 28 in serve
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/lib/python3.11/site-packages/vllm/scripts.py", line 148 in main
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-295657cb-2b00-4279-ab0b-937f60f2f532/bin/vllm", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, simplejson._speedups, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, pvectorc, ujson, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, markupsafe._speedups, PIL._imaging, grpc._cython.cygrpc, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, cuda_utils (total: 118)
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
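
To isolate the crash, here is a minimal kernel-level repro sketch, assuming vLLM 0.5.4's `fused_moe` signature (shapes follow Mixtral's 8-expert, top-2 routing, scaled down to fit a single L4):

```python
# Sketch: call the suspect fused MoE Triton kernel directly on one GPU.
# Assumes vLLM 0.5.4's fused_moe(hidden_states, w1, w2, gating_output,
# topk, renormalize) signature; a segfault here implicates the kernel.
import torch
from vllm.model_executor.layers.fused_moe import fused_moe

E, topk = 8, 2            # experts, experts routed per token (Mixtral)
M, K, N = 16, 512, 1024   # tokens, hidden size, intermediate size (reduced)

x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
w1 = torch.randn(E, 2 * N, K, dtype=torch.bfloat16, device="cuda")  # fused gate+up proj
w2 = torch.randn(E, K, N, dtype=torch.bfloat16, device="cuda")      # down proj
gating = torch.randn(M, E, dtype=torch.bfloat16, device="cuda")     # router logits

out = fused_moe(x, w1, w2, gating, topk, renormalize=True)
print(out.shape)  # expected (M, K) if the kernel compiles and runs
```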
nivibilla commented 2 months ago

A very similar issue here, @zhaochenyang20: https://github.com/vllm-project/vllm/issues/5479

zhaochenyang20 commented 2 months ago

Thanks for pointing this out.

nivibilla commented 2 months ago

@zhaochenyang20 I think I have narrowed it down to the fused_moe Triton kernel causing the issue. I was able to run the model on vLLM 0.2.7, where fused MoE was not yet implemented, and it worked fine.
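
For reference, this is roughly what the fused kernel computes, written as a plain-PyTorch per-expert loop (a sketch with no Triton involved, using the same shapes as the repro sketch above; slow, but usable as a correctness baseline on hardware where the kernel crashes):

```python
# Sketch: naive reference for top-k routed SwiGLU MoE, no Triton.
import torch
import torch.nn.functional as F

def moe_reference(x, w1, w2, gating, topk):
    # x: (M, K), w1: (E, 2N, K), w2: (E, K, N), gating: (M, E)
    probs = F.softmax(gating.float(), dim=-1)
    weights, experts = probs.topk(topk, dim=-1)            # (M, topk)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize, as Mixtral does
    out = torch.zeros_like(x)
    for t in range(topk):
        for e in range(w1.shape[0]):
            mask = experts[:, t] == e   # tokens whose t-th choice is expert e
            if not mask.any():
                continue
            h = x[mask] @ w1[e].t()                # (m, 2N): fused gate+up projection
            gate, up = h.chunk(2, dim=-1)
            act = F.silu(gate) * up                # SwiGLU activation
            out[mask] += weights[mask, t, None].to(x.dtype) * (act @ w2[e].t())
    return out
```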

yzh119 commented 2 months ago

Any possibility to implement fused_moe with grouped GEMM instead of Triton? @zhyncs @merrymercy @Ying1123, the SegmentGEMMWrapper API should suffice.
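
For context, a rough sketch of the grouped-GEMM idea for a single expert projection, assuming flashinfer's `SegmentGEMMWrapper` API (the names and argument layout here are an assumption and may differ; tokens must first be sorted so each expert owns a contiguous segment):

```python
# Sketch: one MoE projection as a segment (grouped) GEMM via flashinfer.
# Assumes flashinfer 0.1.x exposes gemm.SegmentGEMMWrapper(workspace) with
# run(x, weights, batch_size, weight_column_major, seg_lens=...); treat
# this as an assumption, not a confirmed signature.
import torch
import flashinfer

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
segment_gemm = flashinfer.gemm.SegmentGEMMWrapper(workspace)

E, K, N = 8, 512, 1024
w = torch.randn(E, N, K, dtype=torch.bfloat16, device="cuda")  # one weight per expert

# x_sorted holds all routed tokens, stably sorted by expert id, so that
# seg_lens[e] consecutive rows belong to expert e.
x_sorted = torch.randn(64, K, dtype=torch.bfloat16, device="cuda")
seg_lens = torch.full((E,), 8, dtype=torch.int64, device="cuda")

y = segment_gemm.run(
    x_sorted, w,
    batch_size=E,
    weight_column_major=True,  # weights stored as (E, N, K)
    seg_lens=seg_lens,
)
print(y.shape)  # (64, N): each segment multiplied by its expert's weight
```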

zhyncs commented 2 months ago

> Any possibility to implement fused_moe with grouped GEMM instead of Triton?

@yzh119 TurboMind support is expected in about 2 weeks, and by then we will use TurboMind's MoE and quantized MoE. Implementing fused_moe with grouped GEMM would also be useful for DeepSeek-V2, so it will be considered as well. cc @ispobock

nivibilla commented 2 months ago

@zhyncs will the TurboMind implementation of MoE also use the fused_moe kernel, or will it use the grouped GEMM from flashinfer? Also, if there is a PR I can test out, that would be great.

zhyncs commented 2 months ago

@nivibilla The decision has not been finalized yet; please stay tuned.

github-actions[bot] commented 2 days ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.