Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
[X] 5. Please use English, otherwise it will be closed.
Describe the bug

Serving an Int8-GPTQ Qwen2.5-7B-Instruct model with --kv-cache-dtype int8 on a ROCm (gfx1100) machine starts up cleanly, but as soon as the first request is processed, every sampling step logs "Detected errors during sampling! NaN in the logits." Launch command and server log:
python3 -m sglang.launch_server --model-path /root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/ --port 30000 --mem-fraction-static 0.8 --kv-cache-dtype int8 --attention-backend triton --sampling-backend pytorch --tp-size 2
WARNING 11-08 04:42:43 rocm.py:13] fork method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to spawn instead.
[2024-11-08 04:42:51] server_args=ServerArgs(model_path='/root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/', tokenizer_path='/root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='int8', kvint4_groupsize=32, quantization=None, context_length=None, device='cuda', served_model_name='/root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=661408819, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
[2024-11-08 04:43:04 TP0] Init torch distributed begin.
[2024-11-08 04:43:04 TP1] Init torch distributed begin.
[2024-11-08 04:43:07 TP0] Load weight begin. avail mem=23.03 GB
[2024-11-08 04:43:07 TP1] Load weight begin. avail mem=23.47 GB
[2024-11-08 04:43:07 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-11-08 04:43:07 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-11-08 04:43:07 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-11-08 04:43:07 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:03, 1.74s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:03<00:02, 2.02s/it]
[2024-11-08 04:43:12 TP1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=19.09 GB
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.50s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.61s/it]
[2024-11-08 04:43:13 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=18.65 GB
[2024-11-08 04:43:13 TP1] Memory pool end. avail mem=4.58 GB
[2024-11-08 04:43:13 TP0] Memory pool end. avail mem=4.14 GB
[2024-11-08 04:43:13 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-11-08 04:43:13 TP1] Capture cuda graph begin. This can take up to several minutes.
[2024-11-08 04:43:58 TP0] max_total_num_tokens=1033884, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2024-11-08 04:43:58 TP1] max_total_num_tokens=1033884, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2024-11-08 04:43:58] INFO: Started server process [95220]
[2024-11-08 04:43:58] INFO: Waiting for application startup.
[2024-11-08 04:43:58] INFO: Application startup complete.
[2024-11-08 04:43:58] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2024-11-08 04:43:59] INFO: 127.0.0.1:41416 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-11-08 04:43:59 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:06 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:06 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:06 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:06 TP1] Detected errors during sampling! NaN in the logits.
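For context on what the sampler is flagging: with disable_nan_detection=False (see server_args above), SGLang checks the logits for NaN before sampling and emits the message seen here. The sketch below is illustrative only, not SGLang's actual code: the check just mirrors the log message, and the quantization helper is a hypothetical, simplified stand-in for the --kv-cache-dtype int8 path, showing one plausible way NaN could enter the pipeline (a degenerate quantization scale). Whether the real failure sits in the int8 KV-cache path, the triton attention backend, or GPTQ dequantization on gfx1100 is the open question.

```python
import torch

# Illustrative only: the server's NaN detection amounts to testing the
# final logits for NaN before sampling.
def logits_have_nan(logits: torch.Tensor) -> bool:
    return bool(torch.isnan(logits).any())

# Hypothetical, simplified per-block int8 quantization, NOT SGLang's
# actual kernel. An all-zero block yields a zero scale, and 0/0 during
# quantization produces NaN, which would then propagate through
# attention into the logits.
kv_block = torch.zeros(4, 8, dtype=torch.float16)
scale = kv_block.abs().amax().float() / 127.0    # scale == 0 here
quantized = (kv_block.float() / scale).round()   # 0 / 0 -> NaN
print(logits_have_nan(quantized))                # True
```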
Reproduction
python3 -m sglang.launch_server --model-path /root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/ --port 30000 --mem-fraction-static 0.8 --kv-cache-dtype int8 --attention-backend triton --sampling-backend pytorch --tp-size 2
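Any generation request appears sufficient to trigger the errors (the log shows them starting right after the first 6-token prefill batch). A minimal client sketch, assuming SGLang's native /generate endpoint on the port from the launch command; the prompt and sampling parameters are arbitrary examples, not the exact client that was used:

```python
import requests

# Arbitrary example request; any prompt appears to trigger the
# "NaN in the logits" errors on this setup.
response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "Hello, introduce yourself.",
        "sampling_params": {"max_new_tokens": 32, "temperature": 0.7},
    },
)
print(response.json())
```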
Environment
Name: gfx1100
Uuid: GPU-b1d1b7e55cd7ec87
Marketing Name: Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
  L1: 32(0x20) KB
  L2: 6144(0x1800) KB
  L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2070
BDFID: 49920
Internal Node ID: 3
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges: 4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode: 202
SDMA engine uCode: 20
IOMMU Support: None
Pool Info:
  Pool 1
    Segment: GLOBAL; FLAGS: COARSE GRAINED
    Size: 25149440(0x17fc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule: 2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 2
    Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
    Size: 25149440(0x17fc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule: 2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 3
    Segment: GROUP
    Size: 64(0x40) KB
    Allocatable: FALSE
    Alloc Granule: 0KB
    Alloc Recommended Granule: 0KB
    Alloc Alignment: 0KB
    Accessible by all: FALSE
ISA Info:
  ISA 1
    Name: amdgcn-amd-amdhsa--gfx1100
    Machine Models: HSA_MACHINE_MODEL_LARGE
    Profiles: HSA_PROFILE_BASE
    Default Rounding Mode: NEAR
    Default Rounding Mode: NEAR
    Fast f16: TRUE
    Workgroup Max Size: 1024(0x400)
    Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400)
    Grid Max Size: 4294967295(0xffffffff)
    Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff)
    FBarrier Max Size: 32
Done