Closed: bltcn closed this issue 3 months ago.
python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4
This should be enough to launch the server. Other arguments are not required. I tested it on my machine and it works well.
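If it starts cleanly on your side, a quick sanity check against the native /generate endpoint looks like this (a minimal sketch; the prompt and sampling parameters are placeholders, not from this issue):

import requests

# Minimal smoke test for the native sglang HTTP API at the default host/port.
response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "Hello, my name is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(response.json())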
Sorry, it's not OK. Please note that my GPUs are 4x RTX 2080 Ti. I used your command and got this:

root@b1107d56399c:/sgl-workspace# python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4
server_args=ServerArgs(model_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_mode='auto', load_format='auto', dtype='half', trust_remote_code=True, context_length=None, quantization=None, chat_template=None, host='127.0.0.1', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.86, max_prefill_tokens=None, max_running_requests=None, schedule_heuristic='lpm', schedule_conservativeness=1.0, tp_size=4, stream_interval=1, random_seed=41592065, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_torch_compile=False, attention_reduce_in_fp32=False, enable_p2p_check=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu_id=0] Init nccl begin.
[gpu_id=2] Init nccl begin.
[gpu_id=1] Init nccl begin.
[gpu_id=3] Init nccl begin.
WARNING 07-27 14:04:50 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. (printed once per rank)
[gpu_id=2] Load weight begin. avail mem=21.39 GB
[gpu_id=0] Load weight begin. avail mem=21.39 GB
[gpu_id=1] Load weight begin. avail mem=21.34 GB
[gpu_id=3] Load weight begin. avail mem=21.34 GB
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:42<00:00, 3.85s/it]
[gpu_id=2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.57 GB
[gpu_id=3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.52 GB
[gpu_id=1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.52 GB
[gpu_id=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.57 GB
[gpu_id=3] Memory pool end. avail mem=2.52 GB
[gpu_id=0] Memory pool end. avail mem=2.57 GB
[gpu_id=2] Memory pool end. avail mem=2.57 GB
[gpu_id=1] Memory pool end. avail mem=2.52 GB
[gpu_id=0] Capture cuda graph begin. This can take up to several minutes. (printed once per rank)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/handler.cuh: line 170 at function cudaOccupancyMaxActiveBlocksPerMultiprocessor( &num_blocks_per_sm, partition_kv_kernel, num_threads, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/handler.cuh: line 325 at function work_estimation_func(split_kv, max_grid_size, max_num_pages_per_batch, new_batch_size, batch_size, indptr_h, num_qo_heads, pagesize, IsCUDAGraphEnabled(), stream)
(the same pair of CUDA errors is printed once per rank)
Initialization failed.
controller_init_state:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 262, in init_cuda_graphs
    self.cuda_graph_runner.capture(batch_size_list)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/cuda_graph_runner.py", line 114, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/cuda_graph_runner.py", line 149, in capture_one_batch_size
    init_flashinfer_args(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/infer_batch.py", line 884, in init_flashinfer_args
    flashinfer_decode_wrapper.begin_forward(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/decode.py", line 514, in begin_forward
    self._wrapper.begin_forward(
RuntimeError: BatchDecodeWithPagedKVCache failed with error no kernel image is available for execution on the device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/manager_single.py", line 135, in start_controller_process
    controller = ControllerSingle(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/manager_single.py", line 69, in __init__
    self.tp_server = ModelTpServer(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 72, in __init__
    self.model_runner = ModelRunner(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 103, in __init__
    self.init_cuda_graphs()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 264, in init_cuda_graphs
    raise Exception(
Exception: Capture cuda graph failed: BatchDecodeWithPagedKVCache failed with error no kernel image is available for execution on the device. Possible solutions:
Initialization failed.
detoken_init_state: init ok
(the same RuntimeError/Exception pair is then raised in run_tp_server on each of the other three TP worker processes, Process-1:1 through Process-1:3)
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Then I added --disable-cuda-graph and got this:

root@b1107d56399c:/sgl-workspace# python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-cuda-graph
server_args=ServerArgs(model_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_mode='auto', load_format='auto', dtype='half', trust_remote_code=True, context_length=None, quantization=None, chat_template=None, host='127.0.0.1', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.86, max_prefill_tokens=None, max_running_requests=None, schedule_heuristic='lpm', schedule_conservativeness=1.0, tp_size=4, stream_interval=1, random_seed=614925003, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_disk_cache=False, enable_torch_compile=False, attention_reduce_in_fp32=False, enable_p2p_check=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu_id=0] Init nccl begin.
[gpu_id=1] Init nccl begin.
[gpu_id=2] Init nccl begin.
[gpu_id=3] Init nccl begin.
WARNING 07-27 14:06:21 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. (printed once per rank)
[gpu_id=0] Load weight begin. avail mem=21.39 GB
[gpu_id=3] Load weight begin. avail mem=21.34 GB
[gpu_id=1] Load weight begin. avail mem=21.34 GB
[gpu_id=2] Load weight begin. avail mem=21.39 GB
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:34<00:00, 3.16s/it]
[gpu_id=2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.57 GB
[gpu_id=3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.52 GB
[gpu_id=1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.52 GB
[gpu_id=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.57 GB
[gpu_id=1] Memory pool end. avail mem=2.52 GB
[gpu_id=3] Memory pool end. avail mem=2.52 GB
[gpu_id=0] Memory pool end. avail mem=2.57 GB
[gpu_id=2] Memory pool end. avail mem=2.57 GB
[gpu_id=0] max_total_num_tokens=111822, max_prefill_tokens=16384, max_running_requests=2047, context_len=32768 (same on all 4 ranks)
INFO: Started server process [672]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO: 127.0.0.1:48978 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/prefill.cuh: line 2128 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) (printed once per rank)
Exception in ModelTpServer:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward
    return self.forward_extend(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 304, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 272, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 240, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 192, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 141, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 149, in forward
    return self.extend_forward(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 91, in extend_forward_flashinfer
    o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 875, in forward
    return self._wrapper.forward(
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device
(the same traceback is then repeated by Exception in ModelTpServer and Exception in run_tp_server on each TP worker, and finally by Exception in ControllerSingle via loop_for_forward)
Killed
What can I do? Maybe it's because flashinfer doesn't support sm75?
It is possible. Can you try to disable flashinfer?
python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-flashinfer
With --disable-flashinfer on an RTX 3090, I get the same error.
But Qwen2-72B-Instruct-AWQ works fine.
Maybe because flashinfer doesn't support sm75?
FlashInfer currently supports sm75; see https://github.com/flashinfer-ai/flashinfer/pull/128
I get the same error on an RTX 3090 with flashinfer disabled
The RTX 3090's compute capability is sm86 (8.6), not sm75; see https://developer.nvidia.com/cuda-gpus
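For completeness, here is a quick way to confirm what your GPUs actually report, using standard PyTorch APIs (a minimal sketch):

import torch

# Report each visible GPU's compute capability: an RTX 2080 Ti should print (7, 5),
# i.e. sm75, while an RTX 3090 should print (8, 6), i.e. sm86.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm{major}{minor}")

# Architectures this PyTorch wheel was compiled for, e.g. ['sm_75', 'sm_80', ...].
# Note: this describes PyTorch's own kernels, not flashinfer's prebuilt wheel.
print("PyTorch arch list:", torch.cuda.get_arch_list())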
@bltcn Please paste the env info with
python3 -m sglang.check_env
@zhyncs
root@0bdb08439b5e:/sgl-workspace# python3 -m sglang.check_env
Python: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7,8,9: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.07 (same on all 10 GPUs)
PyTorch: 2.3.1+cu121
sglang: 0.2.0
flashinfer: 0.1.1+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.2
interegular: 0.3.3
packaging: 24.1
pillow: Module Not Found
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 26.0.3
vllm: 0.5.3.post1
openai: 1.37.0
anthropic: 0.31.2
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  GPU8  GPU9  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PIX   PIX   PIX   PIX   NODE  NODE  NODE  NODE  NV2   0-19,40-59    0              N/A
GPU1  PIX    X    PIX   NV2   PIX   NODE  NODE  NODE  NODE  SYS   0-19,40-59    0              N/A
GPU2  PIX   PIX    X    PIX   NV2   NODE  NODE  NODE  NODE  SYS   0-19,40-59    0              N/A
GPU3  PIX   NV2   PIX    X    PIX   NODE  NODE  NODE  NODE  SYS   0-19,40-59    0              N/A
GPU4  PIX   PIX   NV2   PIX    X    NODE  NODE  NODE  NODE  SYS   0-19,40-59    0              N/A
GPU5  NODE  NODE  NODE  NODE  NODE   X    NV2   PIX   PIX   SYS   0-19,40-59    0              N/A
GPU6  NODE  NODE  NODE  NODE  NODE  NV2    X    PIX   PIX   SYS   0-19,40-59    0              N/A
GPU7  NODE  NODE  NODE  NODE  NODE  PIX   PIX    X    NV2   SYS   0-19,40-59    0              N/A
GPU8  NODE  NODE  NODE  NODE  NODE  PIX   PIX   NV2    X    SYS   0-19,40-59    0              N/A
GPU9  NV2   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS    X    20-39,60-79   1              N/A
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
ulimit soft: 1048576
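Incidentally, this mostly PIX/NODE topology with only a few NV2 pairs is consistent with the launch-log warning that custom allreduce is disabled on PCIe-only GPU groups. To see which GPU pairs support direct peer-to-peer access, a minimal sketch using PyTorch's standard torch.cuda.can_device_access_peer API:

import torch

# Check pairwise peer-to-peer (P2P) accessibility between visible GPUs.
# Pairs connected only through the host bridge typically report False.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: P2P {'available' if ok else 'unavailable'}")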
It is possible. Can you try to disable flashinfer?
python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-flashinfer
something wrong like this:
INFO: 127.0.0.1:41710 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
Exception in ModelTpServer:
Traceback (most recent call last):
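(Aside: note that this failure is raised from FlashInfer's sampling.cuh even though --disable-flashinfer was passed. In these versions the sampling kernels apparently still come from FlashInfer; newer sglang builds expose a separate --disable-flashinfer-sampling flag, visible as disable_flashinfer_sampling in the server_args dump further below, and that combination is what the maintainers recommend at the end of this thread.)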
(sglang-env) liuys@gpu-33:~/sglang$ python3 -m sglang.check_env
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 555.42.06 555.42.06 555.42.06 555.42.06
PyTorch: 2.3.1+cu121
sglang: 0.2.6
flashinfer: 0.1.1+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.2
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 5.9.0
pydantic: 2.8.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 25.1.2
vllm: 0.5.3.post1
openai: 1.37.1
anthropic: 0.31.2
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PIX   PHB   PHB   0-15,32-47    0              N/A
GPU1  PIX    X    PHB   PHB   0-15,32-47    0              N/A
GPU2  PHB   PHB    X    PIX   0-15,32-47    0              N/A
GPU3  PHB   PHB   PIX    X    0-15,32-47    0              N/A
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
ulimit soft: 4096
(base) root@DESKTOP-O6DNFE1:/mnt/c/Windows/system32# python3 -m sglang.launch_server --model-path /mnt/e/Code/models/gemma-2-27b-it-gptq-4bit --quantization gptq --context-length 4096 --tp-size 2 --max-running-requests=1 --dtype=half --port 8000 --disable-cuda-graph --mem-fraction-static 0.85
server_args=ServerArgs(model_path='/mnt/e/Code/models/gemma-2-27b-it-gptq-4bit', tokenizer_path='/mnt/e/Code/models/gemma-2-27b-it-gptq-4bit', tokenizer_mode='auto', load_format='auto', dtype='half', trust_remote_code=False, context_length=4096, quantization='gptq', chat_template=None, host='127.0.0.1', port=8000, additional_ports=[8001, 8002, 8003, 8004], mem_fraction_static=0.85, max_prefill_tokens=None, max_running_requests=1, max_num_reqs=None, schedule_heuristic='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=819287122, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu_id=0] Init nccl begin.
[gpu_id=1] Init nccl begin.
[gpu_id=1] Load weight begin. avail mem=20.54 GB
[gpu_id=0] Load weight begin. avail mem=20.54 GB
WARNING 07-30 20:02:57 utils.py:569] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
WARNING 07-30 20:02:57 utils.py:569] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
WARNING 07-30 20:02:57 interfaces.py:131] The model (<class 'sglang.srt.models.gemma2.Gemma2ForCausalLM'>) contains all LoRA-specific attributes, but does not set `supports_lora=True`.
WARNING 07-30 20:02:57 interfaces.py:131] The model (<class 'sglang.srt.models.gemma2.Gemma2ForCausalLM'>) contains all LoRA-specific attributes, but does not set `supports_lora=True`.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [02:28<00:00, 148.39s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [02:28<00:00, 148.39s/it]
[gpu_id=1] Load weight end. type=Gemma2ForCausalLM, dtype=torch.float16, avail mem=12.78 GB
[gpu_id=0] Load weight end. type=Gemma2ForCausalLM, dtype=torch.float16, avail mem=12.78 GB
[gpu_id=0] Memory pool end. avail mem=2.97 GB
[gpu_id=1] Memory pool end. avail mem=2.97 GB
[gpu_id=0] max_total_num_tokens=55219, max_prefill_tokens=16384, max_running_requests=1, context_len=4096
[gpu_id=1] max_total_num_tokens=55219, max_prefill_tokens=16384, max_running_requests=1, context_len=4096
INFO: Started server process [2600]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:46012 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-6dxbtvfo/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/prefill.cuh: line 2128 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-6dxbtvfo/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/prefill.cuh: line 2128 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
Exception in ModelTpServer:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 190, in exposed_step
self.forward_step()
File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 206, in forward_step
self.forward_prefill_batch(new_batch)
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 449, in forward_prefill_batch
output = self.model_runner.forward(batch, ForwardMode.EXTEND)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 351, in forward
return self.forward_extend(batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 319, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 387, in forward
hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 331, in forward
hidden_states, residual = layer(
^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 270, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 212, in forward
attn_output = self.attn(q, k, v, input_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 152, in forward
return self.extend_forward(q, k, v, input_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 94, in extend_forward_flashinfer
o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/flashinfer/prefill.py", line 875, in forward
return self._wrapper.forward(
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device
Exception in ModelTpServer:
(The second tensor-parallel rank raised an identical traceback, ending in the same RuntimeError.)
Exception in run_tp_server:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 756, in run_tp_server
    model_server.exposed_step(recv_reqs)
  (remaining frames identical to the ModelTpServer traceback above)
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device
Exception in ControllerSingle:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/manager_single.py", line 151, in start_controller_process
    controller.loop_for_forward()
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/manager_single.py", line 88, in loop_for_forward
    out_pyobjs = self.tp_server.exposed_step(recv_reqs)
  (remaining frames identical to the ModelTpServer traceback above)
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device
Killed
Same error here, using 2x 2080 Ti 22G in WSL2 with NVLink.
Currently, the main development and testing of FlashInfer happen on sm80+ GPUs such as the A100 and H100. In theory, older compute capabilities are also supported (https://github.com/flashinfer-ai/flashinfer/pull/128), but because we lack a corresponding development environment and have not fully verified FlashInfer on devices like the 2080 Ti, we cannot guarantee it works there to the same degree as on A100/H100. For this situation, you can continue to follow https://github.com/flashinfer-ai/flashinfer/issues/402 and try the method mentioned at https://github.com/sgl-project/sglang/issues/831#issuecomment-2258924113: --disable-flashinfer --disable-flashinfer-sampling
This issue is closed for now and can be reopened later if needed. We apologize for any inconvenience caused and thank you for your understanding.
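For reference, combining both flags with the launch command used earlier in this thread would look like the following (the model path and --tp-size are taken from the reports above; adjust them to your setup):

python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-flashinfer --disable-flashinfer-sampling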
Checklist
Describe the bug
Qwen2-72B-Instruct-GPTQ-Int4 Can't run this server
Reproduction
root@4563caaa7539:/sgl-workspace# CUDA_VISIBLE_DEVICES=9,0,2,4 python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --quantization gptq --log-level INFO --enable-p2p-check --efficient-weight-load --host 0.0.0.0 --log-requests --show-time-cost --disable-disk-cache --enable-torch-compile --mem-fraction-static 0.6 --disable-cuda-graph --max-running-requests 64 --port 30000
server_args=ServerArgs(model_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_mode='auto', load_format='auto', dtype='half', trust_remote_code=True, context_length=None, quantization='gptq', chat_template=None, host='0.0.0.0', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.6, max_prefill_tokens=None, max_running_requests=64, schedule_heuristic='lpm', schedule_conservativeness=1.0, tp_size=4, stream_interval=1, random_seed=639225566, log_level='INFO', log_level_http=None, log_requests=True, show_time_cost=True, api_key='', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_disk_cache=True, enable_torch_compile=True, attention_reduce_in_fp32=False, enable_p2p_check=True, efficient_weight_load=True, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu_id=0] Init nccl begin.
[gpu_id=2] Init nccl begin.
[gpu_id=3] Init nccl begin.
[gpu_id=1] Init nccl begin.
WARNING 07-26 13:13:42 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 07-26 13:13:42 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 07-26 13:13:42 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 07-26 13:13:42 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[gpu_id=0] Load weight begin. avail mem=21.26 GB
[gpu_id=1] Load weight begin. avail mem=21.26 GB
[gpu_id=3] Load weight begin. avail mem=21.26 GB
[gpu_id=2] Load weight begin. avail mem=21.26 GB
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 9% Completed | 1/11 [00:01<00:11, 1.11s/it]
Loading safetensors checkpoint shards: 18% Completed | 2/11 [00:05<00:27, 3.08s/it]
Loading safetensors checkpoint shards: 27% Completed | 3/11 [00:10<00:32, 4.05s/it]
Loading safetensors checkpoint shards: 36% Completed | 4/11 [00:16<00:31, 4.54s/it]
Loading safetensors checkpoint shards: 45% Completed | 5/11 [00:20<00:27, 4.62s/it]
Loading safetensors checkpoint shards: 55% Completed | 6/11 [00:26<00:24, 4.94s/it]
[gpu_id=3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.45 GB
Loading safetensors checkpoint shards: 64% Completed | 7/11 [00:31<00:19, 4.90s/it]
Loading safetensors checkpoint shards: 73% Completed | 8/11 [00:33<00:12, 4.08s/it]
[gpu_id=2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.45 GB
Loading safetensors checkpoint shards: 82% Completed | 9/11 [00:35<00:06, 3.36s/it]
Loading safetensors checkpoint shards: 91% Completed | 10/11 [00:35<00:02, 2.44s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:35<00:00, 1.77s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:35<00:00, 3.27s/it]
[gpu_id=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.45 GB
[gpu_id=1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.45 GB
[gpu_id=3] Memory pool end. avail mem=8.07 GB
[gpu_id=2] Memory pool end. avail mem=8.07 GB
[gpu_id=0] Memory pool end. avail mem=8.07 GB
[gpu_id=1] Memory pool end. avail mem=8.07 GB
[gpu_id=0] max_total_num_tokens=38165, max_prefill_tokens=16384, max_running_requests=64, context_len=32768
[gpu_id=2] max_total_num_tokens=38165, max_prefill_tokens=16384, max_running_requests=64, context_len=32768
[gpu_id=1] max_total_num_tokens=38165, max_prefill_tokens=16384, max_running_requests=64, context_len=32768
[gpu_id=3] max_total_num_tokens=38165, max_prefill_tokens=16384, max_running_requests=64, context_len=32768
Initialization failed. warmup error: HTTPConnectionPool(host='0.0.0.0', port=30000): Max retries exceeded with url: /generate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f655bc34400>: Failed to establish a new connection: [Errno 111] Connection refused'))
Exception in thread Thread-1 (_wait_and_warmup):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 196, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 495, in _make_request
    conn.request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 398, in request
    self.endheaders()
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 236, in connect
    self.sock = self._new_conn()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 211, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f655bc34400>: Failed to establish a new connection: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='0.0.0.0', port=30000): Max retries exceeded with url: /generate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f655bc34400>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/server.py", line 356, in _wait_and_warmup
    raise e
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/server.py", line 339, in _wait_and_warmup
    res = requests.post(
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=30000): Max retries exceeded with url: /generate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f655bc34400>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/sglang/launch_server.py", line 14, in <module>
launch_server(server_args)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/server.py", line 309, in launch_server
uvicorn.run(
File "/usr/local/lib/python3.10/dist-packages/uvicorn/main.py", line 514, in run
config = Config(
File "/usr/local/lib/python3.10/dist-packages/uvicorn/config.py", line 273, in init
self.configure_logging()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/config.py", line 385, in configure_logging
log_level = LOG_LEVELS[self.log_level]
KeyError: 'INFO'
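(Judging from this traceback, uvicorn's LOG_LEVELS lookup is case-sensitive, so the uppercase --log-level INFO in the command above is likely what triggers the KeyError; lowercase --log-level info should avoid it.)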
Environment