sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] Can't run server on sm75 #748

Closed. bltcn closed this issue 3 months ago.

bltcn commented 3 months ago

Checklist

Describe the bug

Can't run the server with Qwen2-72B-Instruct-GPTQ-Int4.

Reproduction

root@4563caaa7539:/sgl-workspace# CUDA_VISIBLE_DEVICES=9,0,2,4 python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --quantization gptq --log-level INFO --enable-p2p-check --efficient-weight-load --host 0.0.0.0 --log-requests --show-time-cost --disable-disk-cache --enable-torch-compile --mem-fraction-static 0.6 --disable-cuda-graph --max-running-requests 64 --port 30000
server_args=ServerArgs(model_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_mode='auto', load_format='auto', dtype='half', trust_remote_code=True, context_length=None, quantization='gptq', chat_template=None, host='0.0.0.0', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.6, max_prefill_tokens=None, max_running_requests=64, schedule_heuristic='lpm', schedule_conservativeness=1.0, tp_size=4, stream_interval=1, random_seed=639225566, log_level='INFO', log_level_http=None, log_requests=True, show_time_cost=True, api_key='', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_disk_cache=True, enable_torch_compile=True, attention_reduce_in_fp32=False, enable_p2p_check=True, efficient_weight_load=True, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu_id=0] Init nccl begin.
[gpu_id=2] Init nccl begin.
[gpu_id=3] Init nccl begin.
[gpu_id=1] Init nccl begin.
WARNING 07-26 13:13:42 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. (printed once per TP rank; three identical repetitions omitted)
[gpu_id=0] Load weight begin. avail mem=21.26 GB
[gpu_id=1] Load weight begin. avail mem=21.26 GB
[gpu_id=3] Load weight begin. avail mem=21.26 GB
[gpu_id=2] Load weight begin. avail mem=21.26 GB
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
(intermediate shard-loading progress lines omitted)
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:35<00:00, 3.27s/it]
[gpu_id=3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.45 GB
[gpu_id=2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.45 GB

[gpu_id=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.45 GB
[gpu_id=1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.45 GB
[gpu_id=3] Memory pool end. avail mem=8.07 GB
[gpu_id=2] Memory pool end. avail mem=8.07 GB
[gpu_id=0] Memory pool end. avail mem=8.07 GB
[gpu_id=1] Memory pool end. avail mem=8.07 GB
[gpu_id=0] max_total_num_tokens=38165, max_prefill_tokens=16384, max_running_requests=64, context_len=32768
[gpu_id=2] max_total_num_tokens=38165, max_prefill_tokens=16384, max_running_requests=64, context_len=32768
[gpu_id=1] max_total_num_tokens=38165, max_prefill_tokens=16384, max_running_requests=64, context_len=32768
[gpu_id=3] max_total_num_tokens=38165, max_prefill_tokens=16384, max_running_requests=64, context_len=32768
Initialization failed. warmup error: HTTPConnectionPool(host='0.0.0.0', port=30000): Max retries exceeded with url: /generate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f655bc34400>: Failed to establish a new connection: [Errno 111] Connection refused'))
Exception in thread Thread-1 (_wait_and_warmup):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 196, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 495, in _make_request
    conn.request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 398, in request
    self.endheaders()
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 236, in connect
    self.sock = self._new_conn()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 211, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f655bc34400>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='0.0.0.0', port=30000): Max retries exceeded with url: /generate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f655bc34400>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/server.py", line 356, in _wait_and_warmup
    raise e
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/server.py", line 339, in _wait_and_warmup
    res = requests.post(
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=30000): Max retries exceeded with url: /generate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f655bc34400>: Failed to establish a new connection: [Errno 111] Connection refused'))

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/server.py", line 309, in launch_server
    uvicorn.run(
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/main.py", line 514, in run
    config = Config(
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/config.py", line 273, in __init__
    self.configure_logging()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/config.py", line 385, in configure_logging
    log_level = LOG_LEVELS[self.log_level]
KeyError: 'INFO'
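
Note: the server process appears to die before it ever binds port 30000 (hence the warmup "connection refused"), and the final traceback shows why: uvicorn's LOG_LEVELS table uses lowercase keys, so --log-level INFO raises KeyError: 'INFO'. Passing the level in lowercase should avoid this particular crash, for example:

python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --log-level info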

Environment

2× 6133 CPUs, 512 GB RAM, 10× RTX 2080 Ti (22 GB)
Rocky Linux 8.8
Docker 26.1.3
lmsysorg/sglang:v0.2.0-cu124
pip install -U "sglang[all]" -i https://pypi.tuna.tsinghua.edu.cn/simple
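
Since the title pins this to sm75, it may be worth confirming the compute capability the container actually sees; a 2080 Ti should report (7, 5), i.e. sm75. A minimal check using the PyTorch that ships in the image:

python3 -c "import torch; print(torch.cuda.get_device_capability())"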
merrymercy commented 3 months ago
python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4

This should be enough to launch the server. Other arguments are not required. I tested it on my machine and it works well.
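
Once the server is up, a quick sanity check against the /generate endpoint (the same route the warmup hits) might look like the following; the prompt and sampling parameters are only illustrative:

curl http://127.0.0.1:30000/generate -H "Content-Type: application/json" -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'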

bltcn commented 3 months ago

Sorry, it still fails. Note that I am running on 4× 2080 Ti. I used your command and got this:

root@b1107d56399c:/sgl-workspace# python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4
server_args=ServerArgs(model_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_mode='auto', load_format='auto', dtype='half', trust_remote_code=True, context_length=None, quantization=None, chat_template=None, host='127.0.0.1', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.86, max_prefill_tokens=None, max_running_requests=None, schedule_heuristic='lpm', schedule_conservativeness=1.0, tp_size=4, stream_interval=1, random_seed=41592065, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_torch_compile=False, attention_reduce_in_fp32=False, enable_p2p_check=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu_id=0] Init nccl begin.
[gpu_id=2] Init nccl begin.
[gpu_id=1] Init nccl begin.
[gpu_id=3] Init nccl begin.
WARNING 07-27 14:04:50 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. (printed once per TP rank; three identical repetitions omitted)
[gpu_id=2] Load weight begin. avail mem=21.39 GB
[gpu_id=0] Load weight begin. avail mem=21.39 GB
[gpu_id=1] Load weight begin. avail mem=21.34 GB
[gpu_id=3] Load weight begin. avail mem=21.34 GB
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
(intermediate shard-loading progress lines omitted)
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:42<00:00, 3.85s/it]
[gpu_id=2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.57 GB
[gpu_id=3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.52 GB
[gpu_id=1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.52 GB

[gpu_id=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.57 GB
[gpu_id=3] Memory pool end. avail mem=2.52 GB
[gpu_id=0] Memory pool end. avail mem=2.57 GB
[gpu_id=2] Memory pool end. avail mem=2.57 GB
[gpu_id=1] Memory pool end. avail mem=2.52 GB
[gpu_id=0] Capture cuda graph begin. This can take up to several minutes.
[gpu_id=1] Capture cuda graph begin. This can take up to several minutes.
[gpu_id=2] Capture cuda graph begin. This can take up to several minutes.
[gpu_id=3] Capture cuda graph begin. This can take up to several minutes.
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/handler.cuh: line 170 at function cudaOccupancyMaxActiveBlocksPerMultiprocessor( &num_blocks_per_sm, partition_kv_kernel, num_threads, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/handler.cuh: line 325 at function work_estimation_func(split_kv, max_grid_size, max_num_pages_per_batch, new_batch_size, batch_size, indptr_h, num_qo_heads, pagesize, IsCUDAGraphEnabled(), stream)
(this pair of CUDA errors repeats once per TP rank; three identical repetitions omitted)
Initialization failed.
controller_init_state: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 262, in init_cuda_graphs
    self.cuda_graph_runner.capture(batch_size_list)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/cuda_graph_runner.py", line 114, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/cuda_graph_runner.py", line 149, in capture_one_batch_size
    init_flashinfer_args(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/infer_batch.py", line 884, in init_flashinfer_args
    flashinfer_decode_wrapper.begin_forward(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/decode.py", line 514, in begin_forward
    self._wrapper.begin_forward(
RuntimeError: BatchDecodeWithPagedKVCache failed with error no kernel image is available for execution on the device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/manager_single.py", line 135, in start_controller_process
    controller = ControllerSingle(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/manager_single.py", line 69, in __init__
    self.tp_server = ModelTpServer(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 72, in __init__
    self.model_runner = ModelRunner(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 103, in __init__
    self.init_cuda_graphs()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 264, in init_cuda_graphs
    raise Exception(
Exception: Capture cuda graph failed: BatchDecodeWithPagedKVCache failed with error no kernel image is available for execution on the device. Possible solutions:

  1. disable cuda graph by --disable-cuda-graph
  2. set --mem-fraction-static to a smaller value

Open an issue on GitHub with reproducible scripts if you need help.

Initialization failed. detoken_init_state: init ok

(The same "Capture cuda graph failed: BatchDecodeWithPagedKVCache failed with error no kernel image is available for execution on the device" traceback, including the two suggested solutions, then repeats verbatim for each of the four TP worker processes.)

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
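
For reference, applying both of the suggested mitigations at once would look something like the line below; the 0.7 value is only an illustrative smaller fraction, not a tested setting:

python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-cuda-graph --mem-fraction-static 0.7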

Then I added --disable-cuda-graph and got this:

root@b1107d56399c:/sgl-workspace# python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-cuda-graph
server_args=ServerArgs(model_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_path='/root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer_mode='auto', load_format='auto', dtype='half', trust_remote_code=True, context_length=None, quantization=None, chat_template=None, host='127.0.0.1', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.86, max_prefill_tokens=None, max_running_requests=None, schedule_heuristic='lpm', schedule_conservativeness=1.0, tp_size=4, stream_interval=1, random_seed=614925003, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_disk_cache=False, enable_torch_compile=False, attention_reduce_in_fp32=False, enable_p2p_check=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu_id=0] Init nccl begin.
[gpu_id=1] Init nccl begin.
[gpu_id=2] Init nccl begin.
[gpu_id=3] Init nccl begin.
WARNING 07-27 14:06:21 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. (printed once per TP rank; three identical repetitions omitted)
[gpu_id=0] Load weight begin. avail mem=21.39 GB
[gpu_id=3] Load weight begin. avail mem=21.34 GB
[gpu_id=1] Load weight begin. avail mem=21.34 GB
[gpu_id=2] Load weight begin. avail mem=21.39 GB
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
(intermediate shard-loading progress lines omitted)
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:34<00:00, 3.16s/it]
[gpu_id=2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.57 GB
[gpu_id=3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.52 GB
[gpu_id=1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.52 GB

[gpu_id=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=11.57 GB
[gpu_id=1] Memory pool end. avail mem=2.52 GB
[gpu_id=3] Memory pool end. avail mem=2.52 GB
[gpu_id=0] Memory pool end. avail mem=2.57 GB
[gpu_id=2] Memory pool end. avail mem=2.57 GB
[gpu_id=2] max_total_num_tokens=111822, max_prefill_tokens=16384, max_running_requests=2047, context_len=32768
[gpu_id=0] max_total_num_tokens=111822, max_prefill_tokens=16384, max_running_requests=2047, context_len=32768
[gpu_id=1] max_total_num_tokens=111822, max_prefill_tokens=16384, max_running_requests=2047, context_len=32768
[gpu_id=3] max_total_num_tokens=111822, max_prefill_tokens=16384, max_running_requests=2047, context_len=32768
INFO: Started server process [672]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO: 127.0.0.1:48978 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/prefill.cuh: line 2128 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
(the same CUDA error repeats once per TP rank; three identical repetitions omitted)
Exception in ModelTpServer:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward
    return self.forward_extend(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 304, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 272, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 240, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 192, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 141, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 149, in forward
    return self.extend_forward(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 91, in extend_forward_flashinfer
    o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 875, in forward
    return self._wrapper.forward(
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in ModelTpServer: Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step self.forward_step() File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step self.forward_prefill_batch(new_batch) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch output = self.model_runner.forward(batch, ForwardMode.EXTEND) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward return self.forward_extend(batch) File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 304, in forward_extend return self.model.forward( File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 272, in forward hidden_states = self.model(input_ids, positions, input_metadata, input_embeds) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 240, in forward hidden_states, residual = layer( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 192, in forward hidden_states = self.self_attn( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 141, in forward attn_output = self.attn(q, k, v, input_metadata) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 149, in forward return self.extend_forward(q, k, v, input_metadata) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 91, in extend_forward_flashinfer o = input_metadata.flashinfer_prefill_wrapper_paged.forward( File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 875, in forward return self._wrapper.forward( RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in run_tp_server: Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 750, in run_tp_server model_server.exposed_step(recv_reqs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step self.forward_step() File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step self.forward_prefill_batch(new_batch) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch output = self.model_runner.forward(batch, ForwardMode.EXTEND) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward return self.forward_extend(batch) File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 304, in forward_extend return self.model.forward( File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 272, in forward hidden_states = self.model(input_ids, positions, input_metadata, input_embeds) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 240, in forward hidden_states, residual = layer( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 192, in forward hidden_states = self.self_attn( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 141, in forward attn_output = self.attn(q, k, v, input_metadata) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 149, in forward return self.extend_forward(q, k, v, input_metadata) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 91, in extend_forward_flashinfer o = input_metadata.flashinfer_prefill_wrapper_paged.forward( File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 875, in forward return 
self._wrapper.forward( RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in run_tp_server: Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 750, in run_tp_server model_server.exposed_step(recv_reqs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step self.forward_step() File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step self.forward_prefill_batch(new_batch) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch output = self.model_runner.forward(batch, ForwardMode.EXTEND) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward return self.forward_extend(batch) File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 304, in forward_extend return self.model.forward( File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 272, in forward hidden_states = self.model(input_ids, positions, input_metadata, input_embeds) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 240, in forward hidden_states, residual = layer( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 192, in forward hidden_states = self.self_attn( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 141, in forward attn_output = self.attn(q, k, v, input_metadata) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 149, in forward return self.extend_forward(q, k, v, input_metadata) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 91, in extend_forward_flashinfer o = input_metadata.flashinfer_prefill_wrapper_paged.forward( File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 875, in forward return 
self._wrapper.forward( RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in ModelTpServer: Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step self.forward_step() File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step self.forward_prefill_batch(new_batch) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch output = self.model_runner.forward(batch, ForwardMode.EXTEND) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward return self.forward_extend(batch) File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 304, in forward_extend return self.model.forward( File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 272, in forward hidden_states = self.model(input_ids, positions, input_metadata, input_embeds) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 240, in forward hidden_states, residual = layer( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 192, in forward hidden_states = self.self_attn( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 141, in forward attn_output = self.attn(q, k, v, input_metadata) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 149, in forward return self.extend_forward(q, k, v, input_metadata) File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 91, in extend_forward_flashinfer o = input_metadata.flashinfer_prefill_wrapper_paged.forward( File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 875, in forward return self._wrapper.forward( RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in run_tp_server:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 750, in run_tp_server
    model_server.exposed_step(recv_reqs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward
    return self.forward_extend(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 304, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 272, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 240, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 192, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 141, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 149, in forward
    return self.extend_forward(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 91, in extend_forward_flashinfer
    o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 875, in forward
    return self._wrapper.forward(
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in ModelTpServer:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward
    return self.forward_extend(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 304, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 272, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 240, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 192, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 141, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 149, in forward
    return self.extend_forward(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 91, in extend_forward_flashinfer
    o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 875, in forward
    return self._wrapper.forward(
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in ControllerSingle:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/manager_single.py", line 151, in start_controller_process
    controller.loop_for_forward()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/manager_single.py", line 88, in loop_for_forward
    out_pyobjs = self.tp_server.exposed_step(recv_reqs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward
    return self.forward_extend(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/controller/model_runner.py", line 304, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 272, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 240, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 192, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py", line 141, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 149, in forward
    return self.extend_forward(q, k, v, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/radix_attention.py", line 91, in extend_forward_flashinfer
    o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 875, in forward
    return self._wrapper.forward(
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Killed

What can I do? Maybe it's because flashinfer doesn't support sm75?

merrymercy commented 3 months ago

It is possible. Can you try to disable flashinfer?

python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-flashinfer
bravelll commented 3 months ago

--disable-flashinfer: I used an RTX 3090 with flashinfer disabled and got the same error.

bravelll commented 3 months ago

But Qwen2-72B-Instruct-AWQ works fine.

zhyncs commented 3 months ago

Maybe it's because flashinfer doesn't support sm75?

Currently FlashInfer supports sm75, ref https://github.com/flashinfer-ai/flashinfer/pull/128

zhyncs commented 3 months ago

I used an RTX 3090 with flashinfer disabled and got the same error.

The RTX 3090's compute capability is sm86, not sm75. ref https://developer.nvidia.com/cuda-gpus
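
If you want to confirm what a card reports, here is a minimal check with PyTorch (a sketch; it assumes torch with CUDA is installed, as in the environments above):

import torch

# Print the compute capability of every visible GPU (sm75 corresponds to (7, 5)).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm{major}{minor}")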

zhyncs commented 3 months ago

@bltcn Please paste the env info with python3 -m sglang.check_env

bltcn commented 3 months ago

@bltcn Please paste the env info with python3 -m sglang.check_env

@zhyncs
root@0bdb08439b5e:/sgl-workspace# python3 -m sglang.check_env
Python: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7,8,9: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.07 550.90.07 550.90.07 550.90.07 550.90.07 550.90.07 550.90.07 550.90.07 550.90.07 550.90.07
PyTorch: 2.3.1+cu121
sglang: 0.2.0
flashinfer: 0.1.1+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.2
interegular: 0.3.3
packaging: 24.1
pillow: Module Not Found
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 26.0.3
vllm: 0.5.3.post1
openai: 1.37.0
anthropic: 0.31.2
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  GPU8  GPU9  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     PIX   PIX   PIX   PIX   NODE  NODE  NODE  NODE  NV2   0-19,40-59    0              N/A
GPU1  PIX   X     PIX   NV2   PIX   NODE  NODE  NODE  NODE  SYS   0-19,40-59    0              N/A
GPU2  PIX   PIX   X     PIX   NV2   NODE  NODE  NODE  NODE  SYS   0-19,40-59    0              N/A
GPU3  PIX   NV2   PIX   X     PIX   NODE  NODE  NODE  NODE  SYS   0-19,40-59    0              N/A
GPU4  PIX   PIX   NV2   PIX   X     NODE  NODE  NODE  NODE  SYS   0-19,40-59    0              N/A
GPU5  NODE  NODE  NODE  NODE  NODE  X     NV2   PIX   PIX   SYS   0-19,40-59    0              N/A
GPU6  NODE  NODE  NODE  NODE  NODE  NV2   X     PIX   PIX   SYS   0-19,40-59    0              N/A
GPU7  NODE  NODE  NODE  NODE  NODE  PIX   PIX   X     NV2   SYS   0-19,40-59    0              N/A
GPU8  NODE  NODE  NODE  NODE  NODE  PIX   PIX   NV2   X     SYS   0-19,40-59    0              N/A
GPU9  NV2   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     20-39,60-79   1              N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576

bltcn commented 3 months ago

It is possible. Can you try to disable flashinfer?

python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-flashinfer

Something goes wrong, like this:

INFO:     127.0.0.1:41710 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
Exception in ModelTpServer: Traceback (most recent call last):
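
Note that these errors come from flashinfer's sampling.cuh rather than the attention kernels, so --disable-flashinfer by itself still leaves FlashInfer's sampling path active. On sglang versions that expose the sampling switch (it appears in the server args of the Gemma 2 run later in this thread), the combined workaround would look like the sketch below; this is an assumption based on the flag names, not a verified fix:

python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-flashinfer --disable-flashinfer-sampling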

lys791227 commented 3 months ago

It is possible. Can you try to disable flashinfer?

python3 -m sglang.launch_server --model-path /root/hf_model/Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --dtype half --trust-remote-code --tp-size 4 --disable-flashinfer

Something goes wrong, like this:

INFO:     127.0.0.1:41710 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-akw2qk94/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/sampling.cuh: line 544 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
Exception in ModelTpServer: Traceback (most recent call last):

(sglang-env) liuys@gpu-33:~/sglang$ python3 -m sglang.check_env
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 555.42.06 555.42.06 555.42.06 555.42.06
PyTorch: 2.3.1+cu121
sglang: 0.2.6
flashinfer: 0.1.1+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.2
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 5.9.0
pydantic: 2.8.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 25.1.2
vllm: 0.5.3.post1
openai: 1.37.1
anthropic: 0.31.2
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     PIX   PHB   PHB   0-15,32-47    0              N/A
GPU1  PIX   X     PHB   PHB   0-15,32-47    0              N/A
GPU2  PHB   PHB   X     PIX   0-15,32-47    0              N/A
GPU3  PHB   PHB   PIX   X     0-15,32-47    0              N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 4096

zhyncs commented 3 months ago

ref https://github.com/flashinfer-ai/flashinfer/issues/402
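
The linked issue tracks prebuilt FlashInfer wheels shipping without sm75 kernel images. One possible route, not verified here, is building FlashInfer from source so the kernels are compiled for your device's architecture. A rough sketch, assuming the python/ package layout of flashinfer 0.1.x and that the build honors PyTorch's TORCH_CUDA_ARCH_LIST convention:

git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer/python
TORCH_CUDA_ARCH_LIST="7.5" pip install -e .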

HelloCard commented 3 months ago
(base) root@DESKTOP-O6DNFE1:/mnt/c/Windows/system32# python3 -m sglang.launch_server --model-path /mnt/e/Code/models/gemma-2-27b-it-gptq-4bit --quantization gptq --context-length 4096 --tp-size 2 --max-running-requests=1 --dtype=half --port 8000  --disable-cuda-graph --mem-fraction-static 0.85
server_args=ServerArgs(model_path='/mnt/e/Code/models/gemma-2-27b-it-gptq-4bit', tokenizer_path='/mnt/e/Code/models/gemma-2-27b-it-gptq-4bit', tokenizer_mode='auto', load_format='auto', dtype='half', trust_remote_code=False, context_length=4096, quantization='gptq', chat_template=None, host='127.0.0.1', port=8000, additional_ports=[8001, 8002, 8003, 8004], mem_fraction_static=0.85, max_prefill_tokens=None, max_running_requests=1, max_num_reqs=None, schedule_heuristic='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=819287122, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu_id=0] Init nccl begin.
[gpu_id=1] Init nccl begin.
[gpu_id=1] Load weight begin. avail mem=20.54 GB
[gpu_id=0] Load weight begin. avail mem=20.54 GB
WARNING 07-30 20:02:57 utils.py:569] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
WARNING 07-30 20:02:57 utils.py:569] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
WARNING 07-30 20:02:57 interfaces.py:131] The model (<class 'sglang.srt.models.gemma2.Gemma2ForCausalLM'>) contains all LoRA-specific attributes, but does not set `supports_lora=True`.
WARNING 07-30 20:02:57 interfaces.py:131] The model (<class 'sglang.srt.models.gemma2.Gemma2ForCausalLM'>) contains all LoRA-specific attributes, but does not set `supports_lora=True`.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [02:28<00:00, 148.39s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [02:28<00:00, 148.39s/it]

[gpu_id=1] Load weight end. type=Gemma2ForCausalLM, dtype=torch.float16, avail mem=12.78 GB
[gpu_id=0] Load weight end. type=Gemma2ForCausalLM, dtype=torch.float16, avail mem=12.78 GB
[gpu_id=0] Memory pool end. avail mem=2.97 GB
[gpu_id=1] Memory pool end. avail mem=2.97 GB
[gpu_id=0] max_total_num_tokens=55219, max_prefill_tokens=16384, max_running_requests=1, context_len=4096
[gpu_id=1] max_total_num_tokens=55219, max_prefill_tokens=16384, max_running_requests=1, context_len=4096
INFO:     Started server process [2600]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:46012 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-6dxbtvfo/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/prefill.cuh: line 2128 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
CUDA Error: no kernel image is available for execution on the device (209) /tmp/build-via-sdist-6dxbtvfo/flashinfer-0.1.1+cu121torch2.3/include/flashinfer/attention/prefill.cuh: line 2128 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
Exception in ModelTpServer:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 190, in exposed_step
    self.forward_step()
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 206, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 449, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 351, in forward
    return self.forward_extend(batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 319, in forward_extend
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 387, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 331, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 270, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 212, in forward
    attn_output = self.attn(q, k, v, input_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 152, in forward
    return self.extend_forward(q, k, v, input_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 94, in extend_forward_flashinfer
    o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/flashinfer/prefill.py", line 875, in forward
    return self._wrapper.forward(
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in ModelTpServer:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 190, in exposed_step
    self.forward_step()
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 206, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 449, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 351, in forward
    return self.forward_extend(batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 319, in forward_extend
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 387, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 331, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 270, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 212, in forward
    attn_output = self.attn(q, k, v, input_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 152, in forward
    return self.extend_forward(q, k, v, input_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 94, in extend_forward_flashinfer
    o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/flashinfer/prefill.py", line 875, in forward
    return self._wrapper.forward(
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in run_tp_server:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 756, in run_tp_server
    model_server.exposed_step(recv_reqs)
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 190, in exposed_step
    self.forward_step()
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 206, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 449, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 351, in forward
    return self.forward_extend(batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 319, in forward_extend
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 387, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 331, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 270, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 212, in forward
    attn_output = self.attn(q, k, v, input_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 152, in forward
    return self.extend_forward(q, k, v, input_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 94, in extend_forward_flashinfer
    o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/flashinfer/prefill.py", line 875, in forward
    return self._wrapper.forward(
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Exception in ControllerSingle:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/manager_single.py", line 151, in start_controller_process
    controller.loop_for_forward()
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/manager_single.py", line 88, in loop_for_forward
    out_pyobjs = self.tp_server.exposed_step(recv_reqs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 190, in exposed_step
    self.forward_step()
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 206, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 449, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 351, in forward
    return self.forward_extend(batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/model_runner.py", line 319, in forward_extend
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 387, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 331, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 270, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/models/gemma2.py", line 212, in forward
    attn_output = self.attn(q, k, v, input_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 152, in forward
    return self.extend_forward(q, k, v, input_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 94, in extend_forward_flashinfer
    o = input_metadata.flashinfer_prefill_wrapper_paged.forward(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/flashinfer/prefill.py", line 875, in forward
    return self._wrapper.forward(
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

Killed

Same error here, using two 2080 Ti 22G cards with NVLink in WSL2.

zhyncs commented 3 months ago

Currently, FlashInfer development and testing happen primarily on sm80+ GPUs such as the A100 and H100. Theoretically, older compute capabilities are also supported (https://github.com/flashinfer-ai/flashinfer/pull/128), but since we lack a corresponding development environment and have not fully verified devices like the 2080 Ti, we cannot make the same guarantees as for the A100 and H100. For this situation, you can continue to follow https://github.com/flashinfer-ai/flashinfer/issues/402 and try the method mentioned at https://github.com/sgl-project/sglang/issues/831#issuecomment-2258924113: --disable-flashinfer --disable-flashinfer-sampling. This issue is closed for now and can be reopened later if needed. We apologize for any inconvenience and thank you for your understanding.
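
If a server does come up with those flags, a quick smoke test against sglang's native /generate endpoint can confirm that prefill and sampling actually run. A minimal sketch, assuming the host and port from the original command:

import requests

# Send one short generation request to the locally running sglang server.
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 16}},
)
print(resp.json())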