sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Unable to use gptq or awq with torch.compile (8*A40) #1522

Open smallstepman opened 3 weeks ago

smallstepman commented 3 weeks ago

Describe the bug

I can't use --enable-torch-compile together with --dp: server startup always fails with either an OOM or a not-enough-memory error (see the two examples below). I deliberately picked one of the smallest models (0.5B) and a GPU with plenty of VRAM (the A40 has 48 GB), and it still doesn't work.

Happy to help hunt this down.
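
To narrow down which flag combination triggers the failure, a small helper along these lines could be used. This is only a sketch and not part of SGLang: it prints the launch commands for every combination of --dp, --enable-torch-compile and --disable-cuda-graph and does not start any server. The function name build_commands and the choice of --dp 1 as the baseline are assumptions made for illustration; all flags are taken from the command and the "Possible solutions" hint in this report.

```python
import itertools
import shlex

# Flags shared by every run, copied from the reproduction command in this report.
BASE = [
    "python", "-m", "sglang.launch_server",
    "--host", "0.0.0.0", "--port", "30000",
    "--model-path", "Qwen/Qwen2.5-0.5B-Instruct-AWQ",
    "--mem-fraction-static", "0.05",
]


def build_commands(dp_sizes=(1, 8)):
    """Return (label, argv) pairs covering dp x torch.compile x CUDA graph.

    Hypothetical helper: dp=1 is assumed to be a sensible baseline.
    """
    commands = []
    for dp, compile_on, graph_off in itertools.product(
        dp_sizes, (False, True), (False, True)
    ):
        argv = list(BASE) + ["--dp", str(dp)]
        if compile_on:
            argv.append("--enable-torch-compile")
        if graph_off:
            argv.append("--disable-cuda-graph")
        label = f"dp={dp} compile={compile_on} cuda_graph={not graph_off}"
        commands.append((label, argv))
    return commands


if __name__ == "__main__":
    # Only print the commands; run each one by hand and note which ones fail.
    for label, argv in build_commands():
        print(f"# {label}")
        print(shlex.join(argv))
```

Running each printed command separately and recording which ones fail should show whether the problem depends on --dp at all or already appears with a single worker.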

Reproduction

1

root@c670148f30c4:~# python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen2.5-0.5B-Instruct-AWQ --dp 8 --enable-p2p-check --mem-fraction-static 0.05 --enable-torch-compile
[16:52:58] server_args=ServerArgs(model_path='Qwen/Qwen2.5-0.5B-Instruct-AWQ', tokenizer_path='Qwen/Qwen2.5-0.5B-Instruct-AWQ', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='Qwen/Qwen2.5-0.5B-Instruct-AWQ', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, additional_ports=[30001, 30002, 30003, 30004, 30005, 30006, 30007, 30008, 30009, 30010, 30011], mem_fraction_static=0.05, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=795931686, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=8, load_balance_method='round_robin', nccl_init_addr=None, nnodes=1, node_rank=None, json_model_override_args='{}', attention_backend='flashinfer', sampling_backend='flashinfer', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, enable_mixed_chunk=False, enable_torch_compile=True, max_torch_compile_bs=32, torchao_config='', enable_p2p_check=True, triton_attention_reduce_in_fp32=False, lora_paths=None, max_loras_per_batch=8)
[16:53:00 DP0 TP0] Init nccl begin.
[16:53:00 DP0 TP0] Load weight begin. avail mem=44.09 GB
INFO 09-26 16:53:00 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[16:53:02 DP0 TP0] lm_eval is not installed, GPTQ may not be usable
INFO 09-26 16:53:02 weight_utils.py:236] Using model weights format ['*.safetensors']
INFO 09-26 16:53:02 weight_utils.py:280] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.94it/s]

[16:53:03 DP0 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=43.33 GB
[16:53:03 DP0 TP0] Memory pool end. avail mem=41.58 GB
[16:53:03 DP0 TP0] Capture cuda graph begin. This can take up to several minutes.
Process Process-1:1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 151, in __init__
    self.capture()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 180, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 221, in capture_one_batch_size
    run_once()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 215, in run_once
    return forward(input_ids, input_metadata.positions, input_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py", line 433, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 1116, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 948, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 472, in __call__
    return _compile(
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_utils_internal.py", line 84, in wrapper_function
    return StrobelightCompileTimeProfiler.profile_compile_time(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_strobelight/compile_time_profiler.py", line 129, in profile_compile_time
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 817, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 636, in compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1185, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 178, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 582, in transform
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2451, in run
    super().run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 1500, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/lazy.py", line 132, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
...
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/linear.py", line 375, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 262, in apply
    return apply_awq_marlin_linear(
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 289, in apply_awq_marlin_linear
    output = ops.gptq_marlin_gemm(reshaped_x,
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 28, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 317, in gptq_marlin_gemm
    return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/dist-packages/sglang/launch_server.py", line 16, in <module>
    raise e
  File "/usr/local/lib/python3.11/dist-packages/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/server.py", line 373, in launch_server
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 195, in start_controller_process
    controller = ControllerMulti(server_args, port_args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 98, in __init__
    self.start_dp_worker(i)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 125, in start_dp_worker
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 151, in __init__
    self.capture()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 180, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 221, in capture_one_batch_size
    run_once()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 215, in run_once
    return forward(input_ids, input_metadata.positions, input_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py", line 433, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 1116, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 948, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 472, in __call__
    return _compile(
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_utils_internal.py", line 84, in wrapper_function
    return StrobelightCompileTimeProfiler.profile_compile_time(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_strobelight/compile_time_profiler.py", line 129, in profile_compile_time
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 817, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
...
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 1500, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 344, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/nn_module.py", line 437, in call_function
    return tx.inline_user_function_return(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 1500, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 344, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 344, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 1500, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/torch.py", line 757, in call_function
    tensor_variable = wrap_fx_proxy(
                      ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/builder.py", line 1713, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/builder.py", line 1798, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1853, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1785, in get_fake_value
    ret_val = wrap_fake_exception(
              ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1300, in wrap_fake_exception
    return fn()
           ^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1786, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1921, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1903, in run_node
    return node.target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 1060, in __call__
    return _call_overload_packet_from_python(self_, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 1098, in _call_overload_packet_from_python
    return found_op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 900, in __call__
    return self_._dispatch_in_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 940, in _dispatch_in_python
    return handler(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 746, in handler
    return torch._library.utils.handle_dispatch_mode(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_library/utils.py", line 244, in handle_dispatch_mode
    return curr_mode.__torch_dispatch__(op_overload, overload_types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_stats.py", line 21, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_subclasses/fake_tensor.py", line 1061, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_subclasses/fake_tensor.py", line 1450, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_subclasses/fake_tensor.py", line 1153, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_subclasses/fake_tensor.py", line 1694, in _dispatch_impl
    r = func.decompose(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 704, in decompose
    return self._op_dk(dk, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.TorchRuntimeError: Failed running call_function _C.gptq_marlin_gemm(*(FakeTensor(..., device='cuda:0', size=(1, 896), dtype=torch.float16), Parameter(FakeTensor(..., device='cuda:0', size=(56, 2304), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(7, 1152), dtype=torch.float16)), Parameter(FakeTensor(..., device='cuda:0', size=(7, 144), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(0,), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(0,), dtype=torch.int32)), FakeTensor(..., device='cuda:0', size=(288,), dtype=torch.int32), <torch._library.fake_class_registry.FakeScriptObject object at 0x7f8e2d214590>, 1, 1152, 896, True, True, True), **{}):
_C::gptq_marlin_gemm() Expected a value of type '__torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0)' for argument '_7' but instead found type 'FakeScriptObject'.
Position: 7
Value: <torch._library.fake_class_registry.FakeScriptObject object at 0x7f8e2d214590>
Declaration: _C::gptq_marlin_gemm(Tensor _0, Tensor _1, Tensor _2, Tensor _3, Tensor _4, Tensor _5, Tensor _6, __torch__.torch.classes._core_C.ScalarType _7, int _8, int _9, int _10, bool _11, bool _12, bool _13) -> Tensor _0
Cast error details: Tried to cast object to type __torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0) but object was missing attribute capsule

from user code:
   File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/external_utils.py", line 38, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 290, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 256, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 208, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 154, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/linear.py", line 375, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 262, in apply
    return apply_awq_marlin_linear(
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 289, in apply_awq_marlin_linear
    output = ops.gptq_marlin_gemm(reshaped_x,
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 28, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 317, in gptq_marlin_gemm
    return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 145, in start_controller_process
    controller = ControllerSingle(
                 ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 81, in __init__
    self.tp_server = ModelTpServer(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/tp_worker.py", line 100, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 128, in __init__
    self.init_cuda_graphs()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 468, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 153, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Failed running call_function _C.gptq_marlin_gemm(*(FakeTensor(..., device='cuda:0', size=(1, 896), dtype=torch.float16), Parameter(FakeTensor(..., device='cuda:0', size=(56, 2304), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(7, 1152), dtype=torch.float16)), Parameter(FakeTensor(..., device='cuda:0', size=(7, 144), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(0,), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(0,), dtype=torch.int32)), FakeTensor(..., device='cuda:0', size=(288,), dtype=torch.int32), <torch._library.fake_class_registry.FakeScriptObject object at 0x7f8e2d214590>, 1, 1152, 896, True, True, True), **{}):
_C::gptq_marlin_gemm() Expected a value of type '__torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0)' for argument '_7' but instead found type 'FakeScriptObject'.
Position: 7
Value: <torch._library.fake_class_registry.FakeScriptObject object at 0x7f8e2d214590>
Declaration: _C::gptq_marlin_gemm(Tensor _0, Tensor _1, Tensor _2, Tensor _3, Tensor _4, Tensor _5, Tensor _6, __torch__.torch.classes._core_C.ScalarType _7, int _8, int _9, int _10, bool _11, bool _12, bool _13) -> Tensor _0
Cast error details: Tried to cast object to type __torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0) but object was missing attribute capsule

from user code:
   File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/external_utils.py", line 38, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 290, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 256, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 208, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 154, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/linear.py", line 375, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 262, in apply
    return apply_awq_marlin_linear(
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 289, in apply_awq_marlin_linear
    output = ops.gptq_marlin_gemm(reshaped_x,
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 28, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 317, in gptq_marlin_gemm
    return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 

, detoken_init_state: init ok

2

root@c670148f30c4:~# python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen2.5-0.5B-Instruct-AWQ --dp 8 --enable-p2p-check --mem-fraction-static 0.01 --enable-torch-compile
[16:53:16] server_args=ServerArgs(model_path='Qwen/Qwen2.5-0.5B-Instruct-AWQ', tokenizer_path='Qwen/Qwen2.5-0.5B-Instruct-AWQ', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='Qwen/Qwen2.5-0.5B-Instruct-AWQ', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, additional_ports=[30001, 30002, 30003, 30004, 30005, 30006, 30007, 30008, 30009, 30010, 30011], mem_fraction_static=0.01, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=567523481, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=8, load_balance_method='round_robin', nccl_init_addr=None, nnodes=1, node_rank=None, json_model_override_args='{}', attention_backend='flashinfer', sampling_backend='flashinfer', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, enable_mixed_chunk=False, enable_torch_compile=True, max_torch_compile_bs=32, torchao_config='', enable_p2p_check=True, triton_attention_reduce_in_fp32=False, lora_paths=None, max_loras_per_batch=8)
[16:53:18 DP0 TP0] Init nccl begin.
[16:53:18 DP0 TP0] Load weight begin. avail mem=44.09 GB
INFO 09-26 16:53:18 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[16:53:19 DP0 TP0] lm_eval is not installed, GPTQ may not be usable
INFO 09-26 16:53:20 weight_utils.py:236] Using model weights format ['*.safetensors']
INFO 09-26 16:53:20 weight_utils.py:280] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.23it/s]

[16:53:20 DP0 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=43.33 GB
Process Process-1:1:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 145, in start_controller_process
    controller = ControllerSingle(
                 ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 81, in __init__
    self.tp_server = ModelTpServer(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/tp_worker.py", line 100, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 121, in __init__
    self.init_memory_pool(
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 387, in init_memory_pool
    raise RuntimeError(
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/dist-packages/sglang/launch_server.py", line 16, in <module>
    raise e
  File "/usr/local/lib/python3.11/dist-packages/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/server.py", line 373, in launch_server
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 195, in start_controller_process
    controller = ControllerMulti(server_args, port_args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 98, in __init__
    self.start_dp_worker(i)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 125, in start_dp_worker
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 145, in start_controller_process
    controller = ControllerSingle(
                 ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 81, in __init__
    self.tp_server = ModelTpServer(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/tp_worker.py", line 100, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 121, in __init__
    self.init_memory_pool(
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 387, in init_memory_pool
    raise RuntimeError(
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.

, detoken_init_state: init ok

Environment

host: runpod.io
gpu: 8*A40
OS image: RunPod Pytorch 2.4.0 runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04

root@c670148f30c4:~# python -c "import sglang; print(sglang.__version__)"
0.3.2

root@c670148f30c4:~# nvidia-smi
Thu Sep 26 16:56:56 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:4F:00.0 Off |                    0 |
|  0%   27C    P8              21W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     On  | 00000000:52:00.0 Off |                    0 |
|  0%   29C    P8              21W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     On  | 00000000:53:00.0 Off |                    0 |
|  0%   30C    P8              28W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A40                     On  | 00000000:56:00.0 Off |                    0 |
|  0%   30C    P8              32W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A40                     On  | 00000000:57:00.0 Off |                    0 |
|  0%   29C    P8              22W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A40                     On  | 00000000:CE:00.0 Off |                    0 |
|  0%   29C    P8              21W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A40                     On  | 00000000:D1:00.0 Off |                    0 |
|  0%   31C    P8              22W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A40                     On  | 00000000:D5:00.0 Off |                    0 |
|  0%   29C    P8              21W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
zeng-zc commented 3 weeks ago

Is --mem-fraction-static 0.05 too small? It sets the static GPU memory size used for the cache.

smallstepman commented 3 weeks ago

I tried a range of values, anything from 0.9 down to 0.01.

Keep in mind that the 0.5B AWQ model is about 700 MB in size, which is around 1.5% of the memory available on an A40.

zeng-zc commented 3 weeks ago
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
--mem-fraction-static MEM_FRACTION_STATIC
The fraction of the memory used for static allocation
(model weights and KV cache memory pool).

The KV cache is also included in --mem-fraction-static. I think the log gives a clear hint:

RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
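
As a rough illustration of the point above, here is a minimal sketch of the arithmetic, using the "avail mem" figures from the second reproduction log (44.09 GB before loading weights, 43.33 GB after). The budget model is an assumption based on the help text (static budget = fraction × available memory, covering weights plus KV cache); the function name `kv_cache_budget_gb` is hypothetical, and SGLang's actual calculation may differ.

```python
def kv_cache_budget_gb(avail_before_gb: float,
                       avail_after_weights_gb: float,
                       mem_fraction_static: float) -> float:
    """Memory left for the KV cache pool under a simple static-budget model.

    Assumption (not SGLang's confirmed formula): the static budget is
    mem_fraction_static * avail_before_gb and must hold weights + KV cache.
    """
    weights_gb = avail_before_gb - avail_after_weights_gb
    static_budget_gb = mem_fraction_static * avail_before_gb
    return static_budget_gb - weights_gb


# Figures taken from the log lines "avail mem=44.09 GB" and "avail mem=43.33 GB".
for fraction in (0.01, 0.05):
    left = kv_cache_budget_gb(44.09, 43.33, fraction)
    status = "room for KV cache" if left > 0 else "not enough memory"
    print(f"--mem-fraction-static {fraction}: {left:+.2f} GB ({status})")
```

Under this model, 0.01 leaves a negative budget (the weights alone take about 0.76 GB), which is consistent with the "Not enough memory" error, while 0.05 leaves a positive one. It does not explain the failure at larger values.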
smallstepman commented 3 weeks ago

I went to very low values, like 0.01, simply to demonstrate the two extremes of the range, as shown in the two reproductions above.

You could try any other value (0.02, 0.03, 0.035, 0.04, 0.2, 0.4, 0.8, etc.) and you'd still end up with one of these two errors.

This means there is no valid value of --mem-fraction-static that I can choose to make it work. Therefore, the error message is misleading, because the error relates to something other than the value of --mem-fraction-static.


I'm no expert in anything that's happening under the hood, but after taking a second look at the logs, the error is possibly related to the quantization used by the model (AWQ): _C::gptq_marlin_gemm() Expected a value of type '__torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0)' for argument '_7' but instead found type 'FakeScriptObject'.


Btw, I had to delete a significant chunk of the error logs from error #1 because GitHub was complaining about the length of the message. The deleted portion was replaced with ...

yileld commented 3 weeks ago

It seems that AWQ models can't use CUDA graph. I tried this several weeks ago and ended up turning off CUDA graph when using quantized models in my code.

smallstepman commented 3 weeks ago

I have no problem running python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen2.5-72B-Instruct-AWQ --tp 2 --dp 1 --enable-p2p-check --mem-fraction-static 0.8 (so CUDA graph is enabled), but once I add --enable-torch-compile, it errors out.

merrymercy commented 2 weeks ago

The reason is that torch.compile is not compatible with awq or gptq. It is unrelated to data parallelism, cuda graph, or other things.

merrymercy commented 2 weeks ago

We will work with the torchao team (cc @jerryzh168) to make all of them compatible with each other soon.
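
Until that compatibility work lands, one possible workaround is to leave out --enable-torch-compile whenever the checkpoint is AWQ- or GPTQ-quantized. The sketch below is only an illustration: `quant_method` and `build_launch_args` are hypothetical helpers, not part of SGLang, and it assumes the model's Hugging Face config.json carries a `quantization_config.quant_method` field, as AWQ and GPTQ checkpoints typically do.

```python
from typing import Any, Dict, List, Sequence

# Quantization methods reported in this thread as failing under torch.compile.
INCOMPATIBLE_WITH_COMPILE = {"awq", "gptq"}


def quant_method(hf_config: Dict[str, Any]) -> str:
    """Return the lowercase quant_method from a config.json dict, or ''."""
    qcfg = hf_config.get("quantization_config") or {}
    return str(qcfg.get("quant_method", "")).lower()


def build_launch_args(model_path: str,
                      hf_config: Dict[str, Any],
                      want_compile: bool = True,
                      extra: Sequence[str] = ()) -> List[str]:
    """Build a launch command, adding --enable-torch-compile only when safe."""
    args = ["python", "-m", "sglang.launch_server", "--model-path", model_path]
    args.extend(extra)
    if want_compile and quant_method(hf_config) not in INCOMPATIBLE_WITH_COMPILE:
        args.append("--enable-torch-compile")
    return args


# Hypothetical config dict standing in for the model's config.json.
awq_config = {"quantization_config": {"quant_method": "awq"}}
print(" ".join(build_launch_args("Qwen/Qwen2.5-0.5B-Instruct-AWQ",
                                 awq_config, extra=["--dp", "8"])))
```

The resulting list can be passed to `subprocess.run`. For the AWQ model from this issue, the flag is dropped, which matches the third "possible solution" printed in the error log.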