smallstepman opened 3 weeks ago
--mem-fraction-static 0.05
Too small? It's the static GPU memory budget for the cache.
I tried a range of values, anything from 0.9 down to 0.01.
Keep in mind the 0.5B AWQ model is only about 700 MB in size, which is around 1.5% of the memory available on an A40.
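A quick sanity check of that arithmetic (assuming the A40's full 48 GB is usable):

```python
# Sanity check: how big is a ~700 MB checkpoint relative to an A40's 48 GB?
model_size_gb = 0.7   # approximate size of the 0.5B AWQ checkpoint
a40_vram_gb = 48.0    # A40 VRAM (assumed fully available here)

fraction = model_size_gb / a40_vram_gb
print(f"{fraction:.1%}")  # -> 1.5%
```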
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
--mem-fraction-static MEM_FRACTION_STATIC The fraction of the memory used for static allocation (model weights and KV cache memory pool).
The KV cache is also included in --mem-fraction-static. I think the log gives a clear hint:
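To make the accounting concrete, here is a rough, hypothetical sketch of the budget split; the helper and the numbers are illustrative, not sglang's actual code. The static budget is total_vram * fraction, and whatever is left after the weights is available for the KV pool:

```python
def kv_pool_gb(total_vram_gb: float, mem_fraction_static: float,
               weights_gb: float) -> float:
    """Hypothetical accounting: the static budget covers both the model
    weights and the KV-cache pool; whatever remains is the pool size."""
    static_budget = total_vram_gb * mem_fraction_static
    return static_budget - weights_gb

# With ~0.7 GB of weights on a 48 GB A40:
print(kv_pool_gb(48, 0.01, 0.7))  # negative -> "Not enough memory"
print(kv_pool_gb(48, 0.05, 0.7))  # positive -> room for a small KV pool
```

Under this model, 0.01 of 48 GB is only 0.48 GB, which cannot even hold the 0.7 GB of weights, matching the "Not enough memory" error at the low end.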
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
The purpose of going to very low values, like 0.01, is simply to demonstrate the two extremes of the range: 0.01 is too low, raising RuntimeError: Not enough memory, while 0.05 or anything above is too high, raising Exception: Capture cuda graph failed. You could try any other value (0.02, 0.03, 0.035, 0.04, 0.2, 0.4, 0.8, etc.) and you'd still end up with one of these two errors.
This means there is no valid value of --mem-fraction-static that I can choose to make it work. Therefore, the error message is misleading, because the error actually relates to something other than the value of --mem-fraction-static.
I'm no expert in what's happening under the hood, but after taking a second look at the logs, the error is possibly related to the quantization method the model uses (AWQ): _C::gptq_marlin_gemm() Expected a value of type '__torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0)' for argument '_7' but instead found type 'FakeScriptObject'.
Btw, I had to delete a significant chunk of the error logs from error #1, because GitHub was complaining about the message length. The deleted portion was replaced with ...
It seems that AWQ models can't use CUDA graphs. I found this several weeks ago, and I turn off CUDA graphs when using quantized models in my own code.
I have no problem running python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen2.5-72B-Instruct-AWQ --tp 2 --dp 1 --enable-p2p-check --mem-fraction-static 0.8 (so with CUDA graphs enabled), but once I add --enable-torch-compile it errors out.
The reason is that torch.compile is not compatible with AWQ or GPTQ. It is unrelated to data parallelism, CUDA graphs, or anything else.
We will work with the torchao team (cc @jerryzh168) to make all of them compatible with each other soon.
Describe the bug
Can't use --enable-torch-compile in tandem with --dp; it always reports either OOM or not enough memory (see the two examples below). On purpose, I picked one of the smallest models (0.5B) and a GPU with plenty of VRAM (an A40 has 48 GB); despite that, it still doesn't work. Happy to help hunt this down.
Reproduction
Environment
host: runpod.io
gpu: 8×A40
OS image: RunPod PyTorch 2.4.0 (runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04)