pseudotensor closed this issue 3 months ago.
With `--context-length=4096` I also hit the same problem. That can't be right.
Please check again with my PR (https://github.com/sgl-project/sglang/pull/487) and vLLM 0.4.3 to see if the issue is resolved. It may have been fixed here and/or in vLLM since your last report. I tested multi-GPU loading and did not see an obvious regression in VRAM usage, but that was under a different environment with a different model and GPU.
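If it helps, here is one way to test that combination. This is a sketch only; the exact install steps for your environment may differ, and the editable-install path assumes the current sglang repo layout:

```bash
# Sketch: test PR #487 together with vLLM 0.4.3.
# Exact steps are an assumption; adapt to your environment.
pip install vllm==0.4.3
git clone https://github.com/sgl-project/sglang
cd sglang
# Fetch the PR head into a local branch and switch to it.
git fetch origin pull/487/head:pr-487
git checkout pr-487
pip install -e "python[all]"
```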
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
I cannot even load an int3 Qwen-72B model with 50 GB available, even though it usually takes up only about 37 GB of memory with vLLM. It sucks.
After updating from the March 24 version of main to the latest main, I can no longer run 72b without some kind of OOM.
Launching the same way as before now always leads to the errors below. I also tried `--mem-fraction-static=0.9` and `--mem-fraction-static=0.99`; the latter gets further but still fails later. Previously I did not set this option at all and everything worked.
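For context, the kind of launch command involved looks like this (the model path and tensor-parallel size here are placeholders, not my exact setup):

```bash
# Illustrative sglang launch; --model-path and --tp values are placeholders.
python -m sglang.launch_server \
  --model-path Qwen/Qwen1.5-72B-Chat \
  --tp 4 \
  --context-length 4096 \
  --mem-fraction-static 0.9
```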
Failure with `--mem-fraction-static=0.9`:

With 0.98 or 0.99:

With no option set: