Basically, something is wrong now that was okay before: I can't even run Phi-3 Vision on an 80GB H100 anymore.
Hi, thanks for the report!
Can you try reverting to 96354d6a2967a63eb5c0e56a2da2ead512ff1067 (right before 2061f0b8a7f1a01683c4045096a092eedf6387a4)? I believe #5888 may be causing the issue.
Hi @pseudotensor! This is in fact not a bug, but a fix for a previous bug in the initial Phi-3 PR: during memory profiling, the image payload was always None instead of actual pixel values, which led to an over-estimation of the space available for KV cache blocks and would cause OOMs when the server was under max load. (The fixed profiling is conservative, but we would rather keep it that way for now than leave open the possibility of crashing the server.)
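To make the accounting concrete, here is a minimal pure-Python sketch of the trade-off (the function name, memory figures, and block size are all illustrative assumptions, not vLLM's actual profiling code or values):

```python
GIB = 1024**3

def num_kv_blocks(total_gpu_mem: int, weights: int, activation_peak: int,
                  block_bytes: int, gpu_mem_util: float = 0.9) -> int:
    """KV cache blocks come from whatever memory remains after the model
    weights and the peak activation memory seen during the profiling pass."""
    usable = int(total_gpu_mem * gpu_mem_util)
    free_for_kv = usable - weights - activation_peak
    return max(free_for_kv // block_bytes, 0)

total = 80 * GIB       # H100
weights = 8 * GIB      # assumed fp16 weight footprint (illustrative)
block = 16 * 2**20     # assumed bytes per KV block (illustrative)

# Old behavior: profiling ran with image payload = None, so the vision
# tower's activations never showed up in the measured peak, and too many
# KV blocks were reserved -- OOM once real image requests arrived.
blocks_buggy = num_kv_blocks(total, weights, activation_peak=4 * GIB,
                             block_bytes=block)

# Fixed behavior: dummy pixel values are included, the measured peak is
# higher, and fewer KV blocks are reserved -- conservative, but safe.
blocks_fixed = num_kv_blocks(total, weights, activation_peak=20 * GIB,
                             block_bytes=block)

print(blocks_buggy, blocks_fixed)  # e.g. 3840 vs. 2816 blocks
```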
If you lower `--max-num-seqs` (I've tested on an H100 that it can go up to 17), you should still be able to launch the server with the full context length.
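For example, via vLLM's Python entrypoint (a sketch only; the model id, `max_model_len`, and the choice of 16 here are illustrative, picked to stay at or below the ~17 sequences that reportedly fit):

```python
from vllm import LLM

# max_num_seqs caps the batch size assumed during memory profiling, so
# lowering it leaves enough room for the full 128k context.
llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,
    max_model_len=131072,  # full context length
    max_num_seqs=16,
)
```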
I've also opened #5981 to avoid this confusion.
Ok, I've misunderstood `max_num_seqs` then. I thought that was a maximum, not a required limit, so I would have expected the context length to supersede the number of sequences, with the number of sequences automatically reduced to accommodate my chosen context length.
🐛 Describe the bug
Same launch command as https://github.com/vllm-project/vllm/issues/5969.
The only difference is commit 2cd402e1692417b7645e4ece11bc2ab91072f47c (latest main as of earlier today).
The GPU is completely free, so this is a new bug in vLLM introduced somewhere between commits e9de9dd551ac595a9f3825fcd1507deceef4f332 and 2cd402e1692417b7645e4ece11bc2ab91072f47c.