Please provide the OS, CUDA version, CPU, CPU RAM, GPU(s), GPU VRAM sizes, the command line you started vLLM with, the model used, the prompt(s), and the full vLLM log output for diagnosis.
It may also be useful to know where you ran that Docker container.
Do you have a way to reproduce the problem with a locally running Docker container?
If it is a random issue, then how frequent is it?
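If it helps, something like the following should collect most of that information (a sketch using standard commands; adjust as needed for your system):

```
cat /etc/os-release   # OS version
nvcc --version        # CUDA toolkit version
nvidia-smi            # GPUs, VRAM sizes, driver version
lscpu                 # CPU model and core layout
free -h               # total CPU RAM
```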
OS: CentOS 7
CUDA version: 12.2
CPU: AMD EPYC 7V13 64-Core Processor
CPU RAM: 432 GB
GPU: 2x NVIDIA A100 80GB
Command used: python -m vllm.entrypoints.openai.api_server --model /data/LLAMA_CODE/codellama/CodeLlama-13b-Instruct/hf --tensor-parallel-size 1 --port 80 --host 0.0.0.0 --served-model codellama-13b-instruct --gpu-memory-utilization 0.75
Model: codellama-13b-instruct. This has also happened once with llama-2-13b-chat.
I can't provide the prompts or the full vLLM output, as they contain confidential information.
I wasn't able to reproduce the issue manually.
Please try to limit the vLLM process to a certain set of CPU cores using taskset.
Ideally you will find a set of cores that works without crashing; that would serve as a workaround and help narrow down the issue further.
You should only need to try a few "logical" combinations based on your CPU's architecture; see cat /proc/cpuinfo or search for more documentation on that CPU. You can start with just a few cores and increase the count as long as it remains stable, or repeatedly halve the set of cores to narrow it down; either way should work (see the sketch below).
If no stable set of cores exists (other than a single core), then the problem is somewhere else.
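As a sketch (the core range below is a placeholder; pick it based on your actual topology, e.g. from lscpu output):

```
# Pin the server to a hypothetical set of 16 cores and test for stability
taskset -c 0-15 python -m vllm.entrypoints.openai.api_server \
    --model /data/LLAMA_CODE/codellama/CodeLlama-13b-Instruct/hf \
    --tensor-parallel-size 1 --port 80 --host 0.0.0.0 \
    --served-model codellama-13b-instruct --gpu-memory-utilization 0.75

# If the server runs inside Docker, the container can be restricted instead,
# e.g. docker run --cpuset-cpus="0-15" ...
```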
It crashed again. Just before the crash, GPU KV cache usage was around 18% again.
We had this issue as well with vLLM version 0.1.7. Please install the latest version; it should fix it.
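For example, with pip:

```
pip install --upgrade vllm
```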
Closing this issue as stale, since there has been no discussion in the past 3 months.
If you are still experiencing the issue you describe, feel free to re-open it.
We were running in a Docker container. After the crash, subsequent API requests returned 'Internal server error'.