vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: When using vllm to start the internvl2-8b (16GB) model service on an A10 card (24GB), an error occurs. The command is as follows: python -m vllm.entrypoints.openai.api_server --model ./internvl2-8b --dtype auto --gpu-memory-utilization 0.9 --trust-remote-code usage #9578

Closed hyyuananran closed 2 hours ago

hyyuananran commented 2 hours ago

Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

```text
torch.OutOfMemoryError: Error in model execution (input dumped to /tmp/err_executemodel.input_20241022-033658.pkl): CUDA out of memory. Tried to allocate 1.93 GiB. GPU 0 has a total capacity of 22.19 GiB of which 183.88 MiB is free. Process 3759893 has 22.00 GiB memory in use. Of the allocated memory 19.99 GiB is allocated by PyTorch, and … is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
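For reference, the allocator setting mentioned at the end of the traceback can be exported before launching the server. This is only a sketch of what the PyTorch message itself suggests; it helps with fragmentation (large reserved-but-unallocated memory) and does not reduce the model's actual memory footprint:

```bash
# Suggested by the PyTorch OOM message; assumes a bash shell.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python -m vllm.entrypoints.openai.api_server --model ./internvl2-8b --dtype auto --trust-remote-code
```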


DarkLight1337 commented 2 hours ago

Remember that it's not just the model that takes up GPU memory - data passed into the model also eats up memory. You can set max_model_len and/or max_num_seqs to a smaller value to avoid OOM.
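For example, a launch command along these lines caps the context length and the number of concurrent sequences so the KV cache fits in the memory left over after the 16GB model weights. The specific values (4096 and 64) are illustrative placeholders, not taken from this issue:

```bash
# Same command as in the title, with the two limits suggested above.
# Tune --max-model-len and --max-num-seqs down until the server starts without OOM.
python -m vllm.entrypoints.openai.api_server \
    --model ./internvl2-8b \
    --dtype auto \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 64
```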

hyyuananran commented 2 hours ago

Thank you very much for clarifying both settings; that resolved my confusion.