vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: When using vllm to start the internvl2-8b (16GB) model service on an A10 card (24GB), an error occurs. The command is as follows: python -m vllm.entrypoints.openai.api_server --model ./internvl2-8b --dtype auto --gpu-memory-utilization 0.9 --trust-remote-code usage #9578

Closed hyyuananran closed 2 hours ago

hyyuananran commented 2 hours ago

Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

```text
torch.OutOfMemoryError: Error in model execution (input dumped to /tmp/err_executemodel.input_20241022-033658.pkl): CUDA out of memory. Tried to allocate 1.93 GiB. GPU 0 has a total capacity of 22.19 GiB of which 183.88 MiB is free. Process 3759893 has 22.00 GiB memory in use. Of the allocated memory 19.99 GiB is allocated by PyTorch, and … is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
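For reference, the allocator setting mentioned at the end of the traceback can be exported before launching the server. This is only a sketch of what the PyTorch message itself suggests; it helps with fragmentation (large reserved-but-unallocated memory) and does not reduce the model's actual memory footprint:

```bash
# Suggested by the PyTorch OOM message; assumes a bash shell.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python -m vllm.entrypoints.openai.api_server --model ./internvl2-8b --dtype auto --trust-remote-code
```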


DarkLight1337 commented 2 hours ago

Remember that it's not just the model that takes up GPU memory - data passed into the model also eats up memory. You can set max_model_len and/or max_num_seqs to a smaller value to avoid OOM.
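For example, a launch command along these lines caps the context length and the number of concurrent sequences so the KV cache fits in the memory left over after the 16GB model weights. The specific values (4096 and 64) are illustrative placeholders, not taken from this issue:

```bash
# Same command as in the title, with the two limits suggested above.
# Tune --max-model-len and --max-num-seqs down until the server starts without OOM.
python -m vllm.entrypoints.openai.api_server \
    --model ./internvl2-8b \
    --dtype auto \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 64
```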

hyyuananran commented 2 hours ago

Thank you very much for clarifying both settings; that resolved my confusion.