vinnitu opened this issue 1 year ago
Maybe I'm wrong, but I think memory usage should be about 7.3B parameters × 4 bytes ≈ 30 GB, yet nvidia-smi shows 72 GB after running:

python3 -m ochat.serving.openai_api_server --model berkeley-nest/Starling-LM-7B-alpha

Can we control this?
That's because vLLM pre-allocates GPU memory as KV cache. You can run python3 -m ochat.serving.openai_api_server --help to check the arguments that control the preallocation behavior.
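The relevant knob in vLLM is gpu_memory_utilization, which defaults to 0.9; on an 80 GB GPU that works out to 0.9 × 80 ≈ 72 GB, consistent with the nvidia-smi reading above (the 80 GB GPU is an assumption here, not stated in the issue). A minimal sketch, assuming the ochat server forwards vLLM's standard engine arguments (verify with --help):

# Cap KV-cache preallocation at ~50% of GPU memory instead of the default ~90%
python3 -m ochat.serving.openai_api_server \
    --model berkeley-nest/Starling-LM-7B-alpha \
    --gpu-memory-utilization 0.5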