vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

inference with AWQ quantization #3348

Open Kev1ntan opened 8 months ago

Kev1ntan commented 8 months ago

Hi, I noticed an anomaly while running inference on Mistral with AWQ. Below is the GPU usage on a 3090: the AWQ model consumes 20 GB of GPU memory, whereas inference with the base (unquantized) model consumes only 19 GB.

[Screenshot: GPU memory usage, 2024-03-12 18:11]

Here is the command: python -m vllm.entrypoints.openai.api_server --model ../Mistral-AWQ --disable-log-requests --port 9000 --host 127.0.0.1 --max-num-seqs 500 --max-model-len 27000 --quantization awq
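For reference, a minimal sketch of querying this server once it is up, assuming the default OpenAI-compatible /v1/completions route and that the served model name defaults to the --model path:

```python
# Minimal sketch of hitting the server launched above. Assumes it is listening on
# 127.0.0.1:9000 and that the served model name is the --model path ("../Mistral-AWQ").
import requests

resp = requests.post(
    "http://127.0.0.1:9000/v1/completions",
    json={
        "model": "../Mistral-AWQ",
        "prompt": "Hello, my name is",
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```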

Can anyone help? Thank you.

rkooo567 commented 8 months ago

Maybe CUDA graphs? If you use enforce_eager=True, does it still consume the same amount of memory?
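For reference, a minimal sketch of the same setup with CUDA graph capture disabled (enforce_eager=True in the offline LLM API, or --enforce-eager on the api_server command line); the path and limits below are copied from the command above:

```python
# Minimal sketch: load the same AWQ checkpoint with CUDA graph capture disabled,
# so the extra memory reserved during graph capture is skipped.
from vllm import LLM, SamplingParams

llm = LLM(
    model="../Mistral-AWQ",      # path from the original command
    quantization="awq",
    max_model_len=27000,
    max_num_seqs=500,
    enforce_eager=True,          # disable CUDA graphs; CLI equivalent: --enforce-eager
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```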

Kev1ntan commented 8 months ago

> Maybe CUDA graphs? If you use enforce_eager=True, does it still consume the same amount of memory?

I will try it later when my instance is active again.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!