Open Kev1ntan opened 8 months ago
Maybe the CUDA graphs? If you use `enforce_eager=True`, does it still consume the same amount of memory?
I will try later when my instance is active again.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Hi, I noticed an anomaly while running inference on Mistral with AWQ. On a 3090, the AWQ model consumes 20GB of GPU memory, even though inference on the base model consumes only 19GB.
Here is the command: `python -m vllm.entrypoints.openai.api_server --model ../Mistral-AWQ --disable-log-requests --port 9000 --host 127.0.0.1 --max-num-seqs 500 --max-model-len 27000 --quantization awq`
Can anyone help? Thank you.
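One way to test whether CUDA graph capture accounts for the extra memory is to relaunch the same server with eager execution forced. A minimal sketch, assuming the same model path and ports as above (the `--enforce-eager` flag disables CUDA graph capture in vLLM; memory figures here are only what the thread reports, not guaranteed results):

```shell
# Same launch as above, but with CUDA graphs disabled via --enforce-eager.
# If the reported 20GB usage drops closer to the base model's 19GB,
# CUDA graph capture is the likely source of the overhead.
python -m vllm.entrypoints.openai.api_server \
  --model ../Mistral-AWQ \
  --disable-log-requests \
  --port 9000 \
  --host 127.0.0.1 \
  --max-num-seqs 500 \
  --max-model-len 27000 \
  --quantization awq \
  --enforce-eager
```

Note that vLLM also pre-allocates KV cache based on `gpu_memory_utilization` (default 0.9), so total reported GPU usage can look similar across models regardless of weight size; lowering that value is another way to see how much memory the weights themselves actually need.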