Open bonuschild opened 11 months ago
I've re-tested this on an A100 instead of an RTX 3060, and it ends up occupying 20+ GB of VRAM! Why is that? I use the command:
python api_server.py --model path/to/7b-awq/model --port 8000 -q awq --dtype half --trust-remote-code
That was so weird...
I had success running TheBloke's Mistral-7B-v0.1-AWQ and CodeLlama-7B-AWQ on an A6000 with 48 GB of VRAM, restricted to ~8 GB of VRAM with the following parameters:
python api_server.py --model path/to/model --port 8000 --quantization awq --dtype float16 --gpu-memory-utilization 0.167 --max-model-len 4096 --max-num-batched-tokens 4096
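For context on why the A100 run ballooned to 20+ GB: as far as I know, vLLM pre-allocates a fraction of the GPU's total VRAM (the default for --gpu-memory-utilization is 0.9) for its KV-cache block pool, so resident memory scales with the size of the GPU, not with the quantized weights. A minimal sketch of the arithmetic (the function name and numbers are illustrative, not measured):

```python
def expected_vram_gb(total_gb: float, utilization: float = 0.9) -> float:
    """Approximate VRAM vLLM will reserve given --gpu-memory-utilization.

    Illustrative only: actual reservation depends on profiling at startup.
    """
    return total_gb * utilization

# A 7B AWQ model's weights are only a few GB, but with the default
# utilization of 0.9 the server still grows toward the GPU-sized cap:
print(expected_vram_gb(40.0))         # A100 40 GB -> ~36 GB reserved
print(expected_vram_gb(48.0, 0.167))  # A6000 capped to ~8 GB
```

That is why lowering --gpu-memory-utilization (or omitting it on a small card) changes the footprint so dramatically.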
nvidia-smi then shows around 8 GB of memory consumed by the python process. It should run on the 3060 as well, I hope (you need to omit the --gpu-memory-utilization, obviously).
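As a rough sanity check that ~8 GB leaves room for --max-model-len 4096: a back-of-the-envelope KV-cache estimate for a Llama-7B-shaped model, assuming 32 layers, 32 KV heads, head dim 128, an fp16 cache, and roughly 4 GiB of AWQ weights (all assumptions on my part, not measured):

```python
def kv_bytes_per_token(layers: int = 32, kv_heads: int = 32,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token()          # 524288 bytes = 0.5 MiB per token
budget = 8 * 1024**3 - 4 * 1024**3      # 8 GiB cap minus ~4 GiB assumed weights
print(budget // per_tok)                # → 8192 cacheable tokens
```

So even under the ~8 GB cap there is headroom for a couple of 4096-token sequences, which matches the --max-num-batched-tokens 4096 setting above.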
Looking for help from 2 communities 😄 thx!