vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: vLLM AutoAWQ with 4 GPUs doesn't utilize GPU #4744

Open danielstankw opened 1 month ago

danielstankw commented 1 month ago

Your current environment

...

How would you like to use vllm

I have downloaded a model. Now, on my 4-GPU instance, I am trying to quantize it using AutoAWQ. Whenever I run the script below I see 0% GPU utilization. Can anyone help me understand why this is happening?

import json
from huggingface_hub import snapshot_download
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os

# some other code here
# ////////////////
# some code here

# Load model
model = AutoAWQForCausalLM.from_pretrained(args.model_path, device_map="auto", **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)

# Parse quantization config if one was passed as a JSON string
if args.quant_config:
    quant_config = json.loads(args.quant_config)
else:
    # Default quantization config
    print("Using default quantization config")
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Quantize
print("Quantizing the model")
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
if args.quant_path:
    print("Saving the model")
    model.save_quantized(args.quant_path)
    tokenizer.save_pretrained(args.quant_path)
else:
    print("No quantized model path provided, not saving quantized model.")


iwaitu commented 4 weeks ago

try this:

import torch
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO stage 3 shards model states across the GPUs (DeepSpeed defines stages 0-3)
deepspeed_config = {
    "train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 3
    },
    "fp16": {
        "enabled": True
    }
}
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=deepspeed_config)
accelerator = Accelerator(mixed_precision='fp16', deepspeed_plugin=deepspeed_plugin)

model = AutoAWQForCausalLM.from_pretrained(output_model_path, torch_dtype=torch.float16, device_map="auto")
model = accelerator.prepare(model)

# quant_config and quant_path as defined in your script above
model.quantize(tokenizer, quant_config=quant_config)
if accelerator.is_main_process:
    model.save_quantized("./" + quant_path, safetensors=True)
    tokenizer.save_pretrained("./" + quant_path)
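
Once the quantized checkpoint exists, the usual way to actually exercise the GPUs with it is to serve it through vLLM with AWQ enabled. A minimal sketch, assuming the checkpoint was saved to quant_path as above and that all four GPUs should be used for tensor parallelism (the prompt is just a placeholder):

from vllm import LLM, SamplingParams

# Load the AWQ-quantized checkpoint saved above; tensor_parallel_size=4
# splits the model across the four GPUs on this instance.
llm = LLM(model=quant_path, quantization="awq", tensor_parallel_size=4)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)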