vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

Only one GPU shows usage while quantizing #890

Open gmonair opened 2 weeks ago

gmonair commented 2 weeks ago

Hi, while quantizing large models (Qwen 72B) on 5x A40 GPUs, I noticed that only the first GPU shows high (80-90%) utilization while the rest sit at 0%. Is this normal, or am I missing some config flags? The process completes and the resulting model works fine; I'm just wondering whether this is expected or whether I can speed it up somehow. Thanks!

dsikka commented 1 week ago

Hi! Could you share the script you're running to quantize the model?

gmonair commented 1 week ago

Sure, I'm using a pretty straightforward script that matches the examples:

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

NUM_CALIBRATION_SAMPLES=1024
MAX_SEQUENCE_LENGTH=4096

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A16", targets="Linear", ignore=["lm_head"], dampening_frac=0.1),
]

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

ds = load_dataset("AI-MO/NuminaMath-TIR", split='train')
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    system_prompt = (
        "Please integrate natural language reasoning with programs to solve "
        "the problem above, and put your final answer within \\boxed{}."
    )
    return {
        "text": tokenizer.apply_chat_template(
            [{"role": "system", "content": system_prompt}, *example["messages"]],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    output_dir=OUTPUT_DIR,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

Where MODEL_ID is Qwen2.5-Math-72B-Instruct.
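(For reference, the two names the pasted script leaves undefined would be set near the top, roughly as below; the exact Hugging Face path and output directory are placeholders, not part of the original post:)

MODEL_ID = "Qwen/Qwen2.5-Math-72B-Instruct"  # assumed HF path for the model named above
OUTPUT_DIR = "qwen2.5-math-72b-instruct-w8a16"  # placeholder output directory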

gmonair commented 1 week ago

[Screenshot from 2024-11-05: nvtop showing GPU utilization]

For reference, this is what I see in nvtop

dsikka commented 1 week ago

Hi @gmonair, instead of passing "auto" for the device map, could you pass in the output of calculate_offload_device_map?

Example:

import torch

from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=True,
    num_gpus=torch.cuda.device_count(),
    trust_remote_code=True,
)

This will ensure that CPU offloading is applied properly during oneshot and that all available GPUs are used.
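For context, a minimal sketch of how the returned device_map would slot into the original loading call in place of "auto" (the model path is assumed from the thread above):

import torch
from llmcompressor.transformers import SparseAutoModelForCausalLM
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "Qwen/Qwen2.5-Math-72B-Instruct"  # assumed model path, per the thread above

# Compute a device map that spreads the model across all visible GPUs,
# reserving room for the GPTQ Hessian buffers, then load with it instead of "auto".
device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=True,
    num_gpus=torch.cuda.device_count(),
    trust_remote_code=True,
)

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype="auto",
)

The rest of the script (dataset preparation and the oneshot call) stays the same.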