Open gmonair opened 2 weeks ago
Hi, while quantizing large models (Qwen 72B) on 5x A40 GPUs, I noticed that only the first GPU shows high (80-90%) utilisation, while the rest sit at 0%. Is this normal, or am I missing some config flags? The process completes and the quantized model works fine afterwards; I'm just wondering whether this is expected or whether I can speed things up somehow. Thanks!
Hi! Could you share the script you're running to quantize the model?
Sure, I'm using a pretty straightforward script that matches the examples:
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

NUM_CALIBRATION_SAMPLES = 1024
MAX_SEQUENCE_LENGTH = 4096

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A16", targets="Linear", ignore=["lm_head"], dampening_frac=0.1),
]

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration set: apply the chat template with a fixed system prompt, then tokenize.
ds = load_dataset("AI-MO/NuminaMath-TIR", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    messages = [
        {"role": "system", "content": "Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}."},
        *example["messages"],
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    output_dir=OUTPUT_DIR,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
Where MODEL_ID is Qwen2.5-Math-72B-Instruct.
For reference, this is what I see in nvtop:
Hi @gmonair,
Instead of passing in "auto" for the device map, could you pass in the output of calculate_offload_device_map?
Example:
import torch

from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=True,
    num_gpus=torch.cuda.device_count(),
    trust_remote_code=True,
)
This will ensure that CPU offloading is applied properly during oneshot and that all available GPUs are used.
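For completeness, a minimal sketch of how the computed device_map would slot into the loading call from the script above (reusing the same MODEL_ID and SparseAutoModelForCausalLM from that script; this shows the suggested substitution, not a separately tested snippet):

# Replace device_map="auto" with the offload map computed above, so memory is
# reserved for the GPTQ Hessians and layers are spread across all visible GPUs.
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype="auto",
    trust_remote_code=True,
)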