vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

SmoothQuant doesn't work with cpu offloading #107

Closed: anmarques closed this issue 2 weeks ago

anmarques commented 2 months ago

Describe the bug

When using a SmoothQuantModifier together with cpu offloading, tensors end up on different devices and the smoothing step fails with a device-mismatch error.
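A minimal sketch of the failure mode (my own illustration, not the library's code): under cpu offloading, the offloaded weights, and therefore the weight scales derived from them, live on the CPU, while the activation scales gathered during calibration live on a GPU, so the division in the smoothing step mixes devices.

import torch

# Illustration only (requires a CUDA device): dividing a GPU tensor by a
# CPU tensor trips the same same-device check seen in the traceback below.
activation_scales = torch.ones(8, device="cuda")  # calibration stats on GPU
weight_scales = torch.ones(8, device="cpu")       # offloaded weights stay on CPU
scales = activation_scales.pow(0.8) / weight_scales.pow(0.2)
# RuntimeError: Expected all tensors to be on the same device, ...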

Expected behavior

cpu offloading should work with SmoothQuant 😄

Environment

I don't think the environment is relevant here.

To Reproduce

from transformers import AutoTokenizer
from datasets import load_dataset
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers.compression.helpers import custom_offload_device_map

model_id = "meta-llama/Meta-Llama-3.1-405B-Instruct"

num_samples = 512
max_seq_len = 4096
num_gpus = 8
max_memory_per_gpu = "20GB"

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
  return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

recipe = [
  SmoothQuantModifier(smoothing_strength=0.8),
  GPTQModifier(
    sequential=True,
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
    dampening_frac=0.01,
    observer="mse"
  )
]

device_map = custom_offload_device_map(
  model_id, 
  max_memory_per_gpu=max_memory_per_gpu,
  num_gpus=num_gpus, 
  torch_dtype="auto",
)

model = SparseAutoModelForCausalLM.from_pretrained(
  model_id,
  device_map="auto",
)

oneshot(
  model=model,
  dataset=ds,
  recipe=recipe,
  max_seq_length=max_seq_len,
  num_calibration_samples=num_samples,
)

model.save_pretrained("Meta-Llama-3.1-405B-Instruct-quantized.w8a8")

Errors

2024-08-18T11:40:15.797525+0000 | _apply_smoothing | INFO - Smoothing activation scales...
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/code/queue_llmcompressor_oneshot.py", line 253, in <module>
    oneshot(
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 76, in oneshot
    main(model_args, data_args, training_args)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 364, in main
    stage_runner.one_shot()
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/runner.py", line 171, in one_shot
    self.trainer.one_shot(calibration_data=calib_data, stage=stage)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/transformers/finetune/session_mixin.py", line 401, in one_shot
    apply(
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session_functions.py", line 184, in apply
    return active_session().apply(
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 210, in apply
    self.initialize(**kwargs)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/session.py", line 156, in initialize
    mod_data = self._lifecycle.initialize(
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/core/lifecycle.py", line 126, in initialize
    data = mod.initialize(state=self.state, **extras)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/stage.py", line 124, in initialize
    modifier.initialize(state, **kwargs)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/modifier.py", line 118, in initialize
    initialized = self.on_initialize(state=state, **kwargs)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 135, in on_initialize
    self._apply_smoothing(state.model)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 282, in _apply_smoothing
    scales = self._calculate_smoothing_scales(balance_layers, activation_scales)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 332, in _calculate_smoothing_scales
    scales = activation_scales.pow(self.smoothing_strength) / weight_scales.pow(
  File "/usr/local/lib/python3.10/dist-packages/torch/_prims_common/wrappers.py", line 266, in _fn
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_refs/__init__.py", line 1171, in div
    return true_divide(a, b)
  File "/usr/local/lib/python3.10/dist-packages/torch/_prims_common/wrappers.py", line 266, in _fn
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_prims_common/wrappers.py", line 138, in _fn
    result = fn(**bound.arguments)
  File "/usr/local/lib/python3.10/dist-packages/torch/_refs/__init__.py", line 1044, in _ref
    output = prim(a, b)
  File "/usr/local/lib/python3.10/dist-packages/torch/_refs/__init__.py", line 1738, in true_divide
    return prims.div(a, b)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 667, in __call__
    return self_._op(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_library/abstract_impl.py", line 95, in meta_kernel
    return abstract_impl_holder.kernel(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_library/utils.py", line 20, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/library.py", line 788, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 471, in fake_impl
    return self._abstract_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_prims/__init__.py", line 383, in _prim_elementwise_meta
    utils.check_same_device(*args_, allow_cpu_scalar_tensors=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/_prims_common/__init__.py", line 742, in check_same_device
    raise RuntimeError(msg)
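
A possible workaround, as an untested sketch (my guess, not the project's fix): move the weight scales onto the activation scales' device inside _calculate_smoothing_scales before dividing, so layers whose weights are offloaded to the CPU pass the same-device check.

# Hypothetical patch in llmcompressor/modifiers/smoothquant/base.py,
# _calculate_smoothing_scales; the exponent on the weight side is assumed
# to be 1 - smoothing_strength, per the SmoothQuant formulation.
weight_scales = weight_scales.to(activation_scales.device)
scales = activation_scales.pow(self.smoothing_strength) / weight_scales.pow(
    1 - self.smoothing_strength
)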

Additional context

I encountered this issue while quantizing Meta-Llama-3.1-405B-Instruct, but I'm sure a model this large isn't needed to reproduce it.

fengyang95 commented 2 months ago

model = SparseAutoModelForCausalLM.from_pretrained(
  model_id,
  device_map="auto",
)

It seems you built a device_map with custom_offload_device_map but never passed it here; the model is still loaded with device_map="auto". Try:

device_map = custom_offload_device_map(
  model_id, 
  max_memory_per_gpu=max_memory_per_gpu,
  num_gpus=num_gpus, 
  torch_dtype="auto",
)
model = SparseAutoModelForCausalLM.from_pretrained(
  model_id,
  device_map=device_map,
)
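
If I understand the helper correctly, custom_offload_device_map only computes a placement (capping each GPU at max_memory_per_gpu and offloading whatever doesn't fit to the CPU); it has no effect unless the returned map is actually passed to from_pretrained, so with device_map="auto" the placement is left entirely to accelerate's defaults.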
markurtz commented 2 weeks ago

Closing this out due to lack of activity. Please reopen if you are still hitting the issue!