huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Unused smoothing scales when loading AutoAWQ checkpoints #29421

Closed Vatshank closed 6 months ago

Vatshank commented 8 months ago

System Info

Who can help?

Hi @younesbelkada, I am loading an AutoAWQ checkpoint for the model casperhansen/falcon-7b-awq and I get the following warning -

import transformers
quant_path = "casperhansen/falcon-7b-awq"
model_hf = transformers.AutoModelForCausalLM.from_pretrained(quant_path)

Warning -

Some weights of the model checkpoint at casperhansen/falcon-7b-awq were not used when initializing RWForCausalLM: ['transformer.h.4.mlp.act.scales', 'transformer.h.22.mlp.act.scales', 'transformer.h.14.mlp.act.scales', 'transformer.h.12.mlp.act.scales', 'transformer.h.5.mlp.act.scales', 'transformer.h.31.mlp.act.scales', 'transformer.h.15.mlp.act.scales', 'transformer.h.9.mlp.act.scales', 'transformer.h.2.mlp.act.scales', 'transformer.h.13.mlp.act.scales', 'transformer.h.10.mlp.act.scales', 'transformer.h.7.mlp.act.scales', 'transformer.h.3.mlp.act.scales', 'transformer.h.0.mlp.act.scales', 'transformer.h.16.mlp.act.scales', 'transformer.h.23.mlp.act.scales', 'transformer.h.6.mlp.act.scales', 'transformer.h.17.mlp.act.scales', 'transformer.h.25.mlp.act.scales', 'transformer.h.29.mlp.act.scales', 'transformer.h.26.mlp.act.scales', 'transformer.h.28.mlp.act.scales', 'transformer.h.18.mlp.act.scales', 'transformer.h.20.mlp.act.scales', 'transformer.h.21.mlp.act.scales', 'transformer.h.27.mlp.act.scales', 'transformer.h.8.mlp.act.scales', 'transformer.h.24.mlp.act.scales', 'transformer.h.30.mlp.act.scales', 'transformer.h.11.mlp.act.scales', 'transformer.h.19.mlp.act.scales', 'transformer.h.1.mlp.act.scales']
- This IS expected if you are initializing RWForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RWForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

This warning is shown when the activation in the MLP block is GELU, but not for models with ReLU (like opt-awq) or gated activations (like llama2-awq). AutoAWQ handles the GELU case separately by wrapping GELU in a ScaledActivation module that applies the scaling inside the wrapper.
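
For reference, the wrapper conceptually just divides the activation output element-wise by the stored scales. A minimal sketch of that idea (not AutoAWQ's exact code):

import torch
import torch.nn as nn

class ScaledActivation(nn.Module):
    """Wrap an activation and divide its output by per-channel AWQ scales."""
    def __init__(self, act: nn.Module, scales: torch.Tensor):
        super().__init__()
        self.act = act
        self.scales = nn.Parameter(scales)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # one scale per channel of the MLP intermediate size
        return self.act(x) / self.scales.view(1, 1, -1).to(x.device)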

This dedicated scaling does not seem to be needed for ReLU or gated activations because the scaling can be fused into the weights of the first linear layer of the MLP (see the quick check below).

There doesn't seem to be a convenient option like that for GELU, since GELU (unlike ReLU) does not commute with a per-channel scale.
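
A small sketch of why the fusion works for ReLU but not for GELU (ReLU commutes with a positive per-channel scale, so the scale can be folded into the rows of the first linear layer; GELU does not):

import torch
import torch.nn.functional as F

x = torch.randn(8, 16)
w1 = torch.randn(32, 16)       # weights of the first MLP linear layer
s = torch.rand(32) + 0.5       # positive per-channel smoothing scales

# ReLU: relu(x @ w1.T) / s == relu(x @ (w1 / s[:, None]).T), so 1/s can be fused into w1
print(torch.allclose(F.relu(x @ w1.T) / s, F.relu(x @ (w1 / s[:, None]).T), atol=1e-5))  # True

# GELU: the same identity does not hold, hence the separate ScaledActivation wrapper
print(torch.allclose(F.gelu(x @ w1.T) / s, F.gelu(x @ (w1 / s[:, None]).T), atol=1e-5))  # False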

**It's possible I am missing some implementation detail, but do you think this scaling needs to be applied when loading AutoAWQ checkpoints in transformers, by reading the various `transformer.h..mlp.act.scales` values shown in the warning above?**
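
If so, I imagine the fix would look roughly like the following at load time. This is only a hedged sketch reusing the ScaledActivation idea from above; `apply_act_scales`, the state-dict walk, and the module paths are placeholders, not the actual transformers integration code:

import torch

def apply_act_scales(model, state_dict):
    # Wrap each activation whose checkpoint has a "...mlp.act.scales" tensor,
    # so the stored scales are applied instead of being silently dropped.
    for name, scales in state_dict.items():
        if name.endswith("mlp.act.scales"):
            act_path = name[: -len(".scales")]         # e.g. transformer.h.0.mlp.act
            parent_path, attr = act_path.rsplit(".", 1)
            parent = model.get_submodule(parent_path)
            setattr(parent, attr, ScaledActivation(getattr(parent, attr), scales))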

When loading the same model (casperhansen/falcon-7b-awq) with AutoAWQ as below, we see the activation is indeed a ScaledActivation --

from awq import AutoAWQForCausalLM
quant_path = "casperhansen/falcon-7b-awq"

model = AutoAWQForCausalLM.from_quantized(quant_path)

In [11]: print(model.model.transformer.blocks[0].mlp)
Out[11]:
MLP(
  (dense_h_to_4h): WQLinear_GEMM(in_features=4544, out_features=18176, bias=False, w_bit=4, group_size=64)
  (act): ScaledActivation(
    (act): GELU(approximate='none')
  )
  (dense_4h_to_h): WQLinear_GEMM(in_features=18176, out_features=4544, bias=False, w_bit=4, group_size=64)
)

OTOH, the scales inside the ScaledActivation are all 1s for this model (so applying them makes no effective difference), so maybe it was a deliberate decision to ignore the scales in the transformers AWQ integration? But I guess there is no guarantee that the scales will always be 1s, and we might have to handle this case separately?

Out[12]:
Parameter containing:
tensor([1., 1., 1.,  ..., 1., 1., 1.], device='cuda:7', dtype=torch.float16,
       requires_grad=True)
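
For what it's worth, a quick loop over all blocks confirms the same for every layer of this checkpoint (a sketch, with `model` being the AutoAWQ model loaded above):

import torch

print(all(
    torch.all(block.mlp.act.scales == 1).item()
    for block in model.model.transformer.blocks
))  # True here, but nothing guarantees this for other checkpoints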

Sorry for the long post, and please let me know if this belongs someplace else instead of a GH issue. Thanks!

Information

Tasks

Reproduction

Transformers --

import transformers
quant_path = "casperhansen/falcon-7b-awq"
model_hf = transformers.AutoModelForCausalLM.from_pretrained(quant_path)

print(model_hf.transformer.h[0].mlp.act)

Output --

GELU(approximate='none')

AutoAWQ --

from awq import AutoAWQForCausalLM
quant_path = "casperhansen/falcon-7b-awq"
model = AutoAWQForCausalLM.from_quantized(quant_path)

print(model.model.transformer.blocks[0].mlp.act)
print(model.model.transformer.blocks[0].mlp.act.scales)

Output --

ScaledActivation(
  (act): GELU(approximate='none')
)

Parameter containing:
tensor([1., 1., 1.,  ..., 1., 1., 1.], device='cuda:7', dtype=torch.float16,
       requires_grad=True)

Expected behavior

The transformers AWQ integration should also apply some form of smoothing/scaling before the second linear layer of the MLP block when the GELU activation is used, matching AutoAWQ's ScaledActivation behavior.

ArthurZucker commented 8 months ago

cc @SunMarc as well!

ArthurZucker commented 7 months ago

cc @younesbelkada and @SunMarc this seems like a bug

SunMarc commented 7 months ago

Hi @Vatshank, thanks for reporting. The scales are indeed not replaced in the transformers integration. @casper-hansen, is there a case where the scales are not all ones when we have ScaledActivation, or is it safe not to replace the activation?

casper-hansen commented 7 months ago

@SunMarc We replace the activation in the MLP layer with a ScaledActivation layer, mostly for models that use the GELU function. This is done to increase accuracy when quantizing. I would expect the model to produce garbage output, or at least have lower accuracy, if you do not replace the layer with the scales that were found during quantization.

This is how we initialize it in AutoAWQ: the activation is replaced before the weights are loaded.

    @staticmethod
    def _scale_activations(self, layer):
        scale_dict = self.get_act_for_scaling(layer)

        if scale_dict["is_scalable"]:
            if not isinstance(scale_dict["scale_layer"], ScaledActivation):
                param = next(layer.parameters())

                # get activation scale
                scale_like = torch.ones(
                    scale_dict["scale_shape"], dtype=param.dtype, device=param.device
                )

                # scale activation
                scaled_act = ScaledActivation(scale_dict["scale_layer"], scale_like)
                set_op_by_name(layer, scale_dict["scale_name"], scaled_act)

https://github.com/casper-hansen/AutoAWQ/blob/main/awq/models/base.py#L564

SunMarc commented 7 months ago

Thanks for the explanation @casper-hansen! This should be solved with the above PR, @Vatshank.