Closed: Vatshank closed this issue 6 months ago
cc @SunMarc as well!
cc @younesbelkada and @SunMarc, this seems like a bug.
Hi @Vatshank, thanks for reporting. The scales are indeed not replaced in the transformers integration. @casper-hansen, is there a case where the scales are not all ones when we have ScaledActivation, or is it safe not to replace the activation?
@SunMarc We replace the MLP layer with the ScaledActivation layer, mostly on models that have the GELU function. This is to increase accuracy when quantizing. I would expect the model to produce garbage output or at least have lower accuracy if you do not replace the layer with the scales that were found during quantization.
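For reference, the wrapper itself is tiny; here is a minimal sketch patterned on AutoAWQ's ScaledActivation module (attribute names are indicative, not authoritative):

```python
import torch
import torch.nn as nn

class ScaledActivation(nn.Module):
    """Wrap an activation and divide its output by per-channel scales found during quantization."""

    def __init__(self, module: nn.Module, scales: torch.Tensor):
        super().__init__()
        self.act = module                       # e.g. nn.GELU()
        self.scales = nn.Parameter(scales.data)  # per-channel scales, loaded from the checkpoint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcast the per-channel scales over (batch, seq_len, hidden) activations.
        return self.act(x) / self.scales.view(1, 1, -1)
```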
This is how we initialize it in AutoAWQ, replacing the activation inside the MLP before loading the weights:
```python
@staticmethod
def _scale_activations(self, layer):
    scale_dict = self.get_act_for_scaling(layer)

    if scale_dict["is_scalable"]:
        if not isinstance(scale_dict["scale_layer"], ScaledActivation):
            param = next(layer.parameters())

            # get activation scale
            scale_like = torch.ones(
                scale_dict["scale_shape"], dtype=param.dtype, device=param.device
            )

            # scale activation
            scaled_act = ScaledActivation(scale_dict["scale_layer"], scale_like)
            set_op_by_name(layer, scale_dict["scale_name"], scaled_act)
```
https://github.com/casper-hansen/AutoAWQ/blob/main/awq/models/base.py#L564
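The `set_op_by_name` call above just swaps the submodule at a dotted path; a rough stand-in (a sketch, not AutoAWQ's actual implementation) looks like this:

```python
import torch.nn as nn

def set_op_by_name(root: nn.Module, name: str, new_module: nn.Module) -> None:
    """Replace the submodule at a dotted path like 'mlp.act' (sketch, not the AutoAWQ source)."""
    *parents, last = name.split(".")
    module = root
    for part in parents:
        # getattr also resolves numeric indices of nn.ModuleList, e.g. "h.0"
        module = getattr(module, part)
    setattr(module, last, new_module)
```

The `torch.ones` tensor is only a placeholder: because the wrapper is installed before the checkpoint weights are loaded, the real `...mlp.act.scales` values from the checkpoint should end up populating `scales`.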
Thanks for the explanation @casper-hansen! This should be solved with the above PR, @Vatshank.
System Info

transformers version: 4.36.0

Who can help?
Hi @younesbelkada, I am loading an AutoAWQ checkpoint for the model casperhansen/falcon-7b-awq and I get the following warning -
Warning -
This warning is shown when the activation in the MLP block is a GELU, but not for models with ReLU (like opt-awq) or gated activations (like llama2-awq). AutoAWQ handles the GELU case separately by creating a `ScaledActivation` wrapper around the GELU that does the scaling inside the wrapper. This dedicated scaling does not seem to be needed for ReLU or gated activations, because the scaling operation can be fused into the weights of the first linear layer of the MLP (see the sketch below). There doesn't seem to be a convenient option like that for GELU (because of it being non-linear).
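To make the fusion point concrete: ReLU is positively homogeneous (relu(x / s) = relu(x) / s for s > 0), so a per-channel activation scale can be folded into the rows of the preceding linear layer (and its bias), while GELU has no such property. A quick illustrative check:

```python
import torch

s = 0.5              # an illustrative positive per-channel scale
z = torch.randn(8)   # pre-activation values

# ReLU: dividing the input by s equals dividing the output by s,
# so the scale can be absorbed into the preceding weights.
print(torch.allclose(torch.relu(z / s), torch.relu(z) / s))   # True

# GELU: the same folding changes the result, hence the ScaledActivation wrapper.
gelu = torch.nn.GELU()
print(torch.allclose(gelu(z / s), gelu(z) / s))               # False (in general)
```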
**It's possible I am missing some implementation detail, but do you think this scaling needs to be added when loading AutoAWQ checkpoints in `transformers`, by reading the various `transformer.h..mlp.act.scales` values shown in the warning above?**

When reading the same model (casperhansen/falcon-7b-awq) with AutoAWQ like below, we see the activation type is actually `ScaledActivation`. On the other hand, the `scales` inside the `ScaledActivation` are all 1s for this model (effectively making no difference whether the scaling is done or not), so maybe it was a deliberate decision to ignore the scales in `transformers`' AWQ integration? But I guess there is no guarantee that the scales will always be 1s, and we might have to handle this case separately?

Sorry for the long post, and please let me know if this belongs someplace else instead of a GH issue. Thanks!
Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Transformers:
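The original snippet is not reproduced here; a minimal repro along these lines (model id from the report, module path inferred from the warning) should trigger the message:

```python
# Requires transformers >= 4.35 with autoawq installed; this only roughly mirrors the elided snippet.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "casperhansen/falcon-7b-awq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The warning about the unused `transformer.h.<i>.mlp.act.scales` weights is emitted during loading,
# and the activation stays a plain GELU (module path inferred from the warning):
print(type(model.transformer.h[0].mlp.act))
```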
Output:
AutoAWQ:
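Likewise, a sketch of the AutoAWQ-side check described above (AutoAWQ's `from_quantized` API; exact arguments may vary by version):

```python
from awq import AutoAWQForCausalLM

# fuse_layers=False keeps the module structure inspectable (argument names may differ by version).
model = AutoAWQForCausalLM.from_quantized("casperhansen/falcon-7b-awq", fuse_layers=False)

act = model.model.transformer.h[0].mlp.act
print(type(act))    # ScaledActivation, per the report
print(act.scales)   # reported to be all ones for this checkpoint
```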
Output:
Expected behavior
Expect the transformers AWQ integration to also have some form of smoothing/scaling for the second linear layer in the MLP block when GELU activation is used.
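For concreteness, the kind of handling being asked for might look roughly like this on the `transformers` side. This is a hypothetical sketch (the helper name and the import path are assumptions), not the code from the actual fix:

```python
import torch.nn as nn

from awq.modules.act import ScaledActivation  # assumption: the wrapper is importable from AutoAWQ

def scale_mlp_activations(model: nn.Module, state_dict: dict) -> None:
    """Hypothetical helper: wrap every MLP activation whose checkpoint entry ends in `.act.scales`."""
    for key, scales in state_dict.items():
        if not key.endswith(".act.scales"):
            continue
        act_path = key[: -len(".scales")]             # e.g. "transformer.h.0.mlp.act"
        parent_path, _, attr = act_path.rpartition(".")
        parent = model.get_submodule(parent_path)     # e.g. the Falcon MLP module
        setattr(parent, attr, ScaledActivation(getattr(parent, attr), scales))
```

Doing the replacement before the state dict is applied (as AutoAWQ does) would achieve the same effect without needing the explicit `state_dict` argument, since the checkpoint's `...act.scales` values would then load directly into the wrapper.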