vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

Mixtral 8x22B Quantization Failed with 2 Issues #35

Closed qingquansong closed 3 weeks ago

qingquansong commented 1 month ago

Describe the bug

Hey team, I'm trying to quantize Mixtral 8x22B with the W8A8 recipe, and it fails with two different issues on different versions:

1)

File "/home/jobuser/llm-compressor/src/llmcompressor/utils/pytorch/module.py", line 166, in get_layers
    return match_layers_params(targets, module)
  File "/home/jobuser/llm-compressor/src/llmcompressor/utils/pytorch/module.py", line 160, in match_layers_params
    raise ValueError(f"Could not find targets {missed} in module {module}")
ValueError: Could not find targets ['re:.*gate_proj'] in module MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(32768, 6144)
    (layers): ModuleList(
      (0-55): 56 x MixtralDecoderLayer(
        (self_attn): MixtralSdpaAttention(
          (q_proj): Linear(in_features=6144, out_features=6144, bias=False)
          (k_proj): Linear(in_features=6144, out_features=1024, bias=False)
          (v_proj): Linear(in_features=6144, out_features=1024, bias=False)
          (o_proj): Linear(in_features=6144, out_features=6144, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear(in_features=6144, out_features=8, bias=False)
          (experts): ModuleList(
            (0-7): 8 x MixtralBlockSparseTop2MLP(
              (w1): Linear(in_features=6144, out_features=16384, bias=False)
              (w2): Linear(in_features=16384, out_features=6144, bias=False)
              (w3): Linear(in_features=6144, out_features=16384, bias=False)
              (act_fn): SiLU()
            )
          )
        )
        (input_layernorm): MixtralRMSNorm()
        (post_attention_layernorm): MixtralRMSNorm()
      )
    )
    (norm): MixtralRMSNorm()
  )
  (lm_head): Linear(in_features=6144, out_features=32768, bias=False)
)

This issue happens when using the latest main branch, and I think there is a regex issue somewhere; when I used the main branch 1-2 weeks ago I didn't see it. Has anything changed?

I think I can fix it by manually changing the default SmoothQuant mapping to a Mixtral-specific one, but I'm wondering if there is a better solution here, and why this didn't happen before.

2) Even before hitting this issue, there was an OOM error that appears after 3-4 layers. (I set the device map to "auto", which seems to work well for Llama 3 70B but not for Mixtral 8x22B, which is larger.) So CPU offloading or a better block cleanup scheme is probably needed.

Expected behavior: Expect the quantization to finish on a 1-node, 8xA100 setup.

Environment:

  1. OS: Linux Mariner
  2. Python version: 3.10
  3. LLM Compressor version or commit hash: latest main branch (as of end of 2024-07-23)
  4. ML framework version(s): torch 2.3.1 (CUDA 11.8 build)
  5. Other Python package versions: (not provided)
  6. Other relevant environment information: CUDA 11.8

To Reproduce:

from datasets import Dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
tokenized_ids_dataset = Dataset.from_dict(tokenized_ids)  # pre-tokenized calibration samples
oneshot(
    model=model_path,  # Mixtral 8x22B checkpoint
    dataset=tokenized_ids_dataset,  # garage-bAInd/open-platypus, 8 sequences
    recipe=recipe,
    save_compressed=True,
    output_dir=output_model_path,
    oneshot_device="auto",
    overwrite_output_dir=True,
    max_seq_length=model_max_length,
    num_calibration_samples=num_calibration_samples,
)

Errors:

2024-07-24T06:48:58.529779+0000 | intialize_model_from_path | WARNING - Moving /shared/public/models/Mixtral-8x22B-Instruct-v0.1 to device auto for One-Shot
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████| 59/59 [01:50<00:00,  1.87s/it]
2024-07-24T06:51:22.896705+0000 | _check_create_state | INFO - State created for compression lifecycle
2024-07-24T06:51:22.898697+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-07-24T06:51:22.899583+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-07-24T06:51:22.913111+0000 | one_shot | INFO - *** One Shot ***
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
2024-07-24T06:51:35.520914+0000 | from_modifiers | INFO - Creating recipe from modifiers
/home/jobuser/.local/lib/python3.10/site-packages/pydantic/main.py:364: UserWarning: Pydantic serializer warnings:
  Expected `tuple[any, ...]` but got `list` - serialized value may not be as expected
  Expected `tuple[any, ...]` but got `list` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
2024-07-24T06:51:35.524997+0000 | create_instance | WARNING - Could not process input as a file path or zoo stub, attempting to process it as a string.
2024-07-24T06:51:35.600929+0000 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created
Traceback (most recent call last):
  File "/home/jobuser/test_large_quant.py", line 35, in <module>
    oneshot(model=model_path,
  File "/home/jobuser/llm-compressor/src/llmcompressor/transformers/finetune/text_generation.py", line 76, in oneshot
    main(model_args, data_args, training_args)
  File "/home/jobuser/llm-compressor/src/llmcompressor/transformers/finetune/text_generation.py", line 358, in main
    stage_runner.one_shot()
  File "/home/jobuser/llm-compressor/src/llmcompressor/transformers/finetune/runner.py", line 157, in one_shot
    self.trainer.one_shot(calib_data, stage=stage)
  File "/home/jobuser/llm-compressor/src/llmcompressor/transformers/finetune/session_mixin.py", line 399, in one_shot
    apply(
  File "/home/jobuser/llm-compressor/src/llmcompressor/core/session_functions.py", line 184, in apply
    return active_session().apply(
  File "/home/jobuser/llm-compressor/src/llmcompressor/core/session.py", line 210, in apply
    self.initialize(**kwargs)
  File "/home/jobuser/llm-compressor/src/llmcompressor/core/session.py", line 156, in initialize
    mod_data = self._lifecycle.initialize(
  File "/home/jobuser/llm-compressor/src/llmcompressor/core/lifecycle.py", line 126, in initialize
    data = mod.initialize(state=self.state, **extras)
  File "/home/jobuser/llm-compressor/src/llmcompressor/modifiers/stage.py", line 124, in initialize
    modifier.initialize(state, **kwargs)
  File "/home/jobuser/llm-compressor/src/llmcompressor/modifiers/modifier.py", line 118, in initialize
    initialized = self.on_initialize(state=state, **kwargs)
  File "/home/jobuser/llm-compressor/src/llmcompressor/modifiers/smoothquant/base.py", line 127, in on_initialize
    self.resolved_mappings_ = self._resolve_mappings(state.model)
  File "/home/jobuser/llm-compressor/src/llmcompressor/modifiers/smoothquant/base.py", line 184, in _resolve_mappings
    _, balance_layer = get_matching_layer(
  File "/home/jobuser/llm-compressor/src/llmcompressor/utils/pytorch/module.py", line 311, in get_matching_layer
    potential_matches = get_layers(target, module)
  File "/home/jobuser/llm-compressor/src/llmcompressor/utils/pytorch/module.py", line 166, in get_layers
    return match_layers_params(targets, module)
  File "/home/jobuser/llm-compressor/src/llmcompressor/utils/pytorch/module.py", line 160, in match_layers_params
    raise ValueError(f"Could not find targets {missed} in module {module}")
ValueError: Could not find targets ['re:.*gate_proj'] in module MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(32768, 6144)
    (layers): ModuleList(
      (0-55): 56 x MixtralDecoderLayer(
        (self_attn): MixtralSdpaAttention(
          (q_proj): Linear(in_features=6144, out_features=6144, bias=False)
          (k_proj): Linear(in_features=6144, out_features=1024, bias=False)
          (v_proj): Linear(in_features=6144, out_features=1024, bias=False)
          (o_proj): Linear(in_features=6144, out_features=6144, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear(in_features=6144, out_features=8, bias=False)
          (experts): ModuleList(
            (0-7): 8 x MixtralBlockSparseTop2MLP(
              (w1): Linear(in_features=6144, out_features=16384, bias=False)
              (w2): Linear(in_features=16384, out_features=6144, bias=False)
              (w3): Linear(in_features=6144, out_features=16384, bias=False)
              (act_fn): SiLU()
            )
          )
        )
        (input_layernorm): MixtralRMSNorm()
        (post_attention_layernorm): MixtralRMSNorm()
      )
    )
    (norm): MixtralRMSNorm()
  )
  (lm_head): Linear(in_features=6144, out_features=32768, bias=False)
)

There was another OOM issue; I can't find the log, but it happens after about 4 layers.


Satrat commented 1 month ago

Hey @qingquansong, thanks for trying out llm-compressor!

To address your first issue with SmoothQuantModifier: we recently added a default mapping that gets used if none is provided (see https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/modifiers/smoothquant/base.py#L16). You'll need to set the mappings argument in the modifier manually for it to work with Mixtral.
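For example, a minimal sketch of what that could look like for Mixtral (the regexes here are an assumption based on the module names in your traceback, where w1/w3 consume the post_attention_layernorm output):

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Assumed Mixtral-specific mapping: balance q/k/v against input_layernorm and the
# expert up/gate projections (w1/w3) against post_attention_layernorm.
mixtral_mappings = [
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*w1", "re:.*w3"], "re:.*post_attention_layernorm"],
]

smoothquant = SmoothQuantModifier(smoothing_strength=0.8, mappings=mixtral_mappings)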

As for your OOM issue, the device mapping set by device_map="auto" does not take into account the Hessians allocated during GPTQ. To reduce memory usage you can add the sequential_update=True argument to GPTQModifier. This will run GPTQ layer by layer, so only the Hessians for a single transformer layer are kept in memory at a time.

Additionally, we have support for CPU offloading currently in PR: https://github.com/vllm-project/llm-compressor/pull/34. This also adds support for accounting for the GPTQ and quantization memory needs on model load.

qingquansong commented 1 month ago

@Satrat Thank you for the response! One quick question: for the mappings, I see the default one is defined as:

DEFAULT_SMOOTHQUANT_MAPPINGS = [
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
]

How do we decide the grouping of the mappings? I can understand why we put q/k/v together, but I'm confused about why "re:.*input_layernorm" has to go as the second element of the first entry while the others go into a separate entry. Could we just do

[["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm", ["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]

without separating the second group into its own list?

And if using Mixtral as an example, should we do the following, or do we maybe need to separate or merge some of the lists here?

[
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*o_proj"],
    [["re:.*w1", "re:.*w3"], "re:.*w2", "re:.*input_layernorm", "re:.*post_attention_layernorm"]
]

And it seems there's another issue even when I do:

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.smoothquant.base import DEFAULT_SMOOTHQUANT_MAPPINGS

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8, mappings=DEFAULT_SMOOTHQUANT_MAPPINGS),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], sequential_update=True),
]

Traceback (most recent call last):
  File "/home/jobuser/test_large_quant.py", line 43, in <module>
    oneshot(model=model_path,
  File "/home/jobuser/llm-compressor/src/llmcompressor/transformers/finetune/text_generation.py", line 76, in oneshot
    main(model_args, data_args, training_args)
  File "/home/jobuser/llm-compressor/src/llmcompressor/transformers/finetune/text_generation.py", line 358, in main
    stage_runner.one_shot()
  File "/home/jobuser/llm-compressor/src/llmcompressor/transformers/finetune/runner.py", line 157, in one_shot
    self.trainer.one_shot(calib_data, stage=stage)
  File "/home/jobuser/llm-compressor/src/llmcompressor/transformers/finetune/session_mixin.py", line 399, in one_shot
    apply(
  File "/home/jobuser/llm-compressor/src/llmcompressor/core/session_functions.py", line 184, in apply
    return active_session().apply(
  File "/home/jobuser/llm-compressor/src/llmcompressor/core/session.py", line 210, in apply
    self.initialize(**kwargs)
  File "/home/jobuser/llm-compressor/src/llmcompressor/core/session.py", line 156, in initialize
    mod_data = self._lifecycle.initialize(
  File "/home/jobuser/llm-compressor/src/llmcompressor/core/lifecycle.py", line 120, in initialize
    extras = self.recipe_container.update(**extras)
  File "/home/jobuser/llm-compressor/src/llmcompressor/recipe/container.py", line 75, in update
    recipe = Recipe.create_instance(recipe)
  File "/home/jobuser/llm-compressor/src/llmcompressor/recipe/recipe.py", line 114, in create_instance
    return cls.from_modifiers(
  File "/home/jobuser/llm-compressor/src/llmcompressor/recipe/recipe.py", line 71, in from_modifiers
    return cls.create_instance(path_or_modifiers=recipe_string)
  File "/home/jobuser/llm-compressor/src/llmcompressor/recipe/recipe.py", line 126, in create_instance
    obj = _load_json_or_yaml_string(path_or_modifiers)
  File "/home/jobuser/llm-compressor/src/llmcompressor/recipe/recipe.py", line 601, in _load_json_or_yaml_string
    raise ValueError(f"Could not parse recipe from string {content}") from err
ValueError: Could not parse recipe from string DEFAULT_stage:
  DEFAULT_modifiers:
    SmoothQuantModifier:
      index: null
      group: null
      start: -1
      end: -1
      update: null
      initialized_structure_: false
      initialized_: false
      finalized_: false
      started_: false
      ended_: false
      smoothing_strength: 0.8
      mappings:
      - !!python/tuple
        - - re:.*q_proj
          - re:.*k_proj
          - re:.*v_proj
        - re:.*input_layernorm
      - !!python/tuple
        - - re:.*gate_proj
          - re:.*up_proj
        - re:.*post_attention_layernorm
      ignore: null
      num_calibration_steps: null
      calibration_function: null
      hooks_: null
      resolved_mappings_: null
      scales_: null
    GPTQModifier:
      index: null
      group: null
      start: -1
      end: -1
      update: null
      initialized_structure_: false
      initialized_: false
      finalized_: false
      started_: false
      ended_: false
      sequential_update: true
      targets: Linear
      block_size: 128
      quantize: true
      dampening_frac: 0.01
      config_groups: null
      ignore:
      - lm_head
      disable_quantization_observer_epoch: null
      num_calibration_steps: null
      scheme: W8A8
      model: null
      layer_compressors_: null
      compressible_layers_: null
      quantization_modifier_: null

Thank you!

Satrat commented 1 month ago

The mappings should be of the form ([weights_to_balance, ...], activation_to_smooth). For instance, in the default mapping we want to smooth the input activations that feed into q/k/v proj, and those activations come out of input_layernorm. The same goes for the second item in the list: we want to smooth the activations coming into gate/up proj, and they come out of post_attention_layernorm. The diagram on page 4 of https://arxiv.org/pdf/2211.10438 is useful for visualizing this.
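For intuition, here is a minimal sketch of the per-channel smoothing SmoothQuant performs between an activation source and the balanced weights (illustrative code following the formula in the paper, not llm-compressor internals):

import torch

def smoothquant_scales(act_absmax: torch.Tensor, weights: list, alpha: float = 0.8):
    # act_absmax: per-input-channel max |X_j| collected during calibration, shape (in_features,)
    # weights: list of nn.Linear weight tensors to balance, each of shape (out_features, in_features)
    weight_absmax = torch.stack([w.abs().amax(dim=0) for w in weights]).amax(dim=0)
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    scales = act_absmax.pow(alpha) / weight_absmax.pow(1 - alpha)
    balanced = [w * scales for w in weights]  # x @ W.T == (x / s) @ (W * s).T
    # in practice the 1/s factor is folded into the preceding op (e.g. the layernorm weights)
    return scales, balanced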

The Mixtral example suggested would not work, as the mapping must be formatted in groups of two, where activation_to_smooth is a single element and [weights_to_balance] can be a list.
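To illustrate the constraint concretely (regexes only for illustration):

# Each mapping entry must be a two-element pair: ([weights_to_balance, ...], activation_to_smooth)
valid_entry = [["re:.*w1", "re:.*w3"], "re:.*post_attention_layernorm"]

# Not valid: a single entry cannot carry extra elements like this
# bad_entry = [["re:.*w1", "re:.*w3"], "re:.*w2", "re:.*input_layernorm", "re:.*post_attention_layernorm"]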

qingquansong commented 1 month ago

@Satrat Thank you!

1) Since I directly imported DEFAULT_SMOOTHQUANT_MAPPINGS, is there any reason why the default one also doesn't work? (I just wanted to test the mapping format.)

from llmcompressor.modifiers.smoothquant.base import DEFAULT_SMOOTHQUANT_MAPPINGS

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8, mappings=DEFAULT_SMOOTHQUANT_MAPPINGS),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], sequential_update=True),
]

2) The Mixtral structure is as follows, so it seems the ordering is a bit mismatched compared to what the figure shows for applying SmoothQuant?

    (layers): ModuleList(
      (0-55): 56 x MixtralDecoderLayer(
        (self_attn): MixtralSdpaAttention(
          (q_proj): Linear(in_features=6144, out_features=6144, bias=False)
          (k_proj): Linear(in_features=6144, out_features=1024, bias=False)
          (v_proj): Linear(in_features=6144, out_features=1024, bias=False)
          (o_proj): Linear(in_features=6144, out_features=6144, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear(in_features=6144, out_features=8, bias=False)
          (experts): ModuleList(
            (0-7): 8 x MixtralBlockSparseTop2MLP(
              (w1): Linear(in_features=6144, out_features=16384, bias=False)
              (w2): Linear(in_features=16384, out_features=6144, bias=False)
              (w3): Linear(in_features=6144, out_features=16384, bias=False)
              (act_fn): SiLU()
            )
          )
        )
        (input_layernorm): MixtralRMSNorm()
        (post_attention_layernorm): MixtralRMSNorm()
      )
    )
    (norm): MixtralRMSNorm()
  )

Satrat commented 1 month ago

Interesting, this is an issue with the recipe parsing. I was able to make a barebones example that reproduced the issue and filed it as a bug here: https://github.com/vllm-project/llm-compressor/issues/37. To unblock, you can try defining the recipe as a string or YAML file rather than programmatically (explained in the linked issue).
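A rough sketch of that workaround, passing the recipe to oneshot as a YAML string (the stage/section names mirror the recipe format shown later in this thread; variable names are placeholders):

from llmcompressor.transformers import oneshot

# Passing the recipe as a YAML string sidesteps the tuple-serialization
# parsing error above (tracked in issue #37).
recipe_yaml = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.8
      mappings: [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*w1", "re:.*w3"], "re:.*post_attention_layernorm"]
      ]
    GPTQModifier:
      sequential_update: true
      targets: Linear
      scheme: W8A8
      ignore: [lm_head]
"""

oneshot(
    model=model_path,               # placeholder: path to the Mixtral checkpoint
    dataset=tokenized_ids_dataset,  # placeholder: calibration dataset
    recipe=recipe_yaml,
    output_dir=output_model_path,   # placeholder
)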

I also checked with our research team, and running SmoothQuant on Mixtral is not something we have tried before, so unfortunately I don't have a specific recipe to suggest. I think the general flow of matching up each linear layer with the activation before it is a good place to start. There is also no requirement that every activation be smoothed, so for instance you could try only smoothing into q/k/v/o and gate and leaving the experts as-is.

qingquansong commented 1 month ago

Makes sense. Thank you for the quick response! Let me explore a bit; I can share some findings/experience or provide a recipe for Mixtral later if needed.

qingquansong commented 1 month ago

Hey @Satrat, I'm able to make SmoothQuant and GPTQ work for Mixtral 8x7B with the following recipe (shared here as a reference for other users), and I'm waiting for the 8x22B run with sequential_update: True to finish. So far I haven't seen an OOM issue, and memory usage on 1 node with 8 A100 (80GB) GPUs peaks around 50GB per GPU, which is good:

quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.8
      mappings: [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*w1", "re:.*w3"], "re:.*post_attention_layernorm"]
      ]
      ignore: null
    GPTQModifier:
      sequential_update: True
      dampening_frac: 0.01
      block_size: 128
      config_groups:
        group_0:
          targets:
            - "Linear"
          input_activations: null
          output_activations: null
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "tensor"
            group_size: 128

Satrat commented 1 month ago

@qingquansong great! One correction though: SmoothQuant should always come before the GPTQModifier; you won't see an error applying it afterwards, but you won't get any benefit. The intended usage is to run SmoothQuant to "squish" the outliers, then run quantization/GPTQ afterwards on the squashed dynamic range.

qingquansong commented 1 month ago

Oh @Satrat, yeah, you're absolutely right, sorry, I forgot that. Btw, I faced some issues with the Hessian matrix computation being ill-conditioned. I think it's partially due to (1) the small number of samples I used, (2) the dampening_frac, and (3) I think adding act_order would help a lot, based on our previous analysis of GPTQ. I remember there's a PR (branch) for it; do we plan to add it soon? It should be fairly easy.

Satrat commented 1 month ago

Yeah, for an MoE model especially, the amount of calibration data will be an issue. As for activation reordering, it's currently being worked on: @bfineran do we have an estimate on when that PR will be ready?

qingquansong commented 1 month ago

Hey @Satrat, I think there's another issue here. I followed the sample YAML here https://github.com/vllm-project/llm-compressor/blob/29cb10da1b8fd6ef5f2112e980de0cabea62a0c9/src/llmcompressor/modifiers/quantization/gptq/base.py#L60-L61 to define the strategy as "tensor":

            strategy: "tensor"
            group_size: 128

but I guess it should be "group" or "channel"? From the introduction here https://github.com/vllm-project/llm-compressor/blob/29cb10da1b8fd6ef5f2112e980de0cabea62a0c9/examples/quantization_24_sparse_w4a16/README.md#custom-quantization, it seems the strategy should be "group" when providing group_size. But since the W8A8 validation in vLLM requires "tensor" or "channel" to replace layers here: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py#L129-L130, maybe the "group" option isn't supported because the cutlass kernel does not support it yet?

Or maybe we should add an option to accept wNa16 for the first case, where input_quant is None for some layers, so we can keep the strategy as "tensor" for the W8A8 case.

Another weird thing: when I use the previous YAML file to quantize the model, all layers load in vLLM with input_quant as None and only weight_quant set (num_bits=8 type=<QuantizationType.INT: 'int'> symmetric=True group_size=128 strategy=<QuantizationStrategy.TENSOR: 'tensor'> block_structure=None dynamic=False observer='minmax' observer_kwargs={}). Is this because SmoothQuant is not applied correctly?

Changing to "channel", input_quant is still None when loading with vLLM, although all layers pass the check; but when loading weights, some params have params.weight_loader=None, causing the issue here: https://github.com/vllm-project/vllm/blob/v0.5.2/vllm/model_executor/models/mixtral_quant.py#L413

qingquansong commented 1 month ago

I think I know the reason and want to confirm with you (also testing it myself now). Several things to change:

  1. The "group" option cannot be used for W8A8; only "channel" and "tensor" are available, so group_size needs to be removed and the strategy changed to "channel" or "tensor".
  2. input_activations should be changed to 8 bits rather than None.
  3. lm_head needs to be added to ignore in GPTQ (SmoothQuant is fine).
  4. I added "gate" to the mappings, but since it's only for MoE routing, maybe it's better not to add it to protect performance; however, since the hidden states are shared by q/k/v and gate, I'm adding it for now to make sure everything is 8 bits.
  5. Not sure if output_activations needs to be 8 bits, but I don't think so. Llama works; Mixtral is still being tested.

quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.8
      mappings: [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*w1", "re:.*w3"], "re:.*post_attention_layernorm"]
      ]
      ignore: null
    GPTQModifier:
      sequential_update: True
      dampening_frac: 0.01
      block_size: 128
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets:
            - "Linear"
          input_activations: 
            num_bits: 8
          output_activations: null
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "channel"
qingquansong commented 1 month ago

The MoE model can be quantized, but there is still an issue with vLLM 0.5.2 loading it. Since I saw @robertgshaw2-neuralmagic has made some recent changes there, maybe you know the root cause? I tested Llama and the weight loader is:

name model.layers.9.self_attn.o_proj.input_scale
weight_loader <bound method RowParallelLinear.weight_loader of RowParallelLinear(input_features=4096, output_features=4096, bias=False, tp_size=1, reduce_results=True)>

However, for mixtral with the same quantization setup, I got:

name model.layers.23.block_sparse_moe.experts.4.w1.input_scale
weight_loader None

I think it's related to the mixtral_quant.py script in vLLM having some issues with setting the weight loader in this case, so the param still has an attribute named weight_loader but it's None. Even if I set it to the default loader, there are issues after loading during inference, so maybe we should fix it and add the correct weight loader. I'll take a look into it. 🤔

Satrat commented 1 month ago

Hey @qingquansong, thanks for the update! To address your questions:

  1. You should be able to use group quantization for weights, but it isn't recommended for inputs. For input activations we recommend either per-tensor (the default) or dynamic per-token (strategy: token and dynamic: true); see the sketch after this list.
  2. Input activations are optional to quantize; things should still run fine with weight-only quantization.
  3. Yes, for vLLM compatibility you'll need to ignore the lm_head, although lm_head quantization is something we plan on adding for vLLM in the future.
  4. There is a restriction that parameters that are combined in vLLM (q/k/v for instance) must have the same scale and zero point. However, this doesn't have any effect on the SmoothQuant mappings, since they are applied before quantization and the model is exported with the smoothed weights.
  5. There is no need to quantize the output activations, and it is not currently supported in vLLM.
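For the first point, a sketch of what dynamic per-token input activations could look like in a config group (field names follow compressed-tensors' QuantizationArgs; treat the exact values as an illustration, not a validated recipe):

    GPTQModifier:
      sequential_update: true
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "channel"
          input_activations:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "token"
            dynamic: true
          output_activations: null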

For the vLLM side issues I'll pass it over to @robertgshaw2-neuralmagic :)

qingquansong commented 1 month ago

Thanks for the response! Let me take a look. For the vLLM issue, I think I figured out why; I put a quick fix here: https://github.com/vllm-project/vllm/pull/6793. @Satrat @robertgshaw2-neuralmagic please help take a look and see if it makes sense.

qingquansong commented 1 month ago

@Satrat I think there's probably some misunderstanding here. I didn't set "group" for the inputs, I think (please correct me if my setup is wrong)? Only the GPTQ weights currently have group_size set. If I don't set

          input_activations: 
            num_bits: 8

then input_quant will be None when loading with vLLM 0.5.2 even if I use SmoothQuant, causing issues with vLLM loading. Also, the "group" option seems to block vLLM loading in this case, since there's no W8A8 option available for it there. The reason I want to quantize both weights and activations is to use the W8A8 tensor cores for faster inference. I changed to "channel" and removed group_size, which works fine. As for the weight-loading issue, after the fix in the vLLM PR https://github.com/vllm-project/vllm/pull/6793 it seems to load and run, but I'm not sure about accuracy, and Mixtral 8x22B becomes much slower in this case and requires more memory. I'm not sure if it's related to the channel-wise quantization, but ideally it should be much faster since I only have a prefill stage with 1 token generated, right?

(Quick update: for the speed issue, it's because mixtral_quant in vLLM does not have a fused MoE layer for int8 quantization. I'll create one.)

Satrat commented 1 month ago

So any of the quantization options for weights are also available for input_activations and output_activations. You can see all of the options here: https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_args.py.

As for the vLLM side, I didn't realize we didn't have grouped weight support for w8a8 yet. @robertgshaw2-neuralmagic is working on documenting what is and isn't supported, which should hopefully clear up confusion in the future

qingquansong commented 1 month ago

Update from my side: I corrected my setup to remove "re:.*gate" from the SmoothQuant mappings. Since Mixtral 8x22B only has 8 experts per layer and the current cutlass kernel used in vLLM requires the dim to be % 16 == 0 for int8 matmul, the gate has to be left as-is for now. Also, I created a fused MoE with int8 support similar to the original one in vLLM, but I still have to improve its speed.

A related question: since I apply SmoothQuant for [["re:.*w1", "re:.*w3"], "re:.*post_attention_layernorm"], the down projection "re:.*w2" does not get its input activations smoothed. Should I add something for this layer as well, or will it be handled by default? The default mapping for Llama also doesn't cover the down projection (https://github.com/vllm-project/llm-compressor/blob/780256cbe5d2df36693f050bf7e2c23007b70539/src/llmcompressor/modifiers/smoothquant/base.py#L16-L19), so I'm wondering whether the down projection will still multiply a bf16 input activation with its int8 weight. Thank you!

robertgshaw2-neuralmagic commented 1 month ago

@Satrat is this good to close?

qingquansong commented 1 month ago

@Satrat is this good to close?

Hey @robertgshaw2-neuralmagic, I think we can close it now. The only thing left is maybe to look at this PR: https://github.com/vllm-project/vllm/pull/6978, related to the fused W8A8 MoE for the speedup and performance issue. Other than that it's all good. One more thing related to FP8 quantization: since vLLM directly supports dynamic FP8 quantization (without using llm-compressor), is it suggested to use that, or is llm-compressor meant to provide more FP8 quant strategies? I feel we can probably remove the mixtral_quant.py script later, once the above PR is checked in, so all Mixtral variants can use the same script. Thank you!

robertgshaw2-neuralmagic commented 1 month ago

I'm dealing with a few critical issues in v0.5.4. I will review your PR after this.

Even with in-place quantization, having a checkpoint is better since it's half the disk space. So we still want to have quantized checkpoints.