TGI included marlin kernel is missing padding code (REOPEN)

System Info

TGI version

tgi-2.3.1 docker image

OS version

torch install path ............... ['/home/chatgpt/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.3.1+cu121
deepspeed install path ........... ['/home/chatgpt/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.3, cuda 12.1
shared memory (/dev/shm) size .... 1007.76 GB

GPU info

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:21:00.0 Off |                    0 |
| N/A   31C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off | 00000000:27:00.0 Off |                    0 |
| N/A   33C    P0              59W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off | 00000000:51:00.0 Off |                    0 |
| N/A   32C    P0              57W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off | 00000000:56:00.0 Off |                    0 |
| N/A   30C    P0              58W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          Off | 00000000:8E:00.0 Off |                    0 |
| N/A   30C    P0              58W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          Off | 00000000:93:00.0 Off |                    0 |
| N/A   32C    P0              56W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          Off | 00000000:CA:00.0 Off |                    0 |
| N/A   33C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          Off | 00000000:D0:00.0 Off |                    0 |
| N/A   31C    P0              56W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Model being used

Deepseek-coder-V2-instruct-GPTQ quantized with GPTQModel https://github.com/ModelCloud/GPTQModel

quant script

import torch
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

pretrained_model_id = "/var/mntpkg/deepseek-coder-v2-instruct"
quantized_model_id = "deepseek-coder-v2-instruct-gptq"

# pretrained_model_id = "/var/mntpkg/deepseek-llm-7b"
# quantized_model_id = "deepseek-coder-gptq-test"

# os.makedirs(quantized_model_dir, exist_ok=True)
# def get_calibdataset(tokenizer, n_samples):

#     jsonl_file_path = "openherms.parquet"
#     ds = load_dataset('parquet', data_files={'train': jsonl_file_path}, split='train')
#     samples = []
#     for sample in ds.select(range(n_samples)):
#         responses = [f'{response["role"]}: {response["content"]}' for response in sample["chosen"]]
#         samples.append("\n".join(responses))

#     examples = [tokenizer.apply_chat_template(batch, tokenize=False) for batch in samples]
#     examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
#     return examples

def get_calibdataset(tokenizer, n_samples):

    jsonl_file_path = "openherms.parquet"
    ds = load_dataset('parquet', data_files={'train': jsonl_file_path}, split='train')
    samples = []
    for sample in ds.select(range(n_samples)):
        messsage = []
        for response in sample["chosen"]:
            if response["role"] == 'user':
                messsage.append({
                    "role": "user",
                    "content": response["content"]
                })
            if response["role"] == 'assistant':
                messsage.append({
                    "role": "assistant",
                    "content": response["content"]
                })
        samples.append(messsage)

        # responses = [f'"role": {response["role"]}: {response["content"]}' for response in sample["chosen"]]
        # samples.append("\n".join(responses))

    # Apply chat template without tokenization
    examples = [tokenizer.apply_chat_template(batch,  add_generation_prompt=True, tokenize=False) for batch in samples]

    # Tokenize each example and return as a list of dictionaries
    tokenized_examples = [tokenizer(example) for example in examples]

    return tokenized_examples

@torch.no_grad()
def calculate_avg_ppl(model, tokenizer):
    from gptqmodel.utils import Perplexity

    ppl = Perplexity(
        model=model,
        tokenizer=tokenizer,
        dataset_path="wikitext",
        dataset_name="wikitext-2-raw-v1",
        split="train",
        text_column="text",
    )

    all = ppl.calculate(n_ctx=512, n_batch=512)

    # average ppl
    avg = sum(all) / len(all)

    return avg

def main():
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id, use_fast=True)

    traindataset = get_calibdataset(tokenizer, n_samples=512)

    quantize_config = QuantizeConfig(
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
    )

    # load un-quantized model, the model will always be force loaded into cpu
    model = GPTQModel.from_pretrained(pretrained_model_id, quantize_config, trust_remote_code=True)

    # quantize model, the calibration_dataset should be list of dict whose keys can only be "input_ids" and "attention_mask"
    # with value under torch.LongTensor type.
    model.quantize(traindataset)

    # save quantized model
    model.save_quantized(quantized_model_id)

    # save quantized model using safetensors
    model.save_quantized(quantized_model_id, use_safetensors=True)

    # load quantized model, currently only support cpu or single gpu
    model = GPTQModel.from_quantized(quantized_model_id, device="cuda:0")

    # inference with model.generate
    print(tokenizer.decode(model.generate(**tokenizer("test is", return_tensors="pt").to("cuda:0"))[0]))

    # print(f"Quantized Model {quantized_model_id} avg PPL is {calculate_avg_ppl(model, tokenizer)}")

if __name__ == "__main__":
    import logging

    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        level=logging.INFO,
        datefmt="%Y-%m-%d %H:%M:%S",
    )

    main()

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

docker run script

#!/bin/bash

external_port=9190
num_shard=8

model_path=deepseek-coder-v2-instruct-gptq

model_name=$(basename $model_path)

sudo docker run -d \
--gpus '"device=all"' \
--shm-size 1g \
--name $model_name \
-p ${external_port}:80 -v $model_path:/data/CmwCoder \
-e WEIGHTS_CACHE_OVERRIDE="/data/CmwCoder" \
tgi:2.3.1 \
--weights-cache-override="/data/CmwCoder" \
--model-id "/data/CmwCoder" --num-shard $num_shard \
--q gptq \
--max-input-length 14000 \
--max-total-tokens 16000 \
--max-batch-prefill-tokens 14000

error message

config = DeepseekV2Config {                                 │ │
[rank7]: │ │                         "_name_or_path":                                 │ │
[rank7]: │ │                       "/var/mntpkg/deepseek-coder-v2-instruct",          │ │
[rank7]: │ │                         "architectures": [                               │ │
[rank7]: │ │                       │   "DeepseekV2ForCausalLM"                        │ │
[rank7]: │ │                         ],                                               │ │
[rank7]: │ │                         "attention_bias": false,                         │ │
[rank7]: │ │                         "attention_dropout": 0.0,                        │ │
[rank7]: │ │                         "auto_map": {                                    │ │
[rank7]: │ │                       │   "AutoConfig":                                  │ │
[rank7]: │ │                       "configuration_deepseek.DeepseekV2Config",         │ │
[rank7]: │ │                       │   "AutoModel":                                   │ │
[rank7]: │ │                       "modeling_deepseek.DeepseekV2Model",               │ │
[rank7]: │ │                       │   "AutoModelForCausalLM":                        │ │
[rank7]: │ │                       "modeling_deepseek.DeepseekV2ForCausalLM"          │ │
[rank7]: │ │                         },                                               │ │
[rank7]: │ │                         "aux_loss_alpha": 0.001,                         │ │
[rank7]: │ │                         "bos_token_id": 100000,                          │ │
[rank7]: │ │                         "eos_token_id": 100001,                          │ │
[rank7]: │ │                         "ep_size": 1,                                    │ │
[rank7]: │ │                         "first_k_dense_replace": 1,                      │ │
[rank7]: │ │                         "hidden_act": "silu",                            │ │
[rank7]: │ │                         "hidden_size": 5120,                             │ │
[rank7]: │ │                         "initializer_range": 0.02,                       │ │
[rank7]: │ │                         "intermediate_size": 12288,                      │ │
[rank7]: │ │                         "kv_lora_rank": 512,                             │ │
[rank7]: │ │                         "max_position_embeddings": 163840,               │ │
[rank7]: │ │                         "moe_intermediate_size": 1536,                   │ │
[rank7]: │ │                         "moe_layer_freq": 1,                             │ │
[rank7]: │ │                         "n_group": 8,                                    │ │
[rank7]: │ │                         "n_routed_experts": 160,                         │ │
[rank7]: │ │                         "n_shared_experts": 2,                           │ │
[rank7]: │ │                         "norm_topk_prob": false,                         │ │
[rank7]: │ │                         "num_attention_heads": 128,                      │ │
[rank7]: │ │                         "num_experts_per_tok": 6,                        │ │
[rank7]: │ │                         "num_hidden_layers": 60,                         │ │
[rank7]: │ │                         "num_key_value_heads": 128,                      │ │
[rank7]: │ │                         "pretraining_tp": 1,                             │ │
[rank7]: │ │                         "q_lora_rank": 1536,                             │ │
[rank7]: │ │                         "qk_nope_head_dim": 128,                         │ │
[rank7]: │ │                         "qk_rope_head_dim": 64,                          │ │
[rank7]: │ │                         "quantization_config": {                         │ │
[rank7]: │ │                       │   "bits": 4,                                     │ │
[rank7]: │ │                       │   "checkpoint_format": "gptq",                   │ │
[rank7]: │ │                       │   "damp_percent": 0.005,                         │ │
[rank7]: │ │                       │   "desc_act": true,                              │ │
[rank7]: │ │                       │   "dynamic_bits": null,                          │ │
[rank7]: │ │                       │   "group_size": 128,                             │ │
[rank7]: │ │                       │   "lm_head": false,                              │ │
[rank7]: │ │                       │   "meta": {                                      │ │
[rank7]: │ │                       │     "quantizer": "gptqmodel:0.9.10-dev0"         │ │
[rank7]: │ │                       │   },                                             │ │
[rank7]: │ │                       │   "model_file_base_name": null,                  │ │
[rank7]: │ │                       │   "model_name_or_path": null,                    │ │
[rank7]: │ │                       │   "quant_method": "gptq",                        │ │
[rank7]: │ │                       │   "static_groups": false,                        │ │
[rank7]: │ │                       │   "sym": true,                                   │ │
[rank7]: │ │                       │   "true_sequential": true                        │ │
[rank7]: │ │                         },                                               │ │
[rank7]: │ │                         "quantize": "gptq",                              │ │
[rank7]: │ │                         "rms_norm_eps": 1e-06,                           │ │
[rank7]: │ │                         "rope_scaling": {                                │ │
[rank7]: │ │                       │   "beta_fast": 32,                               │ │
[rank7]: │ │                       │   "beta_slow": 1,                                │ │
[rank7]: │ │                       │   "factor": 40,                                  │ │
[rank7]: │ │                       │   "mscale": 1.0,                                 │ │
[rank7]: │ │                       │   "mscale_all_dim": 1.0,                         │ │
[rank7]: │ │                       │   "original_max_position_embeddings": 4096,      │ │
[rank7]: │ │                       │   "type": "yarn"                                 │ │
[rank7]: │ │                         },                                               │ │
[rank7]: │ │                         "rope_theta": 10000,                             │ │
[rank7]: │ │                         "routed_scaling_factor": 16.0,                   │ │
[rank7]: │ │                         "scoring_func": "softmax",                       │ │
[rank7]: │ │                         "seq_aux": true,                                 │ │
[rank7]: │ │                         "speculator": null,                              │ │
[rank7]: │ │                         "tie_word_embeddings": false,                    │ │
[rank7]: │ │                         "topk_group": 3,                                 │ │
[rank7]: │ │                         "topk_method": "group_limited_greedy",           │ │
[rank7]: │ │                         "torch_dtype": "bfloat16",                       │ │
[rank7]: │ │                         "transformers_version": "4.45.0",                │ │
[rank7]: │ │                         "use_cache": true,                               │ │
[rank7]: │ │                         "v_head_dim": 128,                               │ │
[rank7]: │ │                         "vocab_size": 102400                             │ │
[rank7]: │ │                       }                                                  │ │
[rank7]: │ │        config_class = <class                                             │ │
[rank7]: │ │                       'text_generation_server.models.custom_modeling.fl… │ │
[rank7]: │ │       default_dtype = torch.bfloat16                                     │ │
[rank7]: │ │              device = device(type='cuda', index=7)                       │ │
[rank7]: │ │               dtype = torch.float16                                      │ │
[rank7]: │ │           filenames = [PosixPath('/data/CmwCoder/model.safetensors')]    │ │
[rank7]: │ │   generation_config = GenerationConfig {                                 │ │
[rank7]: │ │                         "bos_token_id": 100000,                          │ │
[rank7]: │ │                         "do_sample": true,                               │ │
[rank7]: │ │                         "eos_token_id": 100001,                          │ │
[rank7]: │ │                         "temperature": 0.3,                              │ │
[rank7]: │ │                         "top_p": 0.95                                    │ │
[rank7]: │ │                       }                                                  │ │
[rank7]: │ │           head_size = 192                                                │ │
[rank7]: │ │      kv_cache_dtype = torch.float16                                      │ │
[rank7]: │ │    lora_adapter_ids = []                                                 │ │
[rank7]: │ │         model_class = <class                                             │ │
[rank7]: │ │                       'text_generation_server.models.custom_modeling.fl… │ │
[rank7]: │ │            model_id = '/data/CmwCoder'                                   │ │
[rank7]: │ │        num_kv_heads = None                                               │ │
[rank7]: │ │              prefix = ''                                                 │ │
[rank7]: │ │            quantize = 'gptq'                                             │ │
[rank7]: │ │                rank = 7                                                  │ │
[rank7]: │ │            revision = None                                               │ │
[rank7]: │ │                self = <text_generation_server.models.flash_causal_lm.Fl… │ │
[rank7]: │ │                       object at 0x7f0558d4b110>      

File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 562, in <listcomp>
    DeepseekV2Layer(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 496, in __init__
    self.mlp = DeepseekV2MoE(f"{prefix}.mlp", config, moe_layer_cls, weights)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 431, in __init__
    self.moe_layer = moe_layer_cls(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/moe/__init__.py", line 231, in __init__
    self.moe = cls(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/moe/gptq_marlin.py", line 103, in __init__
    self.down_proj = _load_expert_weights_row(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/moe/gptq_marlin.py", line 168, in _load_expert_weights_row
    weight = weights.get_weights_row(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/weights.py", line 391, in get_weights_row
    return self.weights_loader.get_weights_row(self, prefix)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/marlin/gptq.py", line 220, in get_weights_row
    return repack_gptq_for_marlin(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/marlin/gptq.py", line 321, in repack_gptq_for_marlin
    raise ValueError(
ValueError: Number of input features (192) not divisible by group size (128)

Expected behavior

Now TGI has supported for GPTQ-quantized MoE models using MoE Marlin. I still encounter some problems when I tried to deploy DeepSeekV2-gptq.

The author of GPTQModel said

Qubitium commented https://github.com/ModelCloud/GPTQModel/issues/328#issuecomment-2408339273 There is nothing wrong the quant, the TGI included marlin kernel is missing padding code, we fixed this GTPQModel, that allows it to run with models that have features not perfectly divisible by 128. TGI needs to fix it on their end or you can use our inference code.

huggingface / text-generation-inference

TGI included marlin kernel is missing padding code (REOPEN) #2662