vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

RuntimeError: prob_m = 59136 is not divisible by thread_m = 512 when running inference on Qwen2-72B with marlin24 #54

Closed yzlnew closed 1 month ago

yzlnew commented 1 month ago

Describe the bug: I can't run inference with vLLM on a Qwen2-72B model compressed with 2:4 sparsity and GPTQ. Should I manually lower thread_m to 256 to fix the shape mismatch?

To Reproduce: compress the model with the following recipe:

sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: true
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      sequential_update: true
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "channel"
          targets: ["Linear"]

Then run inference with vLLM 0.5.3.post1.

Errors: RuntimeError: prob_m = 59136 is not divisible by thread_m = 512
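The arithmetic behind this error can be checked directly. The sketch below is not vLLM code; `tile_divides` and `largest_pow2_divisor` are hypothetical helper names used only to show which of the power-of-two tile sizes mentioned in this thread divide 59136 evenly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when prob_m splits evenly into tiles of thread_m.
bool tile_divides(std::int64_t prob_m, std::int64_t thread_m) {
  assert(thread_m > 0);
  return prob_m % thread_m == 0;
}

// Hypothetical helper: largest power of two that divides prob_m.
// Isolates the lowest set bit of a positive integer.
std::int64_t largest_pow2_divisor(std::int64_t prob_m) {
  assert(prob_m > 0);
  return prob_m & -prob_m;
}
```

Since 59136 = 256 * 231 and 231 is odd, `tile_divides(59136, 512)` is false while `tile_divides(59136, 256)` and `tile_divides(59136, 128)` are true, so 256 is the largest power-of-two tile size that fits this shape.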


yzlnew commented 1 month ago

The following change lets the model run, but the output is abnormal:

--- a/csrc/quantization/marlin/sparse/marlin_24_cuda_kernel.cu
+++ b/csrc/quantization/marlin/sparse/marlin_24_cuda_kernel.cu
@@ -903,8 +903,8 @@ void marlin_cuda_2_4(const void* A, const void* B, const void* meta, void* C,
       thread_k = 64;
       thread_m = 256;
     } else {
-      thread_k = 32;
-      thread_m = 512;
+      thread_k = 64;
+      thread_m = 256;
     }
   }

mgoin commented 1 month ago

@alexm-neuralmagic is this a valid change?

alexm-neuralmagic commented 1 month ago

There is this code inside marlin_24_cuda_kernel.cu:

if (thread_k == -1 || thread_m == -1) {
    if (prob_n <= 16) {
      // For small batch sizes, better partitioning is slightly more important
      // than better compute utilization
      thread_k = 128;
      thread_m = 128;
    } else if (prob_n <= 256) {
      thread_k = 64;
      thread_m = 256;
    } else {
      thread_k = 32;
      thread_m = 512;
    }
  }

Force it to use the case thread_k = 64; thread_m = 256;

The prob_n checked in the if conditions is the batch size (or number of sequences). To force that case, change else if (prob_n <= 256) { to else if (true || prob_n <= 256) {
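Instead of hard-coding one branch, the same idea can be written as a fallback: keep the batch-size heuristic, but switch to a smaller tile when the preferred thread_m does not divide prob_m. This is only a sketch under stated assumptions, not the actual vLLM kernel code. `TileConfig` and `pick_tile_config` are hypothetical names, and the sketch checks only prob_m, so any other constraints the real kernel places on the tile shape would still need to be verified. It addresses the shape error only and says nothing about output correctness.

```cpp
#include <cassert>

// Hypothetical container for the two tile parameters discussed above.
struct TileConfig {
  int thread_k;
  int thread_m;
};

// Pick the tile preferred for this batch size (prob_n), then fall back to a
// smaller thread_m if the preferred one does not divide prob_m.
// Returns {-1, -1} when none of the three candidates fits.
TileConfig pick_tile_config(int prob_m, int prob_n) {
  assert(prob_m > 0 && prob_n > 0);

  TileConfig preferred;
  if (prob_n <= 16) {
    preferred = {128, 128};
  } else if (prob_n <= 256) {
    preferred = {64, 256};
  } else {
    preferred = {32, 512};
  }
  if (prob_m % preferred.thread_m == 0) return preferred;

  // Fallback candidates, largest thread_m first.
  const TileConfig fallbacks[] = {{64, 256}, {128, 128}};
  for (const TileConfig& c : fallbacks) {
    if (prob_m % c.thread_m == 0) return c;
  }
  return {-1, -1};
}
```

For the shape in this issue, `pick_tile_config(59136, 512)` yields thread_k = 64 and thread_m = 256, the same case the edit above forces unconditionally.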

yzlnew commented 1 month ago

@alexm-neuralmagic

I changed the code to:

if (thread_k == -1 || thread_m == -1) {
    if (prob_n <= 16) {
      // For small batch sizes, better partitioning is slightly more important
      // than better compute utilization
      thread_k = 128;
      thread_m = 128;
    } else if (true || prob_n <= 256) {
      thread_k = 64;
      thread_m = 256;
    } else {
      thread_k = 32;
      thread_m = 512;
    }
  }

But the predictions are still filled with "!":

    "0": {
        "origin_prompt": "<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: Which of the following represents an accurate statement concerning arthropods?\nA. They possess an exoskeleton composed primarily of peptidoglycan.\nB. They possess an open circulatory system with a dorsal heart.\nC. They are members of a biologically unsuccessful phylum incapable of exploiting diverse habitats and nutrition sources.\nD. They lack paired, jointed appendages.\nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: In a given population, 1 out of every 400 people has a cancer caused by a completely recessive allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is the expected proportion of individuals who carry the b allele but are not expected to develop the cancer?\nA. 1/400\nB. 19/400\nC. 20/400\nD. 38/400\nAnswer: <|im_end|>\n<|im_start|>assistant\nD\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: The presence of homologous structures in two different organisms, such as the humerus in the front limb of a human and a bird, indicates that\nA. the human and bird are polyphyletic species\nB. a human's and bird's evolution is convergent\nC. the human and bird belong to a clade\nD. the human and bird developed by analogy\nAnswer: <|im_end|>\n<|im_start|>assistant\nC\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: According to the pressure-flow model of movement of phloem contents, photosynthate movement from source to sink is driven by\nA. an ATP-dependent pressure-flow pump\nB. a water-pressure potential gradient\nC. transpiration\nD. 
apoplastic diffusion\nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: Which of the following contain DNA sequences required for the segregation of chromosomes in mitosis and meiosis?\nA. Telomeres\nB. Centromeres\nC. Nucleosomes\nD. Spliceosomes\nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: Based on the characteristic population curves that result from plotting population growth of a species, the most effective means of controlling the mosquito population is to\nA. maintain the population at a point corresponding to the midpoint of its logistic curve\nB. opt for zero population control once the K value of the curve has been reached\nC. reduce the carrying capacity cif the environment to lower the K value\nD. increase the mortality rate\nAnswer: <|im_end|>\n<|im_start|>assistant\n",
        "prediction": "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!",
        "gold": "C"
    },

yzlnew commented 1 month ago

Related GPTQ fp16 issues in the Qwen repo: https://github.com/QwenLM/Qwen2/issues/315, https://github.com/QwenLM/Qwen2/issues/381. When fp16 precision overflows, the model outputs a series of exclamation marks.
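One plausible mechanism for this symptom is sketched below, with loudly stated assumptions: that an overflowed fp16 value becomes infinity, that a later operation turns it into NaN, and that token id 0 decodes to "!" in the tokenizer in use. None of this is confirmed by the thread. `clamp_to_fp16_range` is a crude hypothetical model of an fp16 store (it ignores rounding and subnormals), and `greedy_token` is a hypothetical greedy decoding step.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <limits>
#include <vector>

// Largest finite value representable in IEEE 754 half precision.
constexpr float kFp16Max = 65504.0f;

// Crude model of storing a float as fp16: magnitudes above the fp16 maximum
// become infinity. Rounding and subnormals are deliberately ignored.
float clamp_to_fp16_range(float x) {
  if (std::isnan(x)) return x;
  if (std::fabs(x) > kFp16Max) {
    return std::copysign(std::numeric_limits<float>::infinity(), x);
  }
  return x;
}

// Greedy decoding step: index of the largest logit. Every comparison against
// NaN is false, so with all-NaN logits the first index (0) is returned.
std::size_t greedy_token(const std::vector<float>& logits) {
  assert(!logits.empty());
  return static_cast<std::size_t>(std::distance(
      logits.begin(), std::max_element(logits.begin(), logits.end())));
}
```

In this model, `clamp_to_fp16_range(70000.0f) - clamp_to_fp16_range(70000.0f)` is NaN (infinity minus infinity), and a logits vector made entirely of NaN makes `greedy_token` return 0 at every step, which would print the same token repeatedly.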

yzlnew commented 1 month ago

Furthermore, I tested the 0.5B 2:4 sparse GPTQ model on MMLU. With vLLM 0.5.3.post1 it generates normally, but with the version I built myself it generates only exclamation marks.


Sorry, I got it wrong. The stage_finetuning model can generate normally, but the stage_quantization model can't.

robertgshaw2-neuralmagic commented 1 month ago

Closing this issue. For requests related to kernel support, please open the issue in vllm-project/vllm