yzlnew closed this issue 1 month ago
The following change lets the model run, but the output is abnormal:
--- a/csrc/quantization/marlin/sparse/marlin_24_cuda_kernel.cu
+++ b/csrc/quantization/marlin/sparse/marlin_24_cuda_kernel.cu
@@ -903,8 +903,8 @@ void marlin_cuda_2_4(const void* A, const void* B, const void* meta, void* C,
thread_k = 64;
thread_m = 256;
} else {
- thread_k = 32;
- thread_m = 512;
+ thread_k = 64;
+ thread_m = 256;
}
}
@alexm-neuralmagic is this a valid change?
There is this code inside marlin_24_cuda_kernel.cu:
if (thread_k == -1 || thread_m == -1) {
  if (prob_n <= 16) {
    // For small batch sizes, better partitioning is slightly more important
    // than better compute utilization
    thread_k = 128;
    thread_m = 128;
  } else if (prob_n <= 256) {
    thread_k = 64;
    thread_m = 256;
  } else {
    thread_k = 32;
    thread_m = 512;
  }
}
Force it to use this case:
thread_k = 64; thread_m = 256;
The prob_n checked in the if conditions is the batch size (the number of sequences). To make every batch size above 16 take this branch, change else if (prob_n <= 256) {
to else if (true || prob_n <= 256) {
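As a minimal sketch of what that edit does to branch selection, the heuristic can be pulled out into a standalone function. `TileConfig`, `select_tiles`, and the `force_mid` flag are hypothetical names introduced only for illustration (the flag stands in for the `true ||` edit); they are not part of the vLLM source.

```cpp
// Hypothetical helper type for illustration; not part of the vLLM source.
struct TileConfig {
  int thread_k;
  int thread_m;
};

// Mirrors the three branches quoted in this thread. `force_mid` is a
// hypothetical switch that plays the role of the `true ||` edit.
TileConfig select_tiles(int prob_n, bool force_mid) {
  if (prob_n <= 16) {
    // Small batches are checked first, so they are unaffected by the edit.
    return {128, 128};
  } else if (force_mid || prob_n <= 256) {
    return {64, 256};
  }
  // Only reachable when force_mid is false and prob_n > 256.
  return {32, 512};
}
```

Under these assumptions, `select_tiles(512, true)` yields `thread_k = 64, thread_m = 256`, while `select_tiles(8, true)` still takes the 128/128 branch because the `prob_n <= 16` test runs first.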
@alexm-neuralmagic
I changed the code to:
if (thread_k == -1 || thread_m == -1) {
  if (prob_n <= 16) {
    // For small batch sizes, better partitioning is slightly more important
    // than better compute utilization
    thread_k = 128;
    thread_m = 128;
  } else if (true || prob_n <= 256) {
    thread_k = 64;
    thread_m = 256;
  } else {
    thread_k = 32;
    thread_m = 512;
  }
}
But the predictions are still filled with "!":
"0": {
"origin_prompt": "<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: Which of the following represents an accurate statement concerning arthropods?\nA. They possess an exoskeleton composed primarily of peptidoglycan.\nB. They possess an open circulatory system with a dorsal heart.\nC. They are members of a biologically unsuccessful phylum incapable of exploiting diverse habitats and nutrition sources.\nD. They lack paired, jointed appendages.\nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: In a given population, 1 out of every 400 people has a cancer caused by a completely recessive allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is the expected proportion of individuals who carry the b allele but are not expected to develop the cancer?\nA. 1/400\nB. 19/400\nC. 20/400\nD. 38/400\nAnswer: <|im_end|>\n<|im_start|>assistant\nD\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: The presence of homologous structures in two different organisms, such as the humerus in the front limb of a human and a bird, indicates that\nA. the human and bird are polyphyletic species\nB. a human's and bird's evolution is convergent\nC. the human and bird belong to a clade\nD. the human and bird developed by analogy\nAnswer: <|im_end|>\n<|im_start|>assistant\nC\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: According to the pressure-flow model of movement of phloem contents, photosynthate movement from source to sink is driven by\nA. an ATP-dependent pressure-flow pump\nB. a water-pressure potential gradient\nC. transpiration\nD. 
apoplastic diffusion\nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: Which of the following contain DNA sequences required for the segregation of chromosomes in mitosis and meiosis?\nA. Telomeres\nB. Centromeres\nC. Nucleosomes\nD. Spliceosomes\nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: Based on the characteristic population curves that result from plotting population growth of a species, the most effective means of controlling the mosquito population is to\nA. maintain the population at a point corresponding to the midpoint of its logistic curve\nB. opt for zero population control once the K value of the curve has been reached\nC. reduce the carrying capacity cif the environment to lower the K value\nD. increase the mortality rate\nAnswer: <|im_end|>\n<|im_start|>assistant\n",
"prediction": "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!",
"gold": "C"
},
Related GPTQ fp16 issues in the Qwen repo: https://github.com/QwenLM/Qwen2/issues/315, https://github.com/QwenLM/Qwen2/issues/381. When fp16 precision overflows, the model outputs a series of exclamation marks.
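One possible mechanism, offered as an assumption rather than something this thread confirms: an overflowed fp16 value becomes infinity, infinities of opposite sign combine into NaN, and a greedy pick over all-NaN logits falls back to index 0. That "!" is the token at index 0 in the Qwen2 vocabulary is also an assumption here. The sketch below models only the overflow step in plain `float`; `saturate_to_inf` and `greedy_token` are hypothetical helpers, not vLLM functions.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest finite fp16 value.
constexpr float kFp16Max = 65504.0f;

// Crude overflow model: anything beyond the fp16 range becomes +/-infinity.
// (Real fp16 rounding keeps values slightly above 65504 finite.)
float saturate_to_inf(float x) {
  if (std::fabs(x) > kFp16Max) {
    return std::copysign(std::numeric_limits<float>::infinity(), x);
  }
  return x;
}

// Greedy selection written as an explicit loop. Every comparison against
// NaN is false, so all-NaN logits leave `best` at index 0.
std::size_t greedy_token(const std::vector<float>& logits) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < logits.size(); ++i) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}
```

For example, `saturate_to_inf(70000.0f) + saturate_to_inf(-70000.0f)` is `inf + (-inf)`, which is NaN; a logits vector filled with that value makes `greedy_token` return 0 at every step, which would repeat the same token.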
I also tested the 0.5B 2:4 sparse GPTQ model on MMLU. With vLLM 0.5.3.post1 it generates normally, but the version I built myself generates only exclamation marks.
Sorry, I got it wrong. The stage_finetuning model generates normally, but the stage_quantization model does not.
Closing this issue. For requests related to kernel support, please open an issue in vllm-project/vllm.
Describe the bug
Cannot run inference on a Qwen2-72B 2:4 sparse GPTQ model using vLLM. Should I manually lower thread_m to 256 to fix the shape mismatch?
To Reproduce
Exact steps to reproduce the behavior:
and run inference using vLLM 0.5.3.post1.
Errors
RuntimeError: prob_m = 59136 is not divisible by thread_m = 512
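The arithmetic behind this message is simple to check: 59136 = 512 × 115 + 256, so 512 leaves a remainder, while 256 and 128 both divide evenly. The sketch below tests only the prob_m constraint named in the error; any other constraints the kernel enforces are outside its scope. `fitting_thread_m` is a hypothetical helper, and the candidate list is just the three thread_m values from the branches quoted in this thread.

```cpp
#include <vector>

// Returns the candidate thread_m values that evenly divide prob_m.
// Candidates are the three values from the quoted heuristic.
std::vector<int> fitting_thread_m(int prob_m) {
  std::vector<int> fits;
  for (int thread_m : {128, 256, 512}) {
    if (prob_m % thread_m == 0) fits.push_back(thread_m);
  }
  return fits;
}

// Compile-time check of the arithmetic for the reported shape.
static_assert(59136 % 512 == 256, "512 leaves a remainder of 256");
static_assert(59136 % 256 == 0, "256 divides 59136 (231 tiles)");
static_assert(59136 % 128 == 0, "128 divides 59136 (462 tiles)");
```

For prob_m = 59136 the helper returns 128 and 256, which is consistent with the question of lowering thread_m to 256 at least as far as the divisibility check is concerned.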