vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: Min thread limitation inconsistency for gptq_marlin #6244

Open HandH1998 opened 2 weeks ago

HandH1998 commented 2 weeks ago

Anything you want to discuss about vllm.

For gptq_marlin, min_thread_n=64 and min_thread_k=64 are required in https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/csrc/quantization/gptq_marlin/gptq_marlin.cuh#L22-L23, while min_thread_n=64 and min_thread_k=128 are required in https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/vllm/model_executor/layers/quantization/utils/marlin_utils.py#L21-L22. Why is the limitation different?
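To make the mismatch concrete, here is a minimal sketch (illustrative names, not vllm's actual API) of the kind of divisibility check these constants imply. A weight matrix whose input dimension is divisible by 64 but not by 128 would be accepted under the CUDA kernel's limits yet rejected by the stricter Python-side limit:

```python
# Constants as they appear at the commit linked above (sketch, not the real module).
CUDA_MIN_THREAD_N = 64      # csrc/.../gptq_marlin.cuh
CUDA_MIN_THREAD_K = 64
PYTHON_MIN_THREAD_N = 64    # vllm/.../marlin_utils.py
PYTHON_MIN_THREAD_K = 128


def shape_is_allowed(out_features: int, in_features: int,
                     min_thread_n: int, min_thread_k: int) -> bool:
    """Marlin-style shape check: both dims must tile evenly by the minimums."""
    return (out_features % min_thread_n == 0
            and in_features % min_thread_k == 0)


# in_features=960 is divisible by 64 (64 * 15) but not by 128,
# so the two sets of limits disagree on whether this shape is valid:
print(shape_is_allowed(4096, 960, CUDA_MIN_THREAD_N, CUDA_MIN_THREAD_K))      # True
print(shape_is_allowed(4096, 960, PYTHON_MIN_THREAD_N, PYTHON_MIN_THREAD_K))  # False
```

So the Python check is strictly tighter than the kernel's own requirement, which is the inconsistency being asked about.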

robertgshaw2-neuralmagic commented 2 weeks ago

@alexm-neuralmagic

alexm-neuralmagic commented 2 weeks ago

@HandH1998 min_thread_k == 64 should work OK. I think we simply forgot to change it to 64 in the Python file. A quick way to verify is to change it to 64 and run the test_marlin_gemm.py test suite.