Open HandH1998 opened 2 weeks ago
For gptq_marlin, min_thread_n=64 min_thread_k=64 is required in https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/csrc/quantization/gptq_marlin/gptq_marlin.cuh#L22-L23, while min_thread_n=64 min_thread_k=128 is required in https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/vllm/model_executor/layers/quantization/utils/marlin_utils.py#L21-L22. Why is the limitation different?
@alexm-neuralmagic
@HandH1998 min_thread_k == 64 should work ok. I think we just forgot to change it to 64 in the python file. A quick way to check is to change it to 64 and run test_marlin_gemm.py test suite.
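For context, these minimums act as divisibility constraints on the GEMM problem shape that the Python layer checks before dispatching to the kernel. Below is a minimal sketch of that kind of check (the constant and helper names are illustrative, not vLLM's actual API), showing why loosening the Python-side min_thread_k from 128 to 64 admits more shapes:

```python
# Illustrative sketch (not vLLM's actual code): the min_thread_* values
# act as divisibility requirements on the weight shape (n, k).
MIN_THREAD_N = 64


def shape_supported(n: int, k: int, min_thread_k: int) -> bool:
    """Return True if an (n, k) weight shape satisfies the tiling minimums."""
    return n % MIN_THREAD_N == 0 and k % min_thread_k == 0


# With the stricter Python-side value (128), k=64 is rejected...
print(shape_supported(256, 64, min_thread_k=128))  # False
# ...while the kernel's own minimum (64) would accept it.
print(shape_supported(256, 64, min_thread_k=64))   # True
```

As suggested above, the real verification is to change the value in marlin_utils.py and run the test_marlin_gemm.py suite.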