mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Llama-3.1-70B-Instruct-q3f16_1-MLC model running across two GPUs with tensor_parallel_shards=2 #3004

Open shahizat opened 3 weeks ago

shahizat commented 3 weeks ago

Greetings to all

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. python3 -m mlc_llm serve HF://mlc-ai/Llama-3.1-70B-Instruct-q3f16_1-MLC --overrides "tensor_parallel_shards=2"

Output error: ValueError: The linear dimension 16384 has 409 groups under group size 40. The groups cannot be evenly distributed on 2 GPUs. Possible solutions: reduce the number of GPUs, or use quantization with a smaller group size.

Is it possible to run a 3-bit version of the MLC-LLM model using multiple GPUs?

Thanks in advance!

Hzfengsy commented 3 weeks ago

q3 might not be suitable for tensor_parallel :(

MasterJH5574 commented 2 weeks ago

Hi @shahizat, as the error message suggests, under 3-bit quantization the groups cannot be divided evenly in half, so this case is not supported.
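The constraint behind the error can be sketched as a simple divisibility check. This is a hypothetical reconstruction, not the actual MLC-LLM code: the group size 40 and the dimension 16384 come from the error message above, and the assumption that a 4-bit scheme like q4f16_1 uses 32-element groups is mine.

```python
# Hedged sketch (not MLC-LLM source) of the sharding divisibility check.
# q3f16_1 packs weights into quantization groups of size 40; tensor
# parallelism splits each linear dimension across GPUs, which only works
# when the group count divides evenly by the number of shards.

def groups_per_dim(linear_dim: int, group_size: int) -> int:
    """Number of full quantization groups along a linear dimension."""
    return linear_dim // group_size

def can_shard(linear_dim: int, group_size: int, num_shards: int) -> bool:
    """True if the groups split evenly across num_shards GPUs."""
    return groups_per_dim(linear_dim, group_size) % num_shards == 0

# Numbers from the reported error: dim 16384, group size 40, 2 GPUs.
print(groups_per_dim(16384, 40))   # 409 groups (odd)
print(can_shard(16384, 40, 2))     # False -> the ValueError above

# A smaller, power-of-two group size (e.g. 32) divides evenly instead:
print(can_shard(16384, 32, 2))     # True
```

This matches the error message's suggested workarounds: either reduce the GPU count to one or pick a quantization whose group size divides the linear dimensions evenly.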