Open ro99 opened 3 days ago
Hi @ro99, thanks for asking. May I ask which prefill chunk size you used? As the error message suggests, the prefill chunk size might be too large. So one possibility is to try
--overrides "prefill_chunk_size=2048;tensor_parallel_shards=4"
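For reference, here is roughly how that override fits into a full serve command (a sketch, not taken from this thread; the local model path is a placeholder):

```
mlc_llm serve ./Qwen2.5-Coder-32B-Instruct-q0f16-MLC \
  --overrides "prefill_chunk_size=2048;tensor_parallel_shards=4"
```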
Hi @MasterJH5574, thank you for checking this. I tried different values for prefill_chunk_size, but no luck; same error.
I will test with other models and let you know how it goes.
❓ General Questions
I am trying to serve a model using 4 GPUs but I keep getting the following error:
The model is Qwen2.5-Coder-32B-Instruct, which is ~61 GB in size. I believe it should fit:
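(As a rough check, using my own arithmetic rather than anything reported in this thread: q0f16 keeps the weights in fp16, so tensor_parallel_shards=4 splits the ~61 GB of weights into roughly 61 / 4 ≈ 15.3 GB per GPU. On top of that, each GPU also needs room for the KV cache and the temporary buffers used during prefill, which is why prefill_chunk_size can push memory usage over the limit even when the weights alone fit.)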
I have tried many different override combinations, but no luck. The last one I tried is the following:
I am running this on Debian 12. The quantization used is q0f16, with the --tensor-parallel-shards 4 option.
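For context, preparing q0f16 weights with 4 tensor-parallel shards in MLC LLM typically looks something like the sketch below (the local paths and the qwen2 conv-template name are assumptions, not copied from this thread):

```
# Convert the HF weights to MLC format; q0f16 means fp16 with no quantization.
mlc_llm convert_weight ./Qwen2.5-Coder-32B-Instruct \
  --quantization q0f16 \
  -o ./Qwen2.5-Coder-32B-Instruct-q0f16-MLC

# Generate the model config, sharding across 4 GPUs.
mlc_llm gen_config ./Qwen2.5-Coder-32B-Instruct \
  --quantization q0f16 \
  --conv-template qwen2 \
  --tensor-parallel-shards 4 \
  -o ./Qwen2.5-Coder-32B-Instruct-q0f16-MLC
```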