Hi, this feature is not yet supported on XPU; we will see if we can support it.
+1 to this; I was just thinking about it earlier today. I went to set it up and realised that it's not supported on the XPU backend. It would massively speed up model performance. Thanks!
This feature would be very useful for my applications. IPEX serving Qwen32B-int8 is too slow to use on 4 Arc A770 cards. I hope I can follow your updates, test it, and use it.
What is your metric for 'too slow'? I run Qwen32B on 2 Arc A770s and get around 22 t/s on text generation, and very fast inference speeds (hundreds of tokens per second) with a 10240-token context window.
4 Arc A770s: 2.4 tokens/s with Qwen32B-int8 (PLX 8756 switch, PCIe 3.0 x16).
My target is Qwen72B-int4, but I cannot run it with the ipex-serve Docker image.
It stops during the int4 conversion every time. 😭
Yes, 2 Arc A770s are very fast.
Honestly, I would try to get the AWQ variant of it, then use the load-in-low-bit type asym_int4. It's been working very well for me :)
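For reference, the loading path I mean is roughly the sketch below. The model path is a placeholder for whatever AWQ build you use, and I haven't verified this exact snippet; it just follows the usual ipex-llm `load_in_low_bit` pattern:

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

MODEL_PATH = "path/to/your-awq-model"  # placeholder: the AWQ variant you downloaded

# load_in_low_bit="asym_int4" is the low-bit type mentioned above;
# trust_remote_code is typically needed for Qwen-family checkpoints.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    load_in_low_bit="asym_int4",
    trust_remote_code=True,
).to("xpu")  # move to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("xpu")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```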
When I set up speculative decoding via the IPEX vLLM Docker container, it shows me this:
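For reference, what I'm attempting is roughly the sketch below. The model names are placeholders, and it assumes vLLM's documented `speculative_model` / `num_speculative_tokens` options (the exact flags in the version shipped in the IPEX container may differ); whether the XPU backend accepts them at all is exactly the question:

```python
from vllm import LLM, SamplingParams

# Rough sketch of the attempted setup; model names are placeholders and
# XPU support for these speculative-decoding options is the open question.
llm = LLM(
    model="Qwen/Qwen1.5-32B-Chat",               # placeholder target model
    speculative_model="Qwen/Qwen1.5-0.5B-Chat",  # placeholder small draft model
    num_speculative_tokens=5,                    # draft tokens proposed per step
    tensor_parallel_size=4,                      # e.g. spread across 4 Arc A770s
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```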